#11 K-Means Clustering

K-Means is an unsupervised machine learning algorithm used to group similar data points into clusters.
It divides data into K different clusters based on similarity.

K-Means helps to:

group similar data points
identify hidden patterns
perform customer segmentation
organize unlabeled data

What is Clustering?

Clustering means grouping similar data points together, without using any predefined labels.

Examples:

customers with similar shopping habits
similar documents
similar images

What is K in K-Means?

K is the number of clusters. For example, if K = 3, the algorithm forms 3 clusters.

How K-Means Works

Step 1 — Choose K

Select the number of clusters, for example K = 3.

Step 2 — Initialize Centroids

Randomly choose K cluster centres. These are called centroids.

Step 3 — Calculate Distance

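Compute the distance from every point to every centroid, then assign each point to its nearest centroid. The usual distance is the Euclidean distance, shown here for two points (x_1, y_1) and (x_2, y_2):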

$$\mathbf{d=\sqrt{(x_2-x_1)^2+(y_2-y_1)^2}}$$
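As a minimal sketch of the distance formula, the helper below computes the Euclidean distance between two points. The function name `euclidean` is only an illustrative choice, and it accepts points with any number of coordinates, not just two.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Distance between (1, 2) and (8, 9): sqrt(49 + 49) = sqrt(98)
print(euclidean((1, 2), (8, 9)))
```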

Step 4 — Update Centroids

New centroid formula, where n is the number of points in the cluster and x_i are those points:

$$\mathbf{C=\frac{1}{n}\sum_{i=1}^{n}x_i}$$

The centroid becomes the mean of all points in the cluster.
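The update rule can be sketched in a few lines. The helper name `centroid` is hypothetical, and the sketch assumes the cluster is not empty.

```python
def centroid(points):
    """Mean of a non-empty list of points, computed coordinate by coordinate."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# Centroid of a cluster containing the points (1, 2) and (2, 3)
print(centroid([(1, 2), (2, 3)]))  # (1.5, 2.5)
```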

Step 5 — Repeat

Repeat Steps 3 and 4 (assign points, then update centroids) until the centroids stop changing.
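The five steps above can be put together in a short NumPy sketch. This is one possible from-scratch implementation, not a reference one: the function name `kmeans`, the `max_iters` and `seed` parameters, and the choice to keep an empty cluster's old centroid are all assumptions made for illustration.

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Plain K-Means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: distance from every point to every centroid, shape (n, k),
        # then assign each point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: each centroid moves to the mean of its assigned points.
        # An empty cluster keeps its old centroid (an assumed convention).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

points = np.array([[1, 2], [2, 3], [8, 9], [9, 10]], dtype=float)
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

On these four sample points the centroids settle at (1.5, 2.5) and (8.5, 9.5). Which of the two clusters gets label 0 depends on the random initialization.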

Example Dataset

Point	X	Y
A	1	2
B	2	3
C	8	9
D	9	10

Suppose we choose K = 2, so there will be 2 clusters.

Step 1 — Initialize Centroids

Choose 2 initial centroids. Suppose C1 = (1,2) and C2 = (8,9), which here coincide with points A and C.

Step 2 — Calculate Distance of Each Point

We will use the Euclidean distance.

Distance from Point A(1,2)

Distance from C1

$$\mathbf{d(A,C_1)=\sqrt{(1-1)^2+(2-2)^2}} = \mathbf{0}$$

Distance from C2

$$\mathbf{d(A,C_2)=\sqrt{(8-1)^2+(9-2)^2}} $$

$$=\sqrt{49+49}=\sqrt{98}\approx9.90$$

Nearest Centroid:

$$\mathbf{A \rightarrow C_1}$$

Calculate the distances for the other points in the same way and form the clusters:

Point	d(C₁)	d(C₂)	Nearest Centroid	Cluster
B(2,3)	√2 ≈ 1.41	√72 ≈ 8.49	C₁	Cluster 1
C(8,9)	√98 ≈ 9.90	0	C₂	Cluster 2
D(9,10)	√128 ≈ 11.31	√2 ≈ 1.41	C₂	Cluster 2
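The assignment step for this dataset can be reproduced with a short sketch. The dictionary names `points`, `centroids` and `assignments` are illustrative, and `math.dist` requires Python 3.8 or later.

```python
import math

points = {"A": (1, 2), "B": (2, 3), "C": (8, 9), "D": (9, 10)}
centroids = {"C1": (1, 2), "C2": (8, 9)}

assignments = {}
for name, p in points.items():
    # Distance from this point to each centroid
    dists = {c: math.dist(p, pos) for c, pos in centroids.items()}
    # The nearest centroid is the one with the smallest distance
    nearest = min(dists, key=dists.get)
    assignments[name] = nearest
    print(name, {c: round(d, 2) for c, d in dists.items()}, "->", nearest)
```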

Step 3 — Update Centroids

New Centroid of Cluster 1

$$\mathbf{C_1=\left(\frac{1+2}{2},\frac{2+3}{2}\right)=(1.5,2.5)}$$

New Centroid of Cluster 2

$$\mathbf{C_2=\left(\frac{8+9}{2},\frac{9+10}{2}\right)=(8.5,9.5)}$$

Final Centroids

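$$\mathbf{C_1=(1.5,2.5)}, \mathbf{C_2=(8.5,9.5)}$$

Recomputing the distances with these centroids assigns every point to the same cluster as before, so the centroids no longer change and the algorithm stops.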
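As a check on the worked example, the sketch below recomputes the centroids from the two clusters and then reassigns every point to its nearest new centroid. The variable names are illustrative, and clusters are numbered 0 and 1 here instead of 1 and 2.

```python
import math

points = [(1, 2), (2, 3), (8, 9), (9, 10)]  # A, B, C, D
clusters = [[points[0], points[1]], [points[2], points[3]]]  # {A, B} and {C, D}

# Update step: each centroid is the coordinate-wise mean of its cluster
new_centroids = [
    tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
    for cluster in clusters
]

# Assignment step with the updated centroids
new_labels = [
    min(range(len(new_centroids)), key=lambda j: math.dist(p, new_centroids[j]))
    for p in points
]

print(new_centroids)  # [(1.5, 2.5), (8.5, 9.5)]
print(new_labels)     # [0, 0, 1, 1]
```

The labels are the same as in the first pass, which is why the example stops after one update.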

Objective Function

K-Means minimizes the Within-Cluster Sum of Squares (WCSS):

$$\mathbf{WCSS=\sum_{i=1}^{K}\sum_{x_j \in C_i}||x_j-\mu_i||^2}$$

$$\mathbf{C_i = \text{set of points in cluster } i}$$

$$\mathbf{\mu_i = \text{centroid of cluster } i}$$
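To make the objective concrete, here is a small sketch that evaluates WCSS for the final clusters of the example dataset. The `clusters` dictionary, which maps each centroid to its member points, is just a convenient layout assumed for this illustration.

```python
# Final clusters of the example: centroid -> points assigned to it
clusters = {
    (1.5, 2.5): [(1, 2), (2, 3)],
    (8.5, 9.5): [(8, 9), (9, 10)],
}

# Sum of squared Euclidean distances from each point to its own centroid
wcss = sum(
    sum((a - m) ** 2 for a, m in zip(p, mu))
    for mu, members in clusters.items()
    for p in members
)

print(wcss)  # 2.0
```

Each point lies 0.5 away from its centroid along both axes, so it contributes 0.25 + 0.25 = 0.5, and the four points together give 2.0.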

Advantages of K-means

simple and fast
easy to implement
scales well to large datasets (each iteration takes time linear in the number of points)
efficient clustering algorithm

Disadvantages

sensitive to outliers
choosing K is difficult
may converge to a local minimum, depending on the initial centroids
works poorly with non-spherical or unevenly sized clusters

Applications of K-Means

Customer segmentation
Image compression
Recommendation systems
Market analysis
Document clustering

Elbow Method

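Used to find a suitable K.

The graph plots K against WCSS:

K = number of clusters

WCSS = Within-Cluster Sum of Squares

WCSS generally keeps falling as K grows, so its minimum alone is not a useful guide. The “elbow point”, where the curve bends and further increases in K reduce WCSS only slightly, suggests a good number of clusters.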
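A minimal sketch of the elbow method with scikit-learn is shown below. It relies on the fitted model's `inertia_` attribute, which is scikit-learn's name for WCSS. The range of K values and the `n_init` and `random_state` settings are assumptions chosen for this tiny dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [2, 3], [8, 9], [9, 10]], dtype=float)

ks = range(1, 5)  # K cannot exceed the number of points (4 here)
wcss = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)  # WCSS of the fitted clustering

for k, value in zip(ks, wcss):
    print(k, round(value, 2))
```

For this dataset WCSS drops from 100 at K = 1 to 2 at K = 2 and changes little after that, so the elbow is at K = 2. Plotting `wcss` against `ks` gives the usual elbow graph.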

Python Example

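from sklearn.cluster import KMeans
import numpy as np

# Dataset: the four points A, B, C, D from the worked example
X = np.array([
    [1, 2],
    [2, 3],
    [8, 9],
    [9, 10]
])

# Create model; n_init and random_state are set explicitly for reproducible results
model = KMeans(n_clusters=2, n_init=10, random_state=42)

# Train
model.fit(X)

# Cluster label of each point (which cluster is numbered 0 or 1 is arbitrary)
print(model.labels_)

# Final centroids, expected to match (1.5, 2.5) and (8.5, 9.5) from the worked example
print(model.cluster_centers_)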

Conclusion

K-Means is a simple and efficient unsupervised machine learning algorithm used to group similar data points into clusters based on distance from centroids. It repeatedly assigns points to the nearest centroid and updates centroid positions until stable clusters are formed. Because of its speed and simplicity, K-Means is widely used in customer segmentation, recommendation systems, image compression, and data analysis, although its performance depends on choosing a suitable value of K and good initial centroids.
