#11 K-Means Clustering

K-Means is an unsupervised machine learning algorithm used to group similar data points into clusters.
It divides data into K different clusters based on similarity.
K-Means helps to:
group similar data points
identify hidden patterns
perform customer segmentation
organize unlabeled data
What is Clustering?
Clustering means grouping similar data points together.
Examples:
customers with similar shopping habits
similar documents
similar images
What is K in K-Means?
K represents the number of clusters. For example, if K = 3, the algorithm forms 3 clusters.
How K-Means Works
Step 1 — Choose K
Select the number of clusters, for example K = 3.
Step 2 — Initialize Centroids
Randomly choose K cluster centres; these are called centroids.
Step 3 — Calculate Distance
Compute the Euclidean distance between each point and every centroid, then assign each point to its nearest centroid:
$$\mathbf{d=\sqrt{(x_2-x_1)^2+(y_2-y_1)^2}}$$
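As a small sketch, the distance formula can be written in Python as follows. The function name `euclidean_distance` and the two sample points are illustrative choices, and this version accepts any number of coordinates rather than only two.

```python
import math

def euclidean_distance(p, q):
    """Return the straight-line distance between points p and q.

    p and q are sequences of coordinates of equal length.
    """
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))

# Hypothetical points (1, 2) and (4, 6): sqrt(3^2 + 4^2) = 5
print(euclidean_distance((1, 2), (4, 6)))  # 5.0
```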
Step 4 — Update Centroids
New Centroid formula:
$$\mathbf{C=\frac{1}{n}\sum_{i=1}^{n}x_i}$$
The centroid becomes the mean of all points in the cluster.
Step 5 — Repeat
Repeat the assignment and update steps until the centroids stop changing.
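The five steps can be sketched from scratch with NumPy. This is a minimal illustration, not a production implementation: the function name `kmeans`, the `max_iters` and `seed` parameters, the choice to draw the initial centroids from the data points, and the handling of an empty cluster (it keeps its old centroid) are all assumptions made here. The sample data is hypothetical.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-Means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    for _ in range(max_iters):
        # Step 3: Euclidean distance from every point to every centroid,
        # then assign each point to the nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 4: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, labels

# Hypothetical data: two well-separated groups of two points each
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids, labels = kmeans(X, k=2)
print(centroids)
print(labels)
```

Which cluster receives index 0 or 1 depends on the random initialization, so only the grouping itself is meaningful.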
Example Dataset
| Point | X | Y |
|---|---|---|
| A | 1 | 2 |
| B | 2 | 3 |
| C | 8 | 9 |
| D | 9 | 10 |
Suppose we choose K = 2, so there will be 2 clusters.
Step 1 — Initialize Centroids
Choose 2 initial centroids. Suppose C1 = (1,2) and C2 = (8,9), which coincide with points A and C.
Step 2 — Calculate Distance of Each Point
We will use the Euclidean distance.
Distance from Point A(1,2)
Distance from C1
$$\mathbf{d(A,C_1)=\sqrt{(1-1)^2+(2-2)^2}} = \mathbf{0}$$
Distance from C2
$$\mathbf{d(A,C_2)=\sqrt{(8-1)^2+(9-2)^2}} $$
$$=\sqrt{49+49}=\sqrt{98}\approx9.90$$
Nearest Centroid:
$$\mathbf{A \rightarrow C_1}$$
Calculating the distances for the other points in the same way gives the clusters:
| Point | d(C₁) | d(C₂) | Nearest Centroid | Cluster |
|---|---|---|---|---|
| B(2,3) | √2 ≈ 1.41 | √72 ≈ 8.49 | C₁ | Cluster 1 |
| C(8,9) | √98 ≈ 9.90 | 0 | C₂ | Cluster 2 |
| D(9,10) | √128 ≈ 11.31 | √2 ≈ 1.41 | C₂ | Cluster 2 |
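The assignment step for this dataset can be checked with a short NumPy sketch. The variable names (`points`, `centroids`, `assignments`) are illustrative; the data and the initial centroids are the ones used in this example.

```python
import numpy as np

points = {"A": (1, 2), "B": (2, 3), "C": (8, 9), "D": (9, 10)}
centroids = np.array([[1, 2], [8, 9]], dtype=float)  # C1, C2

assignments = {}
for name, p in points.items():
    # Distances from this point to C1 and to C2
    d = np.linalg.norm(centroids - np.array(p, dtype=float), axis=1)
    assignments[name] = int(d.argmin()) + 1  # 1 = Cluster 1, 2 = Cluster 2
    print(f"{name}{p}: d(C1)={d[0]:.2f}, d(C2)={d[1]:.2f} -> Cluster {assignments[name]}")
```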
Step 3 — Update Centroids
New Centroid of Cluster 1
$$\mathbf{C_1=\left(\frac{1+2}{2},\frac{2+3}{2}\right)=(1.5,2.5)}$$
New Centroid of Cluster 2
$$\mathbf{C_2=\left(\frac{8+9}{2},\frac{9+10}{2}\right)=(8.5,9.5)}$$
Final Centroids
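$$\mathbf{C_1=(1.5,2.5)}, \mathbf{C_2=(8.5,9.5)}$$
Reassigning the points to these centroids leaves the clusters unchanged (A and B stay in Cluster 1, C and D in Cluster 2), so the centroids stop changing and the algorithm has converged.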
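The update step and the stopping check for this example can be reproduced in a few lines. This is a sketch with illustrative variable names; labels 0 and 1 stand for Cluster 1 and Cluster 2.

```python
import numpy as np

X = np.array([[1, 2], [2, 3], [8, 9], [9, 10]], dtype=float)  # A, B, C, D
labels = np.array([0, 0, 1, 1])  # A, B -> Cluster 1; C, D -> Cluster 2

# Update step: each centroid is the mean of its cluster's points
new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
print(new_centroids)

# Reassign every point to its nearest updated centroid
distances = np.linalg.norm(X[:, None, :] - new_centroids[None, :, :], axis=2)
new_labels = distances.argmin(axis=1)

# If no point changes cluster, the centroids will not move again
converged = np.array_equal(new_labels, labels)
print(converged)
```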
Objective Function
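K-Means minimizes the Within-Cluster Sum of Squares (WCSS):
$$\mathbf{WCSS=\sum_{i=1}^{K}\sum_{x_j \in C_i}||x_j-\mu_i||^2}$$
where
$$\mathbf{C_i = \text{set of points in cluster } i}$$
$$\mathbf{\mu_i = \text{centroid of cluster } i}$$
Note that in this formula C_i is a set of points, whereas in the worked example above C₁ and C₂ denote centroids.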
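As a quick check, WCSS can be computed for the final clusters of the example dataset. The variable names are illustrative; the points, cluster assignments, and centroids are taken from the worked example.

```python
import numpy as np

X = np.array([[1, 2], [2, 3], [8, 9], [9, 10]], dtype=float)  # A, B, C, D
labels = np.array([0, 0, 1, 1])                  # cluster index of each point
centroids = np.array([[1.5, 2.5], [8.5, 9.5]])   # final centroids

# Squared distance from each point to the centroid of its own cluster
squared_distances = np.sum((X - centroids[labels]) ** 2, axis=1)
wcss = squared_distances.sum()
print(wcss)  # 2.0
```

Each point lies 0.5 away from its centroid along both axes, so it contributes 0.25 + 0.25 = 0.5, and the four points together give 2.0.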
Advantages of K-means
simple and fast
easy to implement
scales well to large datasets
each iteration takes time linear in the number of data points
Disadvantages
sensitive to outliers
choosing K is difficult
may converge to a local minimum, depending on the initial centroids
works poorly with non-spherical or unevenly sized clusters
Applications of K-Means
Customer segmentation
Image compression
Recommendation systems
Market analysis
Document clustering
Elbow Method
The elbow method is used to choose a suitable K.
The graph plots WCSS against K:
K = number of clusters
WCSS = Within-Cluster Sum of Squares
The “elbow point”, where WCSS stops dropping sharply as K increases, suggests a good number of clusters.
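A sketch of the elbow method on the four-point example dataset, using scikit-learn, is shown here. It relies on the `inertia_` attribute of a fitted `KMeans` model, which is scikit-learn's name for WCSS. The range of K values and the `n_init` and `random_state` settings are choices made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [2, 3], [8, 9], [9, 10]], dtype=float)

wcss_per_k = {}
for k in range(1, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss_per_k[k] = model.inertia_  # WCSS of the fitted clustering

for k, wcss in wcss_per_k.items():
    print(k, round(wcss, 2))
```

On this dataset WCSS should fall from 100 at K = 1 to 2 at K = 2 and change very little after that, which places the elbow at K = 2. Plotting `wcss_per_k` with a charting library makes the bend easier to see on larger datasets.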
Python Example
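```python
from sklearn.cluster import KMeans
import numpy as np

# Dataset: points A, B, C, D from the worked example
X = np.array([
    [1, 2],
    [2, 3],
    [8, 9],
    [9, 10],
])

# Create the model; n_init and random_state are set so runs are reproducible
model = KMeans(n_clusters=2, n_init=10, random_state=0)

# Train
model.fit(X)

# Cluster label of each point (the numbering 0/1 is arbitrary)
print(model.labels_)
```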
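Beyond `labels_`, a fitted scikit-learn model also exposes the learned centroids (`cluster_centers_`) and the WCSS (`inertia_`), and `predict` assigns new points to the nearest centroid. The following sketch refits the same model so that it runs on its own; the `n_init` and `random_state` values and the new point `(0, 1)` are hypothetical choices.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [2, 3], [8, 9], [9, 10]])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Learned centroids, one row per cluster
print(model.cluster_centers_)

# WCSS of the fitted model
print(model.inertia_)

# Assign a new, hypothetical point to the nearest learned centroid
new_point = np.array([[0, 1]])
print(model.predict(new_point))
```

The centroids should match the hand-computed values (1.5, 2.5) and (8.5, 9.5), although the order of the rows depends on how the clusters happen to be numbered.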
Conclusion
K-Means is a simple and efficient unsupervised machine learning algorithm used to group similar data points into clusters based on distance from centroids. It repeatedly assigns points to the nearest centroid and updates centroid positions until stable clusters are formed. Because of its speed and simplicity, K-Means is widely used in customer segmentation, recommendation systems, image compression, and data analysis, although its performance depends on choosing a suitable value of K and good initial centroids.





