#11 K-Means Clustering

Machine Learning Engineer and open-source developer focused on NLP, LLM applications, Retrieval-Augmented Generation (RAG), semantic search, and AI infrastructure. I enjoy building developer tools, portable AI systems, and production-ready ML pipelines using Python, FastAPI, FAISS, LangChain, TensorFlow, and PyTorch. Creator of: • RagBucket — portable executable RAG artifacts for Python • LazyTune — fast hyperparameter optimization library • AkBOT — AI portfolio chatbot using RAG Contributor to open-source projects including NumPy and LocalStack.

K-Means is an unsupervised machine learning algorithm used to group similar data points into clusters.
It divides data into K different clusters based on similarity.

K-Means is commonly used to:

  • group similar data points

  • identify hidden patterns

  • perform customer segmentation

  • organize unlabeled data

What is Clustering?

Clustering means grouping similar data points together, so that points in the same group are more alike than points in different groups.

Examples:

  • customers with similar shopping habits

  • similar documents

  • similar images

What is K in K-Means?

K is the number of clusters. For example, if K = 3, the data is divided into 3 clusters.

How K-Means Works

Step 1 — Choose K

Select the number of clusters. Suppose K = 3.

Step 2 — Initialize Centroids

Randomly choose K cluster centres. These are called centroids.

Step 3 — Calculate Distance

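Compute the distance from each point to every centroid and assign the point to its nearest centroid. The usual choice is the Euclidean distance: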

$$\mathbf{d=\sqrt{(x_2-x_1)^2+(y_2-y_1)^2}}$$
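The formula above translates almost directly into code. The following is a minimal sketch; the function name `euclidean_distance` is illustrative and not part of any library, and it accepts points of any equal dimension rather than only 2D.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Distance between (1, 2) and (8, 9) is sqrt(49 + 49) = sqrt(98)
print(euclidean_distance((1, 2), (8, 9)))
```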

Step 4 — Update Centroids

New Centroid formula:

$$\mathbf{C=\frac{1}{n}\sum_{i=1}^{n}x_i}$$

Here n is the number of points in the cluster and x_i are those points, so the new centroid is the mean of all points in the cluster.

Step 5 — Repeat

Repeat the assignment step and the update step until the centroids stop changing.
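The five steps can be put together in a short NumPy sketch. This is one possible implementation, not a reference one: the function name `kmeans`, the `max_iter` and `seed` parameters, the choice of starting centroids from the data points, and the handling of empty clusters are all assumptions made for illustration.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 3: distance from every point to every centroid, shape (n, k),
        # then assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points.
        # An empty cluster keeps its old centroid (a simplification).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1, 2], [2, 3], [8, 9], [9, 10]], dtype=float)
centroids, labels = kmeans(X, k=2)
print(centroids)
print(labels)
```

Which cluster gets label 0 and which gets label 1 depends on the random starting centroids, so only the grouping is meaningful, not the label numbers.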

Example Dataset

| Point | X | Y  |
|-------|---|----|
| A     | 1 | 2  |
| B     | 2 | 3  |
| C     | 8 | 9  |
| D     | 9 | 10 |

Suppose we choose K = 2, so there will be 2 clusters.

Step 1 — Initialize Centroids

Choose 2 random centroids. Suppose C1 = (1,2) and C2 = (8,9).

Step 2 — Calculate Distance of Each Point

We will use the Euclidean distance.

Distance from Point A(1,2)

Distance from C1

$$\mathbf{d(A,C_1)=\sqrt{(1-1)^2+(2-2)^2}} = \mathbf{0}$$

Distance from C2

$$\mathbf{d(A,C_2)=\sqrt{(8-1)^2+(9-2)^2}} $$

$$=\sqrt{49+49}=\sqrt{98}\approx9.90$$

Nearest Centroid:

$$\mathbf{A \rightarrow C_1}$$

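Calculate the distances for the other points in the same way and form the clusters:

| Point   | d(C₁)        | d(C₂)       | Nearest Centroid | Cluster   |
|---------|--------------|-------------|------------------|-----------|
| B(2,3)  | √2 ≈ 1.41    | √72 ≈ 8.49  | C₁               | Cluster 1 |
| C(8,9)  | √98 ≈ 9.90   | 0           | C₂               | Cluster 2 |
| D(9,10) | √128 ≈ 11.31 | √2 ≈ 1.41   | C₂               | Cluster 2 |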
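The distances and assignments for all four points can be checked with a few lines of NumPy. This is only a sketch; the `points` and `assignments` dictionaries are illustrative names, and the centroids are the initial C1 = (1,2) and C2 = (8,9) chosen above.

```python
import numpy as np

points = {"A": (1, 2), "B": (2, 3), "C": (8, 9), "D": (9, 10)}
centroids = np.array([[1, 2], [8, 9]], dtype=float)  # C1, C2

assignments = {}
for name, p in points.items():
    # Distance from this point to C1 and to C2
    d = np.linalg.norm(centroids - np.array(p, dtype=float), axis=1)
    # argmin is 0-based, clusters in the text are numbered from 1
    assignments[name] = int(d.argmin()) + 1
    print(f"{name}: d(C1)={d[0]:.2f}  d(C2)={d[1]:.2f}  -> Cluster {assignments[name]}")
```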

Step 3 — Update Centroids

New Centroid of Cluster 1

$$\mathbf{C_1=\left(\frac{1+2}{2},\frac{2+3}{2}\right)=(1.5,2.5)}$$

New Centroid of Cluster 2

$$\mathbf{C_2=\left(\frac{8+9}{2},\frac{9+10}{2}\right)=(8.5,9.5)}$$

Final Centroids

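$$\mathbf{C_1=(1.5,2.5)}, \mathbf{C_2=(8.5,9.5)}$$

Reassigning the points using these centroids gives the same two clusters, so the centroids no longer change and the algorithm stops.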
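The update step for this example can be reproduced in NumPy. The sketch below also re-runs the assignment step with the new centroids to check that no point changes cluster; all variable names are illustrative.

```python
import numpy as np

cluster_1 = np.array([[1, 2], [2, 3]], dtype=float)    # points A and B
cluster_2 = np.array([[8, 9], [9, 10]], dtype=float)   # points C and D

# New centroid = mean of the points in each cluster
new_c1 = cluster_1.mean(axis=0)
new_c2 = cluster_2.mean(axis=0)
print(new_c1, new_c2)

# Reassign every point to its nearest new centroid (0 = C1, 1 = C2)
X = np.vstack([cluster_1, cluster_2])
new_centroids = np.vstack([new_c1, new_c2])
labels = np.linalg.norm(X[:, None] - new_centroids[None], axis=2).argmin(axis=1)
print(labels)  # an unchanged assignment means the algorithm has converged
```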

Objective Function

K-Means minimizes the Within-Cluster Sum of Squares (WCSS):

$$\mathbf{WCSS=\sum_{i=1}^{K}\sum_{x_j \in C_i}||x_j-\mu_i||^2}$$

$$\mathbf{C_i = \text{cluster}}$$

$$\mathbf{\mu_i = \text{centroid}}$$
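As a rough illustration, WCSS can be computed by summing the squared distance from each point to the centroid of its own cluster. The helper name `wcss` is hypothetical; the data, labels, and centroids are the ones from the worked example.

```python
import numpy as np

def wcss(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    diffs = X - centroids[labels]   # centroids[labels] picks mu_i for each x_j
    return float(np.sum(diffs ** 2))

X = np.array([[1, 2], [2, 3], [8, 9], [9, 10]], dtype=float)
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.5, 2.5], [8.5, 9.5]])

# Each point is offset by 0.5 in x and 0.5 in y from its centroid,
# so it contributes 0.25 + 0.25 = 0.5, and the total is 4 * 0.5 = 2.0
print(wcss(X, labels, centroids))
```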

Advantages of K-means

  • simple and fast

  • easy to implement

  • works well on large datasets

  • efficient: the cost of each iteration grows linearly with the number of points

Disadvantages

  • sensitive to outliers

  • choosing K is difficult

  • may converge to local minima

  • works poorly with irregular (non-spherical or unevenly sized) clusters
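The sensitivity to outliers follows from the centroid being a mean: a single distant point can pull it far away from the rest of the cluster. The small sketch below uses made-up data to show the effect.

```python
import numpy as np

# Four nearby points (made-up data for illustration)
cluster = np.array([[1, 2], [2, 3], [1, 3], [2, 2]], dtype=float)
# The same cluster with one distant outlier added
with_outlier = np.vstack([cluster, [[50, 50]]])

centroid_clean = cluster.mean(axis=0)         # (1.5, 2.5)
centroid_outlier = with_outlier.mean(axis=0)  # (11.2, 12.0)
print(centroid_clean)
print(centroid_outlier)
```

With the outlier included, the centroid no longer sits near any of the original four points.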

Applications of K-Means

  • Customer segmentation

  • Image compression

  • Recommendation systems

  • Market analysis

  • Document clustering

Elbow Method

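The elbow method is used to find a suitable K.

The graph plots K against WCSS, where:

K = number of clusters

WCSS = Within-Cluster Sum of Squares

WCSS always decreases as K grows. The “elbow point”, where the decrease slows down sharply, is taken as a good number of clusters.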
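A possible way to produce the values for such a plot with scikit-learn is sketched below. The data is synthetic (three made-up blobs), the range of K is an arbitrary choice, and `inertia_` is the attribute scikit-learn uses for the WCSS of a fitted model.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: three well-separated blobs of 50 points each (an assumption)
X = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(50, 2))
    for center in [(0, 0), (5, 5), (0, 5)]
])

wcss_by_k = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss_by_k[k] = model.inertia_  # WCSS of the fitted clustering

for k, value in wcss_by_k.items():
    print(k, round(value, 1))
```

For data like this, the drop in WCSS should flatten out after K = 3, which is where the elbow would appear if the values were plotted.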

Python Example

from sklearn.cluster import KMeans
import numpy as np

# Dataset
X = np.array([
    [1,2],
    [2,3],
    [8,9],
    [9,10]
])

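# Create model (n_init and random_state are fixed so repeated runs give the same result)
model = KMeans(n_clusters=2, n_init=10, random_state=0)

# Train
model.fit(X)

# Cluster labels, e.g. [0 0 1 1] or [1 1 0 0]; the label numbers themselves are arbitrary
print(model.labels_)

# Final centroids
print(model.cluster_centers_)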
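Once a model is fitted, it can also assign previously unseen points to the nearest learned centroid with `predict`. The sketch below refits the same dataset so that it runs on its own; the two new points are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [2, 3], [8, 9], [9, 10]])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Hypothetical new points: one near the first group, one near the second
new_points = np.array([[0, 1], [10, 11]])
predicted = model.predict(new_points)
print(predicted)
```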

Conclusion

K-Means is a simple and efficient unsupervised machine learning algorithm used to group similar data points into clusters based on their distance from centroids. It repeatedly assigns points to the nearest centroid and updates the centroid positions until the clusters are stable. Because of its speed and simplicity, K-Means is widely used in customer segmentation, recommendation systems, image compression, and data analysis, although its performance depends on choosing a suitable value of K and good initial centroids.
