K-Means Clustering

What

Partition data into k groups by iteratively assigning points to nearest centroid and updating centroids.

Algorithm

  1. Pick k random centroids
  2. Assign each point to nearest centroid
  3. Recompute centroids as mean of assigned points
  4. Repeat 2-3 until convergence
from sklearn.cluster import KMeans
 
model = KMeans(n_clusters=3, n_init=10)
model.fit(X)
labels = model.labels_          # cluster assignment for each point
centers = model.cluster_centers_ # centroid coordinates

Choosing k

  • Elbow method: plot inertia (within-cluster sum of squares) vs k, pick the “elbow”
  • Silhouette score: measures how well points fit their cluster vs other clusters
from sklearn.metrics import silhouette_score
 
for k in range(2, 11):
    model = KMeans(n_clusters=k).fit(X)
    print(f"k={k}, silhouette={silhouette_score(X, model.labels_):.3f}")

Limitations

  • Must specify k upfront
  • Assumes spherical, equally-sized clusters
  • Sensitive to initialization (use n_init=10)
  • Sensitive to scale → use Feature Scaling

Alternatives

  • DBSCAN: finds arbitrary-shaped clusters, doesn’t need k
  • Hierarchical clustering: produces a dendrogram, no k needed
  • Gaussian Mixture Models: soft assignments (probabilities)