K-Means Clustering
What
Partition data into k groups by iteratively assigning points to nearest centroid and updating centroids.
Algorithm
1. Pick k random centroids
2. Assign each point to its nearest centroid
3. Recompute centroids as the mean of assigned points
4. Repeat steps 2-3 until convergence (assignments stop changing)
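The steps above can be sketched in plain NumPy (a teaching sketch, not a replacement for sklearn's KMeans):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick k distinct data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each centroid as the mean of its points
        # (keep the old centroid if a cluster goes empty -- a simple guard)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # step 4: stop once centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```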
```python
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3, n_init=10)
model.fit(X)
labels = model.labels_            # cluster assignment for each point
centers = model.cluster_centers_  # centroid coordinates
```
Choosing k
- Elbow method: plot inertia (within-cluster sum of squares) vs k, pick the “elbow”
- Silhouette score: measures how well points fit their cluster vs other clusters
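A minimal elbow-method sketch, collecting inertia_ for a range of k (the make_blobs data is just a stand-in for your own X):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# toy data for illustration; substitute your own X
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

ks = range(1, 11)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)  # within-cluster sum of squares

# plot ks vs inertias and look for the bend where the curve flattens:
# import matplotlib.pyplot as plt; plt.plot(list(ks), inertias, "o-"); plt.show()
```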
```python
from sklearn.metrics import silhouette_score

for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10).fit(X)
    print(f"k={k}, silhouette={silhouette_score(X, model.labels_):.3f}")
```
Limitations
- Must specify k upfront
- Assumes spherical, equally-sized clusters
- Sensitive to initialization (use n_init=10)
- Sensitive to scale → use Feature Scaling
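To address the scale sensitivity, standardize features before clustering. A sketch with StandardScaler on made-up data where one feature's raw scale would dominate the distances:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# toy data: the second feature's scale (std=1000) would swamp the first
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])

X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
```

Wrapping the scaler and KMeans in a sklearn Pipeline achieves the same thing and avoids leaking the scaling step.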
Alternatives
- DBSCAN: finds arbitrary-shaped clusters, doesn’t need k
- Hierarchical clustering: produces a dendrogram, no k needed
- Gaussian Mixture Models: soft assignments (probabilities)
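As an example of soft assignments, a GaussianMixture sketch on toy blob data (predict_proba gives per-point membership probabilities; predict collapses them to a hard label):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
probs = gmm.predict_proba(X)  # soft assignments: shape (300, 3), rows sum to 1
hard = gmm.predict(X)         # hard assignment = most probable component
```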
Links
- PCA — often used before clustering for visualization
- Supervised vs Unsupervised Learning
- Feature Scaling