PCA
Principal Component Analysis — one of the most common dimensionality reduction techniques.
What
Find the directions (principal components) along which data varies the most. Project data onto fewer dimensions while preserving maximum variance.
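Formally, the first component is the unit vector w that maximizes the projected variance wᵀΣw, where Σ is the data's covariance matrix; the maximizer is the top eigenvector of Σ, and each subsequent component solves the same problem orthogonal to the ones before it.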
How it works
- Standardize features (zero mean, unit variance)
- Compute covariance matrix
- Find eigenvectors (directions) and eigenvalues (variance explained)
- Keep top-k eigenvectors → project data into k dimensions
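A minimal NumPy sketch of these four steps, assuming X is an array with samples as rows:

import numpy as np

# 1. Standardize: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
# 2. Covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)
# 3. Eigendecomposition (eigh, since the covariance matrix is symmetric)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
# 4. Sort directions by descending variance, keep top k, project
order = np.argsort(eigenvalues)[::-1]
k = 2
X_reduced = X_std @ eigenvectors[:, order[:k]]

In practice, scikit-learn does this for you (internally via SVD):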
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
# How much variance each component captures
print(pca.explained_variance_ratio_)
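Note that scikit-learn's PCA only centers the data; it does not scale it. To include the standardization step from above, scale explicitly first (a sketch using the bundled iris data purely for illustration):

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA

X = load_iris().data
# Chain scaling and PCA so new data gets the same treatment
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
X_reduced = pipeline.fit_transform(X)

Use cases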
- Visualization: reduce to 2-3D for plotting
- Noise reduction: small components are often noise (see the reconstruction sketch after this list)
- Speed up training: fewer features = faster models
- Decorrelation: PCA components are uncorrelated
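A minimal denoising sketch via reconstruction, assuming X is a noisy feature matrix and that 10 components is a reasonable cut (both are illustration-only assumptions):

import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=10)                     # keep only the strongest directions
X_reduced = pca.fit_transform(X)               # project down
X_denoised = pca.inverse_transform(X_reduced)  # map back to the original space
# Reconstruction error: what the dropped components (ideally noise) carried
print(np.mean((X - X_denoised) ** 2))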
Choosing number of components
# Keep enough to explain 95% of variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(f"Kept {pca.n_components_} components")Limitations
- Only captures linear relationships
- Components are linear combinations of all original features, so they are hard to interpret
- For nonlinear: use t-SNE (visualization) or UMAP
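For comparison, a minimal t-SNE sketch with scikit-learn (unlike PCA, TSNE has no transform method for unseen data, so it is for visualization only):

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_embedded = tsne.fit_transform(X)  # 2-D embedding for plotting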
Links
- Eigenvalues and Eigenvectors — the math behind PCA
- Matrix Decomposition — SVD computes PCA efficiently
- Feature Scaling — must standardize before PCA
- K-Means Clustering — often used together