PCA

Principal Component Analysis — the most common dimensionality reduction technique.

What

Find the directions (principal components) along which data varies the most. Project data onto fewer dimensions while preserving maximum variance.

How it works

  1. Standardize features (zero mean, unit variance)
  2. Compute covariance matrix
  3. Find eigenvectors (directions) and eigenvalues (variance explained)
  4. Keep top-k eigenvectors → project data into k dimensions
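
A minimal sketch of these four steps in plain NumPy (using an eigendecomposition of the covariance matrix; library implementations usually use an SVD instead, and pca_reduce is just an illustrative name):

import numpy as np

def pca_reduce(X, k):
    # 1. standardize: zero mean, unit variance per feature
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. covariance matrix (features x features)
    cov = np.cov(Xs, rowvar=False)
    # 3. eigenvectors = directions, eigenvalues = variance along them
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1]       # sort directions by variance, descending
    # 4. keep the top-k directions and project onto them
    return Xs @ eigvecs[:, order[:k]]

The same steps via scikit-learn, which centers the data and does the decomposition (via SVD) internally; scaling is still on you: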
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data  # example data; substitute your own feature matrix
X_scaled = StandardScaler().fit_transform(X)  # step 1: zero mean, unit variance

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# How much variance each component captures
print(pca.explained_variance_ratio_)

Use cases

  • Visualization: reduce to 2-3D for plotting
  • Noise reduction: the low-variance components are often mostly noise
  • Speed up training: fewer features = faster models
  • Decorrelation: PCA components are uncorrelated (verified in the sketch below)
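
A quick check of the decorrelation claim, using the iris data as a stand-in for your own: after the transform, the covariance matrix of the components is (numerically) diagonal, with the explained variances on the diagonal.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_t = PCA(n_components=4).fit_transform(load_iris().data)

# Off-diagonal entries are ~0: the components are uncorrelated
print(np.round(np.cov(X_t, rowvar=False), 6))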

Choosing number of components

# Keep enough to explain 95% of variance
pca = PCA(n_components=0.95)  # a float in (0, 1) picks the smallest k reaching that variance fraction
X_reduced = pca.fit_transform(X)
print(f"Kept {pca.n_components_} components")

Limitations

  • Only captures linear structure; nonlinear patterns get flattened out
  • Components are linear mixes of all original features, so they are hard to interpret
  • For nonlinear structure: use t-SNE (visualization only) or UMAP (see the sketch below)
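
A minimal t-SNE sketch with scikit-learn, again with iris as placeholder data (UMAP is similar but lives in the separate umap-learn package):

from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

# t-SNE preserves local neighborhoods rather than global variance,
# so use it for 2-3D plots, not as a general feature-reduction step
X_embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(load_iris().data)
print(X_embedded.shape)  # (150, 2)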