Regularization

What

Adding a penalty for model complexity to prevent overfitting. Forces the model to find simpler patterns.

Methods

L1 (Lasso) — sparsity

Adds the sum of absolute weights, λ·Σ|wᵢ|, to the loss. Drives some weights to exactly zero → automatic feature selection.

L2 (Ridge) — small weights

Adds the sum of squared weights, λ·Σwᵢ², to the loss. Shrinks all weights toward zero but doesn’t eliminate any.

Elastic Net — both

Combines the L1 and L2 penalties: sparsity from L1, plus L2’s stability when features are correlated. The mix is controlled by l1_ratio.

from sklearn.linear_model import Ridge, Lasso, ElasticNet
 
# alpha controls regularization strength
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.1)   # some coefficients will be exactly 0
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
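To see the sparsity difference in practice, here is a minimal sketch on synthetic data (the data setup is illustrative: only the first 3 of 10 features carry signal). Lasso should zero out some of the irrelevant coefficients; Ridge shrinks them but leaves them nonzero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# hypothetical setup: only features 0-2 actually influence y
y = 3 * X[:, 0] - 2 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# count coefficients that are exactly zero
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print(n_zero_lasso, n_zero_ridge)
```

Lasso sets several of the seven noise coefficients to exactly zero; Ridge keeps all ten nonzero, just small.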

Deep learning regularization

  • Dropout: randomly zero out neurons during training → forces redundancy
  • Weight decay: L2 regularization applied in the optimizer’s update step (decoupled from the loss in AdamW)
  • Early stopping: stop training when validation loss stops improving
  • Data augmentation: artificially increase training data variety
  • Batch normalization: stabilizes training, mild regularization effect
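Dropout is simple enough to sketch directly. Below is a minimal NumPy implementation of “inverted” dropout (the variant standard libraries use): zero each activation with probability p during training and scale survivors by 1/(1−p) so the expected activation is unchanged, which lets inference skip dropout entirely. Function name and setup are illustrative.

```python
import numpy as np

def dropout(x, p=0.5, rng=None, training=True):
    """Inverted dropout: zero each activation with probability p,
    scale the survivors by 1/(1-p) to keep the expected value unchanged."""
    if not training or p == 0.0:
        return x  # inference: identity, no rescaling needed
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p  # keep with probability 1-p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
acts = np.ones((4, 8))
out = dropout(acts, p=0.5, rng=rng)  # entries are either 0.0 or 2.0
```

During training each forward pass sees a different random mask, so no single neuron can be relied on — that is the redundancy pressure.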

Intuition

A model that fits training data perfectly probably memorized noise. Regularization says: “find patterns, but keep it simple.” The penalty trades a little training accuracy for much better generalization.
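The trade-off is directly measurable. A sketch with an illustrative setup: fit a high-degree polynomial with and without an L2 penalty. Ordinary least squares is guaranteed to fit the training data at least as well as Ridge, because Ridge deliberately gives up some training fit for the penalty.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=30)
y = np.sin(2 * x) + rng.normal(scale=0.3, size=30)  # noisy samples of sin(2x)

# degree-9 features on 30 points: plenty of capacity to memorize noise
X = PolynomialFeatures(degree=9).fit_transform(x.reshape(-1, 1))

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

train_r2_ols = ols.score(X, y)
train_r2_ridge = ridge.score(X, y)
print(train_r2_ols, train_r2_ridge)  # OLS >= Ridge on training data
```

The unregularized fit always wins on training R²; regularization is a bet that some of that training accuracy was spent memorizing noise.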