Optimizers

What

Algorithms that update model weights using gradients to reduce the loss. Gradient Descent is the simplest; modern optimizers add momentum and per-parameter adaptive learning rates.
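A minimal sketch of the vanilla update (function name and toy objective are illustrative, not from any library):

```python
def gradient_descent_step(w, grad, lr=0.1):
    # Vanilla gradient descent: move against the gradient
    return w - lr * grad

# Minimize f(w) = w^2, whose gradient is 2w
w = 1.0
for _ in range(10):
    w = gradient_descent_step(w, 2 * w)
# each step multiplies w by (1 - 2 * lr) = 0.8, so w shrinks toward 0
```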

Common optimizers

SGD with Momentum

velocity = momentum × velocity - lr × gradient
weight += velocity

Momentum accumulates an exponentially decaying average of past gradients → damps oscillations and speeds convergence along consistent directions.
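The two-line rule above can be run numerically; a plain-Python sketch on the toy objective f(w) = w² (function name and hyperparameters are illustrative):

```python
def sgd_momentum_step(weight, gradient, velocity, lr=0.1, momentum=0.9):
    # Exactly the update rule above: velocity blends past steps with the new gradient
    velocity = momentum * velocity - lr * gradient
    weight = weight + velocity
    return weight, velocity

w, v = 1.0, 0.0
for _ in range(3):
    w, v = sgd_momentum_step(w, 2 * w, v)  # gradient of w^2 is 2w
```

In PyTorch, `torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)` implements an equivalent rule (it folds `lr` into the update at a slightly different point).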

Adam (Adaptive Moment Estimation)

The default. Combines momentum with per-parameter adaptive learning rates.

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
  • Maintains running mean and variance of gradients
  • Each parameter gets its own effective learning rate
  • Works well out of the box
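The bullets above can be made concrete with a single-parameter sketch of the Adam update (function name is hypothetical; the constants are Adam's published defaults):

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Running mean (first moment) and variance (second moment) of gradients
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    # Bias correction: the moments start at zero, so early estimates are rescaled
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Effective per-parameter step: lr scaled by that parameter's gradient history
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# On the very first step, the step size is ~lr regardless of gradient magnitude
w, m, v = adam_step(0.5, 10.0, 0.0, 0.0, t=1)
```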

AdamW

Adam with decoupled weight decay: the decay is applied directly to the weights rather than folded into the gradient as L2 regularization, so it isn't distorted by the adaptive rescaling. Preferred over Adam in modern practice.

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
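To see why decoupling matters, a toy single-step comparison (functions are hypothetical and simplified to the first update, where the bias-corrected moments equal the raw gradient): Adam folds `weight_decay` into the gradient, where the adaptive rescaling largely cancels it; AdamW subtracts it from the weight directly.

```python
import math

def adam_l2_step(w, g, wd=0.01, lr=1e-3, eps=1e-8):
    # Adam-style L2: decay added to the gradient, then rescaled away
    g = g + wd * w
    m_hat, v_hat = g, g * g          # first step, after bias correction
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

def adamw_step(w, g, wd=0.01, lr=1e-3, eps=1e-8):
    # AdamW: decay applied to the weight, outside the rescaling
    m_hat, v_hat = g, g * g
    return w - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * w)

w_adam = adam_l2_step(2.0, 0.5)   # step is ~lr: the decay term got normalized away
w_adamw = adamw_step(2.0, 0.5)    # step is lr * (1 + wd * w): decay survives intact
```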

Which to use

| Optimizer | When |
| --- | --- |
| AdamW | Default for most deep learning |
| SGD + momentum | When you want to tune for best performance (vision) |
| Adam | Quick experiments, when AdamW isn't available |

Learning rate schedules

from torch.optim.lr_scheduler import CosineAnnealingLR, OneCycleLR

# Cosine decay over training; call scheduler.step() once per epoch
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)
# or warm up then anneal; OneCycleLR expects scheduler.step() once per batch
scheduler = OneCycleLR(optimizer, max_lr=1e-3, steps_per_epoch=len(loader), epochs=num_epochs)
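Cosine annealing follows a simple closed form; a plain-Python sketch of the schedule (function name and learning-rate values are illustrative):

```python
import math

def cosine_lr(t, t_max, lr_max=1e-3, lr_min=0.0):
    # Cosine annealing: start at lr_max, decay smoothly to lr_min over t_max steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / t_max))
```

At t = 0 this returns lr_max, at t = t_max it returns lr_min, and halfway through it sits at the midpoint — the slow-fast-slow decay that makes cosine schedules gentle at both ends.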