Optimizers
What
Algorithms that update model weights using gradients. Gradient Descent is the simplest; modern optimizers add momentum and adaptive learning rates.
Common optimizers
SGD with Momentum
```python
velocity = momentum * velocity - lr * gradient
weight += velocity
```
Momentum accumulates past gradients → smoother, faster convergence.
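The two-line update above can be run as a self-contained sketch. This is a toy 1-D example with hand-picked `lr` and `momentum` values (not the PyTorch API):

```python
# Toy 1-D problem: minimize f(w) = w**2, so gradient = 2 * w
lr, momentum = 0.1, 0.9
w, velocity = 5.0, 0.0
for _ in range(100):
    gradient = 2 * w
    velocity = momentum * velocity - lr * gradient  # accumulate past gradients
    w += velocity                                   # apply the velocity, not the raw gradient
# w spirals in toward the minimum at 0
```

With momentum the iterate overshoots and oscillates but converges faster than plain gradient descent at the same learning rate, which is the smoothing effect described above.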
Adam (Adaptive Moment Estimation)
The default. Combines momentum with per-parameter adaptive learning rates.
```python
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```
- Maintains running estimates of each gradient's mean and (uncentered) variance
- Each parameter gets its own effective learning rate
- Works well out of the box
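The moment estimates can be written out for a single scalar parameter. This is an illustrative from-scratch sketch (with bias correction), not the `torch.optim` implementation:

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for scalar parameter w with gradient g at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * g        # running mean of gradients (first moment)
    v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients (second moment)
    m_hat = m / (1 - beta1 ** t)           # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w**2 from w = 5
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 5001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
```

The division by `sqrt(v_hat)` is what gives each parameter its own effective learning rate: parameters with consistently large gradients take proportionally smaller steps.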
AdamW
Adam with decoupled weight decay: the decay is applied directly to the weights rather than folded into the gradient as L2 regularization, which interacts badly with Adam's adaptive scaling. Preferred over Adam in modern practice.
```python
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```
Which to use
| Optimizer | When |
|---|---|
| AdamW | Default for most deep learning |
| SGD + momentum | Can match or beat AdamW with careful tuning (common in vision) |
| Adam | Quick experiments, when AdamW isn’t available |
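Whichever you pick, the optimizer slots into the same training step. A minimal sketch with a toy model and random data (`model`, `x`, `y` are placeholders invented here):

```python
import torch

# Toy setup: a linear model on random data (illustrative only)
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
loss_fn = torch.nn.MSELoss()

x, y = torch.randn(32, 4), torch.randn(32, 1)
for _ in range(5):
    optimizer.zero_grad()            # clear gradients from the previous step
    loss = loss_fn(model(x), y)
    loss.backward()                  # compute gradients
    optimizer.step()                 # apply the AdamW update
```

Swapping in SGD or Adam only changes the constructor line; the `zero_grad` / `backward` / `step` pattern stays the same.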
Learning rate schedules
```python
from torch.optim.lr_scheduler import CosineAnnealingLR, OneCycleLR

# Cosine decay from the initial lr toward 0 over num_epochs; step once per epoch
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)
# or: one warmup-then-anneal cycle; step once per batch
scheduler = OneCycleLR(optimizer, max_lr=1e-3, steps_per_epoch=len(loader), epochs=num_epochs)
```
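Where `scheduler.step()` goes depends on the schedule. A sketch using CosineAnnealingLR stepped once per epoch, after the optimizer step (the toy model and data are assumptions for illustration):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(4, 1)                # toy model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
num_epochs = 3
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)

x, y = torch.randn(8, 4), torch.randn(8, 1)
for epoch in range(num_epochs):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()                         # anneal the lr once per epoch

lr_now = optimizer.param_groups[0]["lr"]     # has decayed below the initial 1e-3
```

For OneCycleLR the `scheduler.step()` call would instead move inside the batch loop, once per optimizer step.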