Optimizers

What

Algorithms that update model weights using gradients to reduce the loss. Gradient Descent is the simplest; modern optimizers add momentum and per-parameter adaptive learning rates.
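A minimal sketch of the vanilla update (function name and toy objective are illustrative, not from any library):

```python
def gradient_descent_step(w, grad, lr=0.1):
    # Vanilla gradient descent: move against the gradient
    return w - lr * grad

# Minimize f(w) = w^2, whose gradient is 2w
w = 1.0
for _ in range(10):
    w = gradient_descent_step(w, 2 * w)
# each step multiplies w by (1 - 2 * lr) = 0.8, so w shrinks toward 0
```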

Common optimizers

SGD with Momentum

velocity = momentum × velocity - lr × gradient
weight += velocity

Momentum accumulates an exponentially decaying average of past gradients → damps oscillations and speeds convergence along consistent directions.
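The two-line rule above can be run numerically; a plain-Python sketch on the toy objective f(w) = w² (function name and hyperparameters are illustrative):

```python
def sgd_momentum_step(weight, gradient, velocity, lr=0.1, momentum=0.9):
    # Exactly the update rule above: velocity blends past steps with the new gradient
    velocity = momentum * velocity - lr * gradient
    weight = weight + velocity
    return weight, velocity

w, v = 1.0, 0.0
for _ in range(3):
    w, v = sgd_momentum_step(w, 2 * w, v)  # gradient of w^2 is 2w
```

In PyTorch, `torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)` implements an equivalent rule (it folds `lr` into the update at a slightly different point).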

Adam (Adaptive Moment Estimation)

The default. Combines momentum with per-parameter adaptive learning rates.

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
  • Maintains running mean and variance of gradients
  • Each parameter gets its own effective learning rate
  • Works well out of the box
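The bullets above can be made concrete with a single-parameter sketch of the Adam update (function name is hypothetical; the constants are Adam's published defaults):

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Running mean (first moment) and variance (second moment) of gradients
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    # Bias correction: the moments start at zero, so early estimates are rescaled
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Effective per-parameter step: lr scaled by that parameter's gradient history
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# On the very first step, the step size is ~lr regardless of gradient magnitude
w, m, v = adam_step(0.5, 10.0, 0.0, 0.0, t=1)
```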

AdamW

Adam with decoupled weight decay: the decay is applied directly to the weights rather than folded into the gradient as L2 regularization, so it isn't distorted by the adaptive rescaling. Preferred over Adam in modern practice.

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
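To see why decoupling matters, a toy single-step comparison (functions are hypothetical and simplified to the first update, where the bias-corrected moments equal the raw gradient): Adam folds `weight_decay` into the gradient, where the adaptive rescaling largely cancels it; AdamW subtracts it from the weight directly.

```python
import math

def adam_l2_step(w, g, wd=0.01, lr=1e-3, eps=1e-8):
    # Adam-style L2: decay added to the gradient, then rescaled away
    g = g + wd * w
    m_hat, v_hat = g, g * g          # first step, after bias correction
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

def adamw_step(w, g, wd=0.01, lr=1e-3, eps=1e-8):
    # AdamW: decay applied to the weight, outside the rescaling
    m_hat, v_hat = g, g * g
    return w - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * w)

w_adam = adam_l2_step(2.0, 0.5)   # step is ~lr: the decay term got normalized away
w_adamw = adamw_step(2.0, 0.5)    # step is lr * (1 + wd * w): decay survives intact
```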

Which to use

| Optimizer | When |
| --- | --- |
| AdamW | Default for most deep learning |
| SGD + momentum | When you want to tune for best performance (vision) |
| Adam | Quick experiments, when AdamW isn't available |

Learning rate schedules

from torch.optim.lr_scheduler import CosineAnnealingLR, OneCycleLR

# Cosine decay over training; call scheduler.step() once per epoch
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)
# or warm up then anneal; OneCycleLR expects scheduler.step() once per batch
scheduler = OneCycleLR(optimizer, max_lr=1e-3, steps_per_epoch=len(loader), epochs=num_epochs)
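Cosine annealing follows a simple closed form; a plain-Python sketch of the schedule (function name and learning-rate values are illustrative):

```python
import math

def cosine_lr(t, t_max, lr_max=1e-3, lr_min=0.0):
    # Cosine annealing: start at lr_max, decay smoothly to lr_min over t_max steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / t_max))
```

At t = 0 this returns lr_max, at t = t_max it returns lr_min, and halfway through it sits at the midpoint — the slow-fast-slow decay that makes cosine schedules gentle at both ends.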