# Gradient Descent

## What

The core optimization loop that trains virtually every ML model:
Repeat until convergence:

1. Compute the loss on the current predictions
2. Compute the gradient of the loss with respect to each weight
3. Update: `weight = weight - learning_rate * gradient`
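The loop above can be sketched in a few lines of plain Python. The one-dimensional loss below is a toy quadratic chosen purely for illustration, so the gradient can be written by hand:

```python
# Gradient descent on a toy 1-D loss: loss(w) = (w - 3) ** 2.
# Its gradient is 2 * (w - 3), and the minimum sits at w = 3.
def gradient_descent(w, learning_rate=0.1, steps=100):
    for _ in range(steps):
        grad = 2 * (w - 3)            # step 2: gradient of loss w.r.t. w
        w = w - learning_rate * grad  # step 3: the update rule
    return w

print(gradient_descent(w=0.0))  # approaches 3.0, the minimizer
```

Each iteration moves `w` a small step downhill; with a real model, `grad` comes from backpropagation instead of a hand-derived formula.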
## Why it matters

This is how models learn. Every neural network, and most linear and logistic regression implementations in practice, are trained with some form of gradient descent.
## Key ideas

### Variants
| Variant | Batch size | Tradeoff |
|---|---|---|
| Batch GD | All data | Stable but slow, needs all data in memory |
| Stochastic GD (SGD) | 1 sample | Noisy but fast, can escape local minima |
| Mini-batch GD | 32-512 samples | Best of both — the standard in practice |
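The three variants differ only in how the data is sliced per update. A minimal shuffling batcher (the function name and sizes here are illustrative, not from any particular library) makes the spectrum concrete:

```python
import random

def minibatches(data, batch_size, seed=0):
    """Shuffle once per epoch, then yield consecutive slices as batches."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)  # reshuffling each epoch decorrelates batches
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

data = list(range(100))
sizes = [len(b) for b in minibatches(data, batch_size=32)]
print(sizes)  # [32, 32, 32, 4]
```

Setting `batch_size=len(data)` recovers batch GD, and `batch_size=1` recovers SGD; mini-batch GD is everything in between.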
### Learning rate
- Too high: overshoots, loss oscillates or diverges
- Too low: converges painfully slowly
- Learning rate schedules: start high, decay over time
- Adaptive methods (Adam, RMSprop): per-parameter learning rates
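The too-high/too-low behavior can be seen directly on the toy quadratic from before; the specific learning rates and decay constants below are illustrative choices, not prescriptions:

```python
def run(lr, steps=50, w=0.0):
    """Gradient descent on loss = (w - 3) ** 2 with a fixed learning rate."""
    for _ in range(steps):
        w = w - lr * 2 * (w - 3)
    return abs(w - 3)  # distance from the optimum after training

print(run(lr=0.1))   # tiny error: converges
print(run(lr=1.05))  # enormous error: each step overshoots and diverges

def decayed_lr(lr0=0.1, decay=0.95, step=0):
    """Exponential decay schedule: start high, shrink a little each step."""
    return lr0 * decay ** step
```

For this loss the update multiplies the error by `1 - 2 * lr` each step, so any `lr > 1` flips and grows the error, which is the divergence the first bullet describes.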
### Beyond vanilla
- Momentum: accumulate gradient history to smooth updates
- Adam: adaptive learning rate + momentum — the default optimizer
- Learning rate warmup: start low, ramp up, then decay
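As a sketch of the momentum idea (continuing the toy quadratic; the `beta` and `lr` values are illustrative), the optimizer keeps an exponentially decaying sum of past gradients and steps along that instead of the raw gradient:

```python
def momentum_gd(grad_fn, w, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with momentum on a 1-D parameter."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad_fn(w)  # accumulate gradient history
        w = w - lr * v             # step along the smoothed direction
    return w

# grad_fn is the gradient of loss = (w - 3) ** 2
w = momentum_gd(lambda w: 2 * (w - 3), w=0.0)
print(w)  # near 3.0, with characteristic overshoot along the way
```

Adam builds on the same velocity term but also rescales each parameter's step by a running estimate of its gradient magnitude, which is what "adaptive learning rate + momentum" means.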
## In code (conceptual)
```python
for epoch in range(num_epochs):
    for X_batch, y_batch in dataloader:
        predictions = model(X_batch)          # forward pass
        loss = loss_fn(predictions, y_batch)  # measure error
        loss.backward()        # backprop: compute gradients
        optimizer.step()       # update weights using the gradients
        optimizer.zero_grad()  # reset gradients before the next batch
```