Gradient Descent

What

The iterative optimization algorithm that trains most modern ML models:

repeat:
    1. Compute loss on current predictions
    2. Compute gradient of loss w.r.t. each weight
    3. Update: weight = weight - learning_rate × gradient
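The three steps above can be sketched in plain Python on 1-D linear regression (data and values here are illustrative; the model is y = w·x with mean squared error loss):

```python
# Toy data following y = 2x exactly
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0
learning_rate = 0.05

for step in range(200):
    # 1. Compute loss on current predictions
    preds = [w * x for x in xs]
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # 2. Compute gradient of loss w.r.t. the weight
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # 3. Update
    w = w - learning_rate * grad

print(round(w, 3))  # converges to 2.0
```

Each iteration moves w a small step downhill on the loss surface; with a well-chosen learning rate the error shrinks geometrically.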

Why it matters

This is how models learn. Neural networks are trained almost exclusively with some form of gradient descent, and linear and logistic regression can be fit the same way (linear regression also has a closed-form solution, but gradient descent scales better to large data).

Key ideas

Variants

Variant              | Batch size     | Tradeoff
Batch GD             | All data       | Stable but slow; needs the full dataset in memory
Stochastic GD (SGD)  | 1 sample       | Noisy but fast; the noise can help escape local minima
Mini-batch GD        | 32-512 samples | Best of both; the standard in practice
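The practical difference between the variants is just how the data is sliced per update. A minimal sketch with a toy dataset of 1000 items (sizes are illustrative):

```python
import random

data = list(range(1000))
batch_size = 64   # mini-batch GD; len(data) would be batch GD, 1 would be SGD

random.shuffle(data)                       # reshuffle each epoch in practice
batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

print(len(batches))  # 16 updates per epoch, vs 1 (batch) or 1000 (SGD)
```

More batches per epoch means more (noisier) weight updates per pass over the data.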

Learning rate

  • Too high: overshoots, loss oscillates or diverges
  • Too low: converges painfully slowly
  • Learning rate schedules: start high, decay over time
  • Adaptive methods (Adam, RMSprop): per-parameter learning rates
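The failure modes above are easy to see on the simplest possible loss, f(w) = w², whose gradient is 2w (learning rates chosen for illustration):

```python
def run(lr, steps=50, w=1.0):
    """Run gradient descent on f(w) = w**2 from w = 1.0."""
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

print(abs(run(0.1)))    # well-chosen: very close to the minimum at 0
print(abs(run(1.1)))    # too high: each step overshoots, |w| blows up
print(abs(run(0.001)))  # too low: after 50 steps, barely moved from 1.0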

Beyond vanilla

  • Momentum: accumulate gradient history to smooth updates
  • Adam: adaptive learning rate + momentum — the default optimizer
  • Learning rate warmup: start low, ramp up, then decay
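Momentum is a small change to vanilla gradient descent: keep a running "velocity" of past gradients and step along it. A minimal sketch on f(w) = w², with the common coefficient beta = 0.9 (all values illustrative):

```python
def grad(w):
    return 2 * w   # gradient of f(w) = w**2

w, v = 1.0, 0.0
lr, beta = 0.1, 0.9

for _ in range(100):
    v = beta * v + grad(w)   # accumulate gradient history
    w = w - lr * v           # step along the smoothed direction

# w has spiraled in toward the minimum at 0
```

Adam builds on the same idea but also tracks a running average of squared gradients to scale the step per parameter.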

In code (conceptual)

for epoch in range(num_epochs):
    for X_batch, y_batch in dataloader:
        predictions = model(X_batch)
        loss = loss_fn(predictions, y_batch)
        loss.backward()           # compute gradients
        optimizer.step()          # update weights
        optimizer.zero_grad()     # reset gradients
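The loop above assumes a framework (the names follow PyTorch conventions: model, loss_fn, optimizer, dataloader). The same epoch/mini-batch structure can be written without one; a runnable sketch for 1-D linear regression with hypothetical data following y = 3x:

```python
import random

def make_batches(pairs, batch_size):
    """Shuffle and slice the dataset, like a simple dataloader."""
    random.shuffle(pairs)
    return [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]

data = [(i / 100, 3 * i / 100) for i in range(100)]   # (x, y) with y = 3x
w, lr = 0.0, 0.1

for epoch in range(50):
    for batch in make_batches(data, batch_size=10):
        # "backward": gradient of MSE w.r.t. w on this mini-batch
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        # "step": update the weight; nothing to zero since grad is recomputed
        w -= lr * grad

print(round(w, 2))  # close to 3.0
```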