# Gradient Descent

## What

The core optimization loop that trains virtually every ML model:
Repeat until convergence:

1. Compute the loss on the current predictions
2. Compute the gradient of the loss with respect to each weight
3. Update: `weight = weight - learning_rate * gradient`
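The loop above can be sketched in a few lines of plain Python. The one-dimensional loss below is a toy quadratic chosen purely for illustration, so the gradient can be written by hand:

```python
# Gradient descent on a toy 1-D loss: loss(w) = (w - 3) ** 2.
# Its gradient is 2 * (w - 3), and the minimum sits at w = 3.
def gradient_descent(w, learning_rate=0.1, steps=100):
    for _ in range(steps):
        grad = 2 * (w - 3)            # step 2: gradient of loss w.r.t. w
        w = w - learning_rate * grad  # step 3: the update rule
    return w

print(gradient_descent(w=0.0))  # approaches 3.0, the minimizer
```

Each iteration moves `w` a small step downhill; with a real model, `grad` comes from backpropagation instead of a hand-derived formula.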
## Why it matters

This is how models learn. Every neural network, and most linear and logistic regression implementations in practice, are trained with some form of gradient descent.
## Key ideas

### Variants
| Variant | Batch size | Tradeoff |
|---|---|---|
| Batch GD | All data | Stable but slow, needs all data in memory |
| Stochastic GD (SGD) | 1 sample | Noisy but fast, can escape local minima |
| Mini-batch GD | 32-512 samples | Best of both — the standard in practice |
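The three variants differ only in how the data is sliced per update. A minimal shuffling batcher (the function name and sizes here are illustrative, not from any particular library) makes the spectrum concrete:

```python
import random

def minibatches(data, batch_size, seed=0):
    """Shuffle once per epoch, then yield consecutive slices as batches."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)  # reshuffling each epoch decorrelates batches
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

data = list(range(100))
sizes = [len(b) for b in minibatches(data, batch_size=32)]
print(sizes)  # [32, 32, 32, 4]
```

Setting `batch_size=len(data)` recovers batch GD, and `batch_size=1` recovers SGD; mini-batch GD is everything in between.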
### Learning rate
- Too high: overshoots, loss oscillates or diverges
- Too low: converges painfully slowly
- Learning rate schedules: start high, decay over time
- Adaptive methods (Adam, RMSprop): per-parameter learning rates
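The too-high/too-low behavior can be seen directly on the toy quadratic from before; the specific learning rates and decay constants below are illustrative choices, not prescriptions:

```python
def run(lr, steps=50, w=0.0):
    """Gradient descent on loss = (w - 3) ** 2 with a fixed learning rate."""
    for _ in range(steps):
        w = w - lr * 2 * (w - 3)
    return abs(w - 3)  # distance from the optimum after training

print(run(lr=0.1))   # tiny error: converges
print(run(lr=1.05))  # enormous error: each step overshoots and diverges

def decayed_lr(lr0=0.1, decay=0.95, step=0):
    """Exponential decay schedule: start high, shrink a little each step."""
    return lr0 * decay ** step
```

For this loss the update multiplies the error by `1 - 2 * lr` each step, so any `lr > 1` flips and grows the error, which is the divergence the first bullet describes.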
### Beyond vanilla
- Momentum: accumulate gradient history to smooth updates
- Adam: adaptive learning rate + momentum — the default optimizer
- Learning rate warmup: start low, ramp up, then decay
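As a sketch of the momentum idea (continuing the toy quadratic; the `beta` and `lr` values are illustrative), the optimizer keeps an exponentially decaying sum of past gradients and steps along that instead of the raw gradient:

```python
def momentum_gd(grad_fn, w, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with momentum on a 1-D parameter."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad_fn(w)  # accumulate gradient history
        w = w - lr * v             # step along the smoothed direction
    return w

# grad_fn is the gradient of loss = (w - 3) ** 2
w = momentum_gd(lambda w: 2 * (w - 3), w=0.0)
print(w)  # near 3.0, with characteristic overshoot along the way
```

Adam builds on the same velocity term but also rescales each parameter's step by a running estimate of its gradient magnitude, which is what "adaptive learning rate + momentum" means.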
## In code (conceptual)
```python
for epoch in range(num_epochs):
    for X_batch, y_batch in dataloader:
        predictions = model(X_batch)          # forward pass
        loss = loss_fn(predictions, y_batch)  # measure error
        loss.backward()        # backprop: compute gradients
        optimizer.step()       # update weights using the gradients
        optimizer.zero_grad()  # reset gradients before the next batch
```