Gradient Descent Variants

Goal: Implement SGD, Momentum, RMSProp, and Adam from scratch. Visualize how they navigate a loss surface differently.

Prerequisites: Gradient Descent, Optimizers, Gradient


The Optimization Problem

All optimizers do the same thing: update parameters to reduce loss. They differ in how they use the gradient.

We’ll test them on the Rosenbrock function — a classic optimization benchmark:

Minimum at (1, 1). The long, narrow valley around it makes it hard for naive methods.

import numpy as np
import matplotlib.pyplot as plt
 
def rosenbrock(x, y):
    return (1 - x)**2 + 100 * (y - x**2)**2
 
def rosenbrock_grad(x, y):
    dx = -2 * (1 - x) - 400 * x * (y - x**2)
    dy = 200 * (y - x**2)
    return np.array([dx, dy])
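
Before trusting any hand-derived gradient, it's worth checking it against finite differences. A minimal sketch of that check (the two Rosenbrock functions are repeated so the snippet runs standalone; the test point (0.5, -0.3) is arbitrary):

```python
import numpy as np

# Repeated from above so this check runs standalone.
def rosenbrock(x, y):
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(x, y):
    dx = -2 * (1 - x) - 400 * x * (y - x**2)
    dy = 200 * (y - x**2)
    return np.array([dx, dy])

def numeric_grad(f, x, y, h=1e-5):
    # Central differences: error is O(h^2).
    dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return np.array([dx, dy])

analytic = rosenbrock_grad(0.5, -0.3)
numeric = numeric_grad(rosenbrock, 0.5, -0.3)
assert np.allclose(analytic, numeric, rtol=1e-5), (analytic, numeric)
```

If the analytic and numeric gradients disagree, the sign error is almost always in the chain-rule term (here, the -400x factor from differentiating through x².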

1. Vanilla Gradient Descent

def gd(grad_fn, start, lr=0.001, steps=5000):
    path = [start.copy()]
    theta = start.copy()
    for _ in range(steps):
        g = grad_fn(*theta)
        theta -= lr * g
        path.append(theta.copy())
    return np.array(path)

Problem: fixed step size. Too small = slow. Too large = oscillation in narrow valleys.
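
The sensitivity is easiest to see in 1-D. A toy sketch on f(x) = x² (the lr values here are illustrative, not the ones used on Rosenbrock above):

```python
# On f(x) = x^2 the gradient is 2x, so each GD update multiplies x by
# (1 - 2*lr). Convergence requires |1 - 2*lr| < 1, i.e. 0 < lr < 1.
def gd_1d(lr, x=1.0, steps=20):
    for _ in range(steps):
        x -= lr * (2 * x)
    return x

print(abs(gd_1d(0.4)))  # contraction factor |1 - 0.8| = 0.2: converges fast
print(abs(gd_1d(1.1)))  # factor |1 - 2.2| = 1.2: diverges
```

On Rosenbrock the same effect appears anisotropically: the safe step size is set by the steep valley walls, which forces tiny steps along the flat valley floor.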


2. SGD with Momentum

Accumulate a velocity — like a ball rolling downhill:

def sgd_momentum(grad_fn, start, lr=0.001, beta=0.9, steps=5000):
    path = [start.copy()]
    theta = start.copy()
    v = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(*theta)
        v = beta * v + g
        theta -= lr * v
        path.append(theta.copy())
    return np.array(path)

Momentum smooths out oscillations and accelerates along consistent gradient directions.
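
A quick scalar check of the acceleration claim, using the same `v = beta * v + g` update as above (toy quadratic, illustrative values):

```python
# Plain GD vs. momentum on f(x) = x^2 (gradient 2x), same learning rate.
# The accumulated velocity lets momentum take larger effective steps
# along the consistently-downhill direction.
def run(lr=0.01, beta=0.9, steps=100):
    x_gd, x_mom, v = 1.0, 1.0, 0.0
    for _ in range(steps):
        x_gd -= lr * (2 * x_gd)
        v = beta * v + 2 * x_mom
        x_mom -= lr * v
    return abs(x_gd), abs(x_mom)

gd_dist, mom_dist = run()
print(gd_dist, mom_dist)  # momentum ends much closer to the minimum
```

In the limit of a constant gradient, the velocity converges to g / (1 - beta), so beta = 0.9 behaves like a roughly 10x larger effective step along stable directions.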


3. RMSProp

Adapt the learning rate per-parameter using a running average of squared gradients:

def rmsprop(grad_fn, start, lr=0.01, beta=0.9, eps=1e-8, steps=5000):
    path = [start.copy()]
    theta = start.copy()
    s = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(*theta)
        s = beta * s + (1 - beta) * g**2
        theta -= lr * g / (np.sqrt(s) + eps)
        path.append(theta.copy())
    return np.array(path)

Parameters with large gradients get smaller steps. Parameters with small gradients get larger steps. This handles the narrow valley problem.
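
A one-step sketch of that normalization: starting from s = 0, the first RMSProp update has magnitude lr / sqrt(1 - β) per coordinate, regardless of the gradient's scale (the gradient values here are illustrative):

```python
import numpy as np

# One RMSProp update from s = 0 on a gradient whose components differ
# by three orders of magnitude. The step sizes come out equal, because
# g / sqrt((1 - beta) * g**2) = sign(g) / sqrt(1 - beta).
lr, beta, eps = 0.01, 0.9, 1e-8
g = np.array([100.0, 0.1])
s = (1 - beta) * g**2
step = lr * g / (np.sqrt(s) + eps)
print(step)  # both components are approximately lr / sqrt(1 - beta)
```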


4. Adam

Combines momentum (1st moment) and RMSProp (2nd moment) with bias correction:

def adam(grad_fn, start, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    path = [start.copy()]
    theta = start.copy()
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(*theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)   # bias correction
        v_hat = v / (1 - beta2**t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
        path.append(theta.copy())
    return np.array(path)

Why bias correction?

Because m and v are initialized to zero, their early values underestimate the true moments: with a stationary gradient, E[m_t] = (1 − β₁ᵗ)·E[g]. Dividing by (1 − β₁ᵗ) corrects this. After many steps β₁ᵗ → 0 and the correction vanishes. The same argument applies to v with β₂.
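
A numeric check at t = 1, with β₁ = 0.9 as in the Adam code above:

```python
import numpy as np

# At t = 1 the raw first moment is m_1 = (1 - beta1) * g: only 10% of
# the gradient with beta1 = 0.9. Dividing by (1 - beta1**1) restores g.
beta1 = 0.9
g = np.array([2.0, -3.0])
m = (1 - beta1) * g          # biased toward zero: [0.2, -0.3]
m_hat = m / (1 - beta1**1)   # bias-corrected: recovers g exactly
print(m, m_hat)
```

Without the correction, Adam's first steps would be roughly 10x too small (and v's would be 1000x too small, inflating the step after the square root), so the correction matters most early in training.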


Compare All Four

start = np.array([-1.0, 1.0])
 
paths = {
    "GD":       gd(rosenbrock_grad, start, lr=0.001, steps=5000),
    "Momentum": sgd_momentum(rosenbrock_grad, start, lr=0.001, beta=0.9, steps=5000),
    "RMSProp":  rmsprop(rosenbrock_grad, start, lr=0.005, steps=5000),
    "Adam":     adam(rosenbrock_grad, start, lr=0.01, steps=5000),
}
 
# Plot
x = np.linspace(-2, 2, 400)
y = np.linspace(-1, 3, 400)
X, Y = np.meshgrid(x, y)
Z = rosenbrock(X, Y)
 
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for ax, (name, path) in zip(axes.ravel(), paths.items()):
    ax.contour(X, Y, Z, levels=np.logspace(-1, 3.5, 30), cmap="viridis", alpha=0.5)
    ax.plot(path[:, 0], path[:, 1], "r-", linewidth=0.8, alpha=0.7)
    ax.plot(path[0, 0], path[0, 1], "ro", markersize=8)
    ax.plot(1, 1, "g*", markersize=15)  # optimum
    ax.set_title(f"{name} ({len(path) - 1} steps)")
    ax.set_xlim(-2, 2); ax.set_ylim(-1, 3)
plt.tight_layout()
plt.show()

What you’ll see

  • GD: Slow, follows the valley walls
  • Momentum: Overshoots but recovers, reaches minimum faster
  • RMSProp: Adapts step size, handles the valley width difference
  • Adam: Fastest convergence, combines benefits of both

Loss Convergence Comparison

fig, ax = plt.subplots(figsize=(10, 5))
for name, path in paths.items():
    losses = [rosenbrock(p[0], p[1]) for p in path]
    ax.plot(losses, label=name, alpha=0.8)
ax.set_xlabel("Step"); ax.set_ylabel("f(x,y)")
ax.set_yscale("log")
ax.legend()
ax.set_title("Convergence comparison on Rosenbrock")
plt.show()

On a Neural Network

Apply these optimizers to handwritten-digit classification (scikit-learn's 8×8 digits dataset, a small MNIST-style benchmark):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
 
digits = load_digits()
X, y = digits.data, digits.target
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
 
n_classes = 10
y_train_oh = np.eye(n_classes)[y_train]
 
def relu(z):    return np.maximum(0, z)
def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
 
def train_with_optimizer(X, y_oh, optimizer="sgd", lr=0.01, epochs=200, hidden=64):
    """Train a 2-layer net with the specified optimizer."""
    d, h, c = X.shape[1], hidden, y_oh.shape[1]
    n = len(X)
 
    W1 = np.random.randn(d, h) * np.sqrt(2/d)
    b1 = np.zeros(h)
    W2 = np.random.randn(h, c) * np.sqrt(2/h)
    b2 = np.zeros(c)
 
    # Optimizer state
    params = [W1, b1, W2, b2]
    m = [np.zeros_like(p) for p in params]
    v = [np.zeros_like(p) for p in params]
 
    losses = []
    for epoch in range(1, epochs + 1):
        # Forward
        z1 = X @ W1 + b1
        a1 = relu(z1)
        z2 = a1 @ W2 + b2
        probs = softmax(z2)
 
        loss = -np.sum(y_oh * np.log(probs + 1e-15)) / n
        losses.append(loss)
 
        # Backward
        dz2 = (probs - y_oh) / n
        dW2 = a1.T @ dz2
        db2 = dz2.sum(axis=0)
        da1 = dz2 @ W2.T
        dz1 = da1 * (z1 > 0)
        dW1 = X.T @ dz1
        db1 = dz1.sum(axis=0)
 
        grads = [dW1, db1, dW2, db2]
 
        # Update
        for i in range(4):
            if optimizer == "sgd":
                params[i] -= lr * grads[i]
            elif optimizer == "momentum":
                m[i] = 0.9 * m[i] + grads[i]
                params[i] -= lr * m[i]
            elif optimizer == "adam":
                m[i] = 0.9 * m[i] + 0.1 * grads[i]
                v[i] = 0.999 * v[i] + 0.001 * grads[i]**2
                m_hat = m[i] / (1 - 0.9**epoch)
                v_hat = v[i] / (1 - 0.999**epoch)
                params[i] -= lr * m_hat / (np.sqrt(v_hat) + 1e-8)
 
        W1, b1, W2, b2 = params
 
    return losses, params
 
fig, ax = plt.subplots(figsize=(10, 5))
for opt in ["sgd", "momentum", "adam"]:
    losses, _ = train_with_optimizer(X_train, y_train_oh, optimizer=opt, lr=0.01)
    ax.plot(losses, label=opt)
ax.set_xlabel("Epoch"); ax.set_ylabel("Loss")
ax.legend(); ax.set_title("Optimizer comparison on the digits dataset")
plt.show()

When to Use What

Optimizer    | Best for                                   | Watch out for
-------------|--------------------------------------------|------------------------------------------------
SGD          | When you have time and want generalization | Slow, needs LR tuning
SGD+Momentum | Standard choice for CNNs                   | Overshooting
RMSProp      | RNNs, non-stationary problems              | Can diverge with a poorly tuned β
Adam         | Default for most tasks                     | May generalize worse than SGD on some problems
AdamW        | Transformers, modern DL                    | Adam + proper weight decay

Exercises

  1. Nesterov momentum: Look ahead before computing the gradient: v ← β·v + ∇f(θ − lr·β·v), then θ ← θ − lr·v. Implement it and compare convergence.

  2. Learning rate warmup: Start with lr = 0 and linearly increase to the target lr over the first 100 steps, then switch to cosine decay. Plot the schedule and test it on the digits run above.

  3. AdamW: Implement weight decay separately from the gradient: θ ← θ − lr·(m̂ / (√v̂ + ε) + λ·θ). Compare with adding L2 regularization to the loss.

  4. Saddle point test: Optimize f(x, y) = x² − y² (a saddle at the origin). Start slightly off the x-axis, e.g. at (1, 1e-4). Which optimizer escapes the saddle fastest?


Next: 07 - Backpropagation Step by Step — trace the math through a tiny network by hand.