Gradient Descent Variants
Goal: Implement SGD, Momentum, RMSProp, and Adam from scratch. Visualize how they navigate a loss surface differently.
Prerequisites: Gradient Descent, Optimizers, Gradient
The Optimization Problem
All optimizers do the same thing: update parameters to reduce loss. They differ in how they use the gradient.
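In pseudocode, every optimizer below fits the same template — only the update rule varies. A minimal sketch (the quadratic loss and `gd_update` here are illustrative, not from the sections below):

```python
import numpy as np

def optimize(grad_fn, theta0, update, steps=100):
    """Generic first-order loop: only `update` (the optimizer rule) differs."""
    theta = theta0.copy()
    state = {}  # optimizer-specific state (velocity, moment estimates, ...)
    for _ in range(steps):
        g = grad_fn(theta)
        theta = update(theta, g, state)
    return theta

# Plain gradient descent as one instance of the template
def gd_update(theta, g, state, lr=0.1):
    return theta - lr * g

# Minimize f(theta) = ||theta||^2, whose gradient is 2 * theta
theta = optimize(lambda t: 2 * t, np.array([3.0, -2.0]), gd_update)
print(theta)  # close to [0, 0]
```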
We’ll test them on the Rosenbrock function — a classic optimization benchmark:

$$f(x, y) = (1 - x)^2 + 100\,(y - x^2)^2$$

Minimum at $(1, 1)$. The long, narrow valley makes it hard for naive methods.
```python
import numpy as np
import matplotlib.pyplot as plt

def rosenbrock(x, y):
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(x, y):
    dx = -2 * (1 - x) - 400 * x * (y - x**2)
    dy = 200 * (y - x**2)
    return np.array([dx, dy])
```

1. Vanilla Gradient Descent
```python
def gd(grad_fn, start, lr=0.001, steps=5000):
    path = [start.copy()]
    theta = start.copy()
    for _ in range(steps):
        g = grad_fn(*theta)
        theta -= lr * g
        path.append(theta.copy())
    return np.array(path)
```

Problem: the step size is fixed. Too small = slow. Too large = oscillation in narrow valleys.
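The step-size tradeoff is easy to see on a 1-D quadratic (a toy stand-in for the valley's steep direction; `run_gd` here is illustrative, not part of the code above):

```python
def run_gd(lr, x0=1.0, steps=50):
    """Gradient descent on f(x) = x**2 (gradient 2x); returns final |x|."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return abs(x)

print(run_gd(lr=0.01))   # too small: still far from 0 after 50 steps
print(run_gd(lr=0.45))   # well-tuned: converges fast
print(run_gd(lr=1.05))   # too large: each step overshoots and |x| grows
```

The multiplicative factor per step is |1 − 2·lr|, so anything above lr = 1 diverges on this function.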
2. SGD with Momentum
Accumulate a velocity — like a ball rolling downhill:
```python
def sgd_momentum(grad_fn, start, lr=0.001, beta=0.9, steps=5000):
    path = [start.copy()]
    theta = start.copy()
    v = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(*theta)
        v = beta * v + g
        theta -= lr * v
        path.append(theta.copy())
    return np.array(path)
```

Momentum smooths out oscillations and accelerates along consistent gradient directions.
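One way to quantify the acceleration: under a constant gradient $g$, the velocity recursion $v = \beta v + g$ converges to $g / (1 - \beta)$, so $\beta = 0.9$ effectively multiplies the learning rate by 10 along persistent directions. A quick numeric check:

```python
import numpy as np

g = np.array([1.0])   # a constant gradient
beta = 0.9
v = np.zeros_like(g)
for _ in range(200):  # v = beta * v + g converges to g / (1 - beta)
    v = beta * v + g
print(v)              # ~ [10.]: ten times the raw gradient
```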
3. RMSProp
Adapt the learning rate per-parameter using a running average of squared gradients:
```python
def rmsprop(grad_fn, start, lr=0.01, beta=0.9, eps=1e-8, steps=5000):
    path = [start.copy()]
    theta = start.copy()
    s = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(*theta)
        s = beta * s + (1 - beta) * g**2
        theta -= lr * g / (np.sqrt(s) + eps)
        path.append(theta.copy())
    return np.array(path)
```

Parameters with large gradients get smaller steps. Parameters with small gradients get larger steps. This handles the narrow valley problem.
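Because each step is divided by the running root-mean-square gradient, the effective per-parameter step size ends up close to `lr` regardless of gradient scale. A quick check with two parameters whose constant gradients differ by 1000×:

```python
import numpy as np

g = np.array([1000.0, 1.0])   # wildly different gradient magnitudes
s = np.zeros_like(g)
beta, lr, eps = 0.9, 0.01, 1e-8
for _ in range(100):          # run the accumulator to (near) steady state: s -> g**2
    s = beta * s + (1 - beta) * g**2
step = lr * g / (np.sqrt(s) + eps)
print(step)                   # both entries ~ 0.01: steps are equalized
```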
4. Adam
Combines momentum (1st moment) and RMSProp (2nd moment) with bias correction:
```python
def adam(grad_fn, start, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    path = [start.copy()]
    theta = start.copy()
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(*theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)  # bias correction
        v_hat = v / (1 - beta2**t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
        path.append(theta.copy())
    return np.array(path)
```

Why bias correction? Both $m$ and $v$ start at zero, so early estimates are biased toward zero. Dividing by $1 - \beta_1^t$ (and $1 - \beta_2^t$ for $v$) corrects this. After many steps, $\beta^t \to 0$ and the correction vanishes.
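A one-step numeric check makes the bias concrete: with $\beta_1 = 0.9$, the raw average after the first gradient is only 10% of its true scale, and dividing by $1 - \beta_1^1 = 0.1$ restores it:

```python
beta1 = 0.9
g = 5.0                            # first observed gradient
m = 0.0                            # moment estimate starts at zero
m = beta1 * m + (1 - beta1) * g    # raw EMA after step 1: ~0.5, biased low
m_hat = m / (1 - beta1**1)         # bias-corrected: 5.0, the right scale
print(m, m_hat)
```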
Compare All Four
```python
start = np.array([-1.0, 1.0])
paths = {
    "GD": gd(rosenbrock_grad, start, lr=0.001, steps=5000),
    "Momentum": sgd_momentum(rosenbrock_grad, start, lr=0.001, beta=0.9, steps=5000),
    "RMSProp": rmsprop(rosenbrock_grad, start, lr=0.005, steps=5000),
    "Adam": adam(rosenbrock_grad, start, lr=0.01, steps=5000),
}
```
```python
# Plot each optimizer's path over the Rosenbrock contours
x = np.linspace(-2, 2, 400)
y = np.linspace(-1, 3, 400)
X, Y = np.meshgrid(x, y)
Z = rosenbrock(X, Y)

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for ax, (name, path) in zip(axes.ravel(), paths.items()):
    ax.contour(X, Y, Z, levels=np.logspace(-1, 3.5, 30), cmap="viridis", alpha=0.5)
    ax.plot(path[:, 0], path[:, 1], "r-", linewidth=0.8, alpha=0.7)
    ax.plot(path[0, 0], path[0, 1], "ro", markersize=8)
    ax.plot(1, 1, "g*", markersize=15)  # optimum at (1, 1)
    ax.set_title(f"{name} ({len(path)} steps)")
    ax.set_xlim(-2, 2); ax.set_ylim(-1, 3)
plt.tight_layout()
plt.show()
```

What you’ll see
- GD: Slow, follows the valley walls
- Momentum: Overshoots but recovers, reaches minimum faster
- RMSProp: Adapts step size, handles the valley width difference
- Adam: Fastest convergence, combines benefits of both
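To put a number on "fastest", compare the best loss each method reaches in the same step budget. This self-contained sketch re-implements GD and Adam compactly with the same formulas and hyperparameters as above:

```python
import numpy as np

def rosenbrock(x, y):
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(x, y):
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2),
                     200 * (y - x**2)])

def best_loss(step_fn, steps=5000):
    """Run an optimizer from (-1, 1); return the best loss seen along the path."""
    theta = np.array([-1.0, 1.0])
    state = {}
    best = rosenbrock(*theta)
    for t in range(1, steps + 1):
        theta = step_fn(theta, rosenbrock_grad(*theta), state, t)
        best = min(best, rosenbrock(*theta))
    return best

def gd_step(theta, g, state, t, lr=0.001):
    return theta - lr * g

def adam_step(theta, g, state, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = state.get("m", np.zeros_like(theta))
    v = state.get("v", np.zeros_like(theta))
    state["m"] = m = b1 * m + (1 - b1) * g
    state["v"] = v = b2 * v + (1 - b2) * g**2
    return theta - lr * (m / (1 - b1**t)) / (np.sqrt(v / (1 - b2**t)) + eps)

gd_best, adam_best = best_loss(gd_step), best_loss(adam_step)
print(gd_best, adam_best)  # Adam gets much closer to the optimum
```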
Loss Convergence Comparison
```python
fig, ax = plt.subplots(figsize=(10, 5))
for name, path in paths.items():
    losses = [rosenbrock(p[0], p[1]) for p in path]
    ax.plot(losses, label=name, alpha=0.8)
ax.set_xlabel("Step"); ax.set_ylabel("f(x, y)")
ax.set_yscale("log")
ax.legend()
ax.set_title("Convergence comparison on Rosenbrock")
plt.show()
```

On a Neural Network
Apply these optimizers to digit classification (scikit-learn's `load_digits`, a small MNIST-style dataset of 8×8 images):
```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X, y = digits.data, digits.target
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

n_classes = 10
y_train_oh = np.eye(n_classes)[y_train]  # one-hot encode labels

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # subtract row max for stability
    return e / e.sum(axis=1, keepdims=True)
```
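Subtracting the row max before exponentiating doesn't change the output (softmax is shift-invariant) but prevents overflow on large logits. A quick check, redefining the same `softmax` so it runs standalone:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # shift: same result, no overflow
    return e / e.sum(axis=1, keepdims=True)

z = np.array([[1000.0, 1001.0, 1002.0]])  # naive np.exp(1000) overflows to inf
p = softmax(z)
print(p)  # finite probabilities, identical to softmax([[0., 1., 2.]])
```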
```python
def train_with_optimizer(X, y_oh, optimizer="sgd", lr=0.01, epochs=200, hidden=64):
    """Train a 2-layer net (full-batch) with the specified optimizer."""
    d, h, c = X.shape[1], hidden, y_oh.shape[1]
    n = len(X)
    W1 = np.random.randn(d, h) * np.sqrt(2 / d)  # He initialization
    b1 = np.zeros(h)
    W2 = np.random.randn(h, c) * np.sqrt(2 / h)
    b2 = np.zeros(c)

    # Optimizer state: one moment slot per parameter
    params = [W1, b1, W2, b2]
    m = [np.zeros_like(p) for p in params]
    v = [np.zeros_like(p) for p in params]
    losses = []

    for epoch in range(1, epochs + 1):
        # Forward
        z1 = X @ W1 + b1
        a1 = relu(z1)
        z2 = a1 @ W2 + b2
        probs = softmax(z2)
        loss = -np.sum(y_oh * np.log(probs + 1e-15)) / n
        losses.append(loss)

        # Backward
        dz2 = (probs - y_oh) / n
        dW2 = a1.T @ dz2
        db2 = dz2.sum(axis=0)
        da1 = dz2 @ W2.T
        dz1 = da1 * (z1 > 0)
        dW1 = X.T @ dz1
        db1 = dz1.sum(axis=0)
        grads = [dW1, db1, dW2, db2]

        # Update in place, so W1..b2 stay in sync with the params list
        for i in range(4):
            if optimizer == "sgd":
                params[i] -= lr * grads[i]
            elif optimizer == "momentum":
                m[i] = 0.9 * m[i] + grads[i]
                params[i] -= lr * m[i]
            elif optimizer == "adam":
                m[i] = 0.9 * m[i] + 0.1 * grads[i]
                v[i] = 0.999 * v[i] + 0.001 * grads[i]**2
                m_hat = m[i] / (1 - 0.9**epoch)    # bias correction
                v_hat = v[i] / (1 - 0.999**epoch)
                params[i] -= lr * m_hat / (np.sqrt(v_hat) + 1e-8)

    W1, b1, W2, b2 = params
    return losses, params
```
```python
fig, ax = plt.subplots(figsize=(10, 5))
for opt in ["sgd", "momentum", "adam"]:
    losses, _ = train_with_optimizer(X_train, y_train_oh, optimizer=opt, lr=0.01)
    ax.plot(losses, label=opt)
ax.set_xlabel("Epoch"); ax.set_ylabel("Loss")
ax.legend(); ax.set_title("Optimizer comparison on digits")
plt.show()
```

When to Use What
| Optimizer | Best for | Watch out for |
|---|---|---|
| SGD | When you have time and want generalization | Slow, needs LR tuning |
| SGD+Momentum | Standard choice for CNNs | Overshooting |
| RMSProp | RNNs, non-stationary problems | Can diverge with a badly tuned learning rate |
| Adam | Default for most tasks | May generalize worse than SGD on some problems |
| AdamW | Transformers, modern DL | Adam + proper weight decay |
Exercises

- Nesterov momentum: Look ahead before computing the gradient: $v \leftarrow \beta v + \nabla f(\theta - lr \cdot \beta v)$, then $\theta \leftarrow \theta - lr \cdot v$. Implement it and compare convergence.
- Learning rate warmup: Start with $lr = 0$ and linearly increase to the target lr over the first 100 steps. Then cosine decay. Plot the schedule and test on the digits network.
- AdamW: Implement weight decay separately from the gradient: $\theta \leftarrow \theta - lr \, (\hat{m} / (\sqrt{\hat{v}} + \epsilon) + \lambda \theta)$. Compare with L2 regularization in the loss.
- Saddle point test: Optimize $f(x, y) = x^2 - y^2$ (a saddle at the origin). Start slightly off the saddle, e.g. at $(1, 0.001)$. Which optimizer escapes the saddle fastest?
Next: 07 - Backpropagation Step by Step — trace the math through a tiny network by hand.