Gradient
What
The gradient is the vector of all partial derivatives. For a function f(x₁, x₂, …, xₙ):
∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]
Why it matters
The gradient points in the direction of steepest increase. To minimize a loss function, go in the opposite direction: negative gradient.
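A minimal sketch of that idea in one dimension (the function f(x) = x², its gradient 2x, and the learning rate are illustrative choices, not from the text):

```python
import numpy as np

# Minimal gradient descent sketch: minimize f(x) = x^2.
# grad f(x) = 2x, so stepping against the gradient moves x toward 0.
def grad_f(x):
    return 2 * x

x = 5.0
lr = 0.1  # learning rate (step size); an assumed hyperparameter
for _ in range(100):
    x -= lr * grad_f(x)  # step in the negative gradient direction

print(abs(x) < 1e-6)  # True: x has converged near the minimum at 0
```

Each step multiplies x by (1 − 2·lr), so the iterate shrinks geometrically toward the minimum.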
Geometric intuition
Think of a loss surface as a hilly landscape. You’re standing at a point and want to go downhill as fast as possible. The gradient tells you which direction is steepest uphill. Walk the opposite way.
In 2D, the gradient is a 2-element vector — an arrow on the surface. In 1000D (a small neural net), it’s a 1000-element vector — same idea, just impossible to visualize. The math doesn’t care about the dimension.
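To make the "steepest uphill" claim concrete, here is a small check (the function f(x, y) = x² + 3y² and the probe point are illustrative): sample many unit directions and confirm that the local increase in f peaks along the gradient direction.

```python
import numpy as np

# f(x, y) = x^2 + 3*y^2, so grad f = (2x, 6y); at (1, 1) that is (2, 6).
f = lambda p: p[0] ** 2 + 3 * p[1] ** 2
p = np.array([1.0, 1.0])
grad = np.array([2 * p[0], 6 * p[1]])
unit_grad = grad / np.linalg.norm(grad)

# Probe 360 unit directions; f(p + eps*d) - f(p) should peak
# at the direction closest to the gradient.
eps = 1e-4
angles = np.linspace(0, 2 * np.pi, 360, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
increases = np.array([f(p + eps * d) - f(p) for d in dirs])
best_dir = dirs[np.argmax(increases)]

print(np.allclose(best_dir, unit_grad, atol=0.02))  # True
```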
Key ideas
- Direction: gradient points “uphill” — negate it to go downhill (minimize loss)
- Magnitude: how steep the slope is — larger gradient = faster change
- Zero gradient: you’re at a flat point (minimum, maximum, or saddle point)
- In ML: gradient of loss w.r.t. all model weights → one update step
- In higher dimensions, most critical points are saddle points, not minima — this matters for optimization
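The saddle-point idea from the list above can be seen in the classic example f(x, y) = x² − y² (an illustrative function, not from the text): the gradient vanishes at the origin, yet the origin is neither a minimum nor a maximum.

```python
import numpy as np

# f(x, y) = x^2 - y^2 has grad f = (2x, -2y), which is zero at the origin,
# but the origin is a saddle: f rises along x and falls along y.
f = lambda x, y: x ** 2 - y ** 2
grad = lambda x, y: np.array([2 * x, -2 * y])

print(np.allclose(grad(0.0, 0.0), 0.0))  # True: gradient is zero (critical point)
print(f(0.1, 0.0) > f(0.0, 0.0))         # True: uphill along the x-axis
print(f(0.0, 0.1) < f(0.0, 0.0))         # True: downhill along the y-axis
```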
Computing gradients
Analytical: derive the formula by hand (or let autograd do it). Exact and fast.
Numerical: approximate with finite differences. Slow but useful for checking.
import numpy as np
# L2 loss: L = sum((y - X @ w)^2)
# Analytical gradient: dL/dw = -2 * X.T @ (y - X @ w)
X = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
y = np.array([1, 2, 3], dtype=float)
w = np.array([0.1, 0.2], dtype=float)
residual = y - X @ w
grad_analytical = -2 * X.T @ residual
# Numerical gradient (finite differences) — for verification
eps = 1e-5
grad_numerical = np.zeros_like(w)
for i in range(len(w)):
    w_plus = w.copy();  w_plus[i] += eps
    w_minus = w.copy(); w_minus[i] -= eps
    loss_plus = np.sum((y - X @ w_plus) ** 2)
    loss_minus = np.sum((y - X @ w_minus) ** 2)
    grad_numerical[i] = (loss_plus - loss_minus) / (2 * eps)
# These should match (up to floating point noise)
print(np.allclose(grad_analytical, grad_numerical))  # True

Numerical gradients cost O(n) extra loss evaluations for n parameters — far too slow for real models. That's why we use Backpropagation (analytical gradients via the chain rule) in practice.
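A sketch of backprop's core idea on a tiny one-weight model (the model loss = (sigmoid(w·x) − y)² and the values of x, y, w are illustrative assumptions): build the gradient one chain-rule factor at a time, then verify against finite differences.

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

x, y, w = 2.0, 1.0, 0.3

# Forward pass
z = w * x
a = sigmoid(z)
loss = (a - y) ** 2

# Backward pass (chain rule, one factor per step):
# dL/da = 2(a - y);  da/dz = a(1 - a);  dz/dw = x
grad_w = 2 * (a - y) * a * (1 - a) * x

# Numerical check via central differences
eps = 1e-6
loss_plus = (sigmoid((w + eps) * x) - y) ** 2
loss_minus = (sigmoid((w - eps) * x) - y) ** 2
grad_numeric = (loss_plus - loss_minus) / (2 * eps)

print(np.isclose(grad_w, grad_numeric, atol=1e-6))  # True
```

The backward pass here touches each intermediate once, which is why analytical gradients cost roughly one extra pass regardless of the number of parameters.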