Gradient

What

The gradient is the vector of all partial derivatives. For a function f(x₁, x₂, …, xₙ):

∇f = [∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ]
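As a quick sketch with a toy function of my own (not from the text): f(x, y) = x² + 3y has partials ∂f/∂x = 2x and ∂f/∂y = 3, so the gradient is the vector [2x, 3].

```python
import numpy as np

# Toy example: f(x, y) = x^2 + 3y
# Partial derivatives: df/dx = 2x, df/dy = 3
# The gradient stacks them into one vector, evaluated at a point.
def grad_f(x, y):
    return np.array([2 * x, 3.0])

print(grad_f(1.0, 5.0))  # [2. 3.]
```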

Why it matters

The gradient points in the direction of steepest increase. To minimize a loss function, go in the opposite direction: negative gradient.
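A minimal gradient-descent sketch of that idea (the toy objective and learning rate here are my own assumptions, not from the text):

```python
# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
# The update w <- w - lr * grad steps opposite the gradient,
# so w moves toward the minimizer at w = 3.
w = 0.0
lr = 0.1  # learning rate (assumed for this sketch)
for _ in range(100):
    grad = 2 * (w - 3)
    w -= lr * grad
print(round(w, 4))  # 3.0
```

Each step shrinks the distance to the minimum by a constant factor here; on real loss surfaces the step size (learning rate) has to be tuned.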

Geometric intuition

Think of a loss surface as a hilly landscape. You’re standing at a point and want to go downhill as fast as possible. The gradient tells you which direction is steepest uphill. Walk the opposite way.

In 2D, the gradient is a 2-element vector — an arrow on the surface. In 1000D (a small neural net), it’s a 1000-element vector — same idea, just impossible to visualize. The math doesn’t care about the dimension.

Key ideas

  • Direction: gradient points “uphill” — negate it to go downhill (minimize loss)
  • Magnitude: how steep the slope is — larger gradient = faster change
  • Zero gradient: you’re at a flat point (minimum, maximum, or saddle point)
  • In ML: gradient of loss w.r.t. all model weights → one update step
  • In higher dimensions, most critical points are saddle points, not minima — this matters for optimization
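The last two bullets can be illustrated with the classic saddle f(x, y) = x² − y² (a toy example of my own): the gradient vanishes at the origin, yet the origin is neither a minimum nor a maximum.

```python
import numpy as np

# Saddle point sketch: f(x, y) = x^2 - y^2.
# Gradient [2x, -2y] is zero at (0, 0), but f increases along x
# and decreases along y, so (0, 0) is a saddle, not a minimum.
def f(x, y):
    return x**2 - y**2

def grad(x, y):
    return np.array([2 * x, -2 * y])

print(grad(0.0, 0.0))                  # [0. 0.] — a critical point
print(f(0.1, 0.0) > 0, f(0.0, 0.1) < 0)  # True True — up one way, down the other
```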

Computing gradients

Analytical: derive the formula by hand (or let autograd do it). Exact and fast.

Numerical: approximate with finite differences. Slow but useful for checking.

import numpy as np
 
# L2 loss: L = sum((y - X @ w)^2)
# Analytical gradient: dL/dw = -2 * X.T @ (y - X @ w)
 
X = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
y = np.array([1, 2, 3], dtype=float)
w = np.array([0.1, 0.2], dtype=float)
 
residual = y - X @ w
grad_analytical = -2 * X.T @ residual
 
# Numerical gradient (finite differences) — for verification
eps = 1e-5
grad_numerical = np.zeros_like(w)
for i in range(len(w)):
    w_plus = w.copy(); w_plus[i] += eps
    w_minus = w.copy(); w_minus[i] -= eps
    loss_plus = np.sum((y - X @ w_plus) ** 2)
    loss_minus = np.sum((y - X @ w_minus) ** 2)
    grad_numerical[i] = (loss_plus - loss_minus) / (2 * eps)
 
# These should match (up to floating point noise)
print(np.allclose(grad_analytical, grad_numerical))  # True

Numerical gradients cost O(n) forward passes for n parameters (2n with central differences) — way too slow for real models. That’s why we use backpropagation (analytical gradients via the chain rule) in practice.
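As a tiny sketch of the chain rule that backpropagation automates (my own toy composition, not a full implementation): for L(w) = (σ(wx) − y)², the derivative is a product of local derivatives, which we can check against a finite difference.

```python
import numpy as np

# Chain rule by hand for L(w) = (sigmoid(w * x) - y)^2:
#   dL/dw = 2 * (s - y) * s * (1 - s) * x, where s = sigmoid(w * x)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x, y, w = 2.0, 1.0, 0.5
s = sigmoid(w * x)
grad_chain = 2 * (s - y) * s * (1 - s) * x

# Verify against a central finite difference, as in the example above
eps = 1e-6
loss = lambda w: (sigmoid(w * x) - y) ** 2
grad_fd = (loss(w + eps) - loss(w - eps)) / (2 * eps)
print(np.isclose(grad_chain, grad_fd))  # True
```

Backpropagation applies exactly this factor-by-factor multiplication, layer by layer, reusing intermediate values so the whole gradient costs about one extra pass.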