Backpropagation
What
The algorithm that computes gradients in a neural network. It’s just the Chain Rule applied systematically from output to input.
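A concrete instance of that chain rule (an illustrative example; the function here is my own, not from the note): for f(x) = (3x)², differentiate the outer square, then multiply by the derivative of the inner 3x.

```python
# Chain rule on a tiny composite: f(x) = g(h(x)) with h(x) = 3x, g(u) = u**2
# Analytically: df/dx = g'(h(x)) * h'(x) = 2*(3*x) * 3 = 18*x
def f(x):
    return (3 * x) ** 2

x = 2.0
analytic = 18 * x  # 36.0

# Sanity-check against a central-difference numerical derivative
eps = 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
print(analytic, round(numeric, 3))  # both ≈ 36.0
```

Backprop does exactly this, layer by layer, through the whole network.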
How it works
Forward pass
Input flows through the network → compute prediction → compute loss.
Backward pass
- Compute ∂loss/∂output (how loss changes with output)
- Propagate backwards through each layer using the chain rule
- At each layer: ∂loss/∂weights = ∂loss/∂output × ∂output/∂weights
- Now every weight has a gradient → Gradient Descent updates them
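The steps above can be sketched for a single linear layer with MSE loss (a hedged sketch; the shapes and names are illustrative), checking the hand-derived gradient against what `loss.backward()` computes:

```python
import torch

# One linear layer, MSE loss: out = X @ W, loss = mean((out - y)**2)
torch.manual_seed(0)
X = torch.randn(5, 3)
y = torch.randn(5, 1)
W = torch.randn(3, 1, requires_grad=True)

out = X @ W
loss = ((out - y) ** 2).mean()
loss.backward()  # autograd's answer lands in W.grad

# By hand, via the chain rule:
#   d_loss/d_out = 2 * (out - y) / N   (N = number of elements in out)
#   d_loss/d_W   = X.T @ d_loss/d_out  (the d_out/d_W factor contributes X)
d_out = 2 * (out - y) / out.numel()
d_W = X.T @ d_out

print(torch.allclose(W.grad, d_W))  # True
```

The same pattern repeats at every layer: take the gradient arriving from above, multiply by the local derivative, pass the rest further back.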
In PyTorch
```python
import torch
import torch.nn as nn

# Toy setup so the snippet runs on its own: any model and data work
model = nn.Linear(4, 1)
loss_fn = nn.MSELoss()
X, y = torch.randn(16, 4), torch.randn(16, 1)

# Forward pass
predictions = model(X)
loss = loss_fn(predictions, y)

# Backward pass — computes all gradients automatically
loss.backward()

# Every parameter now has .grad
for name, param in model.named_parameters():
    print(f"{name}: grad shape = {param.grad.shape}")
```

PyTorch's autograd builds a computation graph during the forward pass and walks it backwards. You almost never implement backprop manually.
Key insight
Backprop is efficient because it reuses intermediate results. Computing all gradients takes only ~2-3x the cost of the forward pass, regardless of network size.
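One way to see the efficiency claim (an illustrative comparison, not from the note): estimating gradients by finite differences would cost roughly one extra forward pass per parameter, while a single backward pass yields every gradient at once.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
X, y = torch.randn(4, 10), torch.randn(4, 1)

n_params = sum(p.numel() for p in model.parameters())  # 10 weights + 1 bias = 11

# Finite differences: ~n_params extra forward passes to estimate all gradients.
# Backprop: one forward + one backward, regardless of n_params.
loss = loss_fn(model(X), y)
loss.backward()
grads = sum(p.grad.numel() for p in model.parameters())
print(n_params, grads)  # one backward pass produced all 11 gradients
```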
Links
- Chain Rule — the math
- Gradient Descent — uses the gradients backprop computes
- Vanishing and Exploding Gradients — when backprop breaks down
- PyTorch Essentials — autograd handles this for you