Backpropagation Step by Step
Goal: Manually compute forward and backward passes through a tiny network with actual numbers. Then verify with code. No mystery, no magic.
Prerequisites: Backpropagation, Chain Rule, Partial Derivatives, Neurons and Activation Functions
The Network
A minimal network: 2 inputs, 2 hidden neurons (ReLU), 1 output (sigmoid), MSE loss.
x1 ──┐        ┌── h1 ──┐
     ├── W1 ──┤        ├── W2 ── o ── Loss
x2 ──┘        └── h2 ──┘
Initial Values
Inputs: x1 = 0.5, x2 = 0.8
Target: t = 1.0
Weights W1 (input → hidden):
w11 = 0.1, w12 = 0.3
w21 = 0.2, w22 = 0.4
Hidden biases: b1 = 0.1, b2 = 0.1
Weights W2 (hidden → output):
v1 = 0.5, v2 = 0.6
Output bias: b_out = 0.1
Forward Pass (by hand)
Hidden layer pre-activation:
z1 = w11·x1 + w21·x2 + b1 = 0.1·0.5 + 0.2·0.8 + 0.1 = 0.31
z2 = w12·x1 + w22·x2 + b2 = 0.3·0.5 + 0.4·0.8 + 0.1 = 0.57

Hidden layer activation (ReLU):
h1 = max(0, z1) = 0.31
h2 = max(0, z2) = 0.57

Output pre-activation:
z_out = v1·h1 + v2·h2 + b_out = 0.5·0.31 + 0.6·0.57 + 0.1 = 0.597

Output activation (sigmoid):
o = σ(z_out) = 1 / (1 + e^(−0.597)) ≈ 0.6450

Loss (MSE):
L = 0.5·(t − o)² = 0.5·(1.0 − 0.6450)² ≈ 0.0630
Backward Pass (by hand)
Now we trace the gradient backward, layer by layer, using the chain rule.
Step 1: Loss → Output
dL/do = −(t − o) = −(1.0 − 0.6450) = −0.3550

Step 2: Output activation → Pre-activation
Sigmoid derivative: σ′(z_out) = o·(1 − o) = 0.6450·0.3550 ≈ 0.2290
Chain rule: dL/dz_out = dL/do · σ′(z_out) = −0.3550·0.2290 ≈ −0.0813
Let’s call this δ_o = dL/dz_out ≈ −0.0813.
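The sigmoid-derivative identity used in Step 2, σ′(z) = σ(z)·(1 − σ(z)), is easy to sanity-check with a central difference. A minimal sketch (separate from the verification script below):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = 0.597  # the output pre-activation from the forward pass
analytic = sigmoid(z) * (1 - sigmoid(z))                     # sigma(z) * (1 - sigma(z))
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
print(f"analytic = {analytic:.6f}, numeric = {numeric:.6f}")  # both ~ 0.2290
```

This is the same trick the numerical gradient check at the end of this article uses, applied to a single scalar function.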
Step 3: Output weights and bias
dL/dv1 = δ_o·h1 = −0.0813·0.31 ≈ −0.0252
dL/dv2 = δ_o·h2 = −0.0813·0.57 ≈ −0.0463
dL/db_out = δ_o ≈ −0.0813
Step 4: Hidden layer
The error propagates through W2:
dL/dh1 = δ_o·v1 = −0.0813·0.5 ≈ −0.0406
dL/dh2 = δ_o·v2 = −0.0813·0.6 ≈ −0.0488
ReLU derivative: 1 if z > 0, else 0. Both z1 = 0.31 > 0 and z2 = 0.57 > 0, so the gradient passes through unchanged:
δ_h1 = dL/dh1·1 ≈ −0.0406
δ_h2 = dL/dh2·1 ≈ −0.0488
Step 5: Input weights
dL/dw11 = x1·δ_h1 = 0.5·(−0.0406) ≈ −0.0203
dL/dw12 = x1·δ_h2 = 0.5·(−0.0488) ≈ −0.0244
dL/dw21 = x2·δ_h1 = 0.8·(−0.0406) ≈ −0.0325
dL/dw22 = x2·δ_h2 = 0.8·(−0.0488) ≈ −0.0390
dL/db1 = δ_h1 ≈ −0.0406, dL/db2 = δ_h2 ≈ −0.0488
Verify with Code
import numpy as np
# Setup
x = np.array([0.5, 0.8])
t = 1.0
W1 = np.array([[0.1, 0.3],
               [0.2, 0.4]])
b1 = np.array([0.1, 0.1])
W2 = np.array([0.5, 0.6])
b_out = 0.1
# Forward
z_hidden = x @ W1 + b1
h = np.maximum(0, z_hidden) # ReLU
z_out = h @ W2 + b_out
o = 1 / (1 + np.exp(-z_out)) # sigmoid
loss = 0.5 * (t - o) ** 2
print(f"z_hidden = {z_hidden}") # [0.31, 0.57]
print(f"h = {h}") # [0.31, 0.57]
print(f"z_out = {z_out:.4f}") # 0.5970
print(f"output = {o:.4f}") # 0.6450
print(f"loss = {loss:.4f}") # 0.0630
# Backward
dL_do = -(t - o)
do_dz = o * (1 - o)
delta_o = dL_do * do_dz
dL_dW2 = delta_o * h
dL_db_out = delta_o
dL_dh = delta_o * W2
relu_mask = (z_hidden > 0).astype(float)
delta_h = dL_dh * relu_mask
dL_dW1 = np.outer(x, delta_h)
dL_db1 = delta_h
print(f"\ndL/dW2 = {dL_dW2}")
print(f"dL/db_out = {dL_db_out:.4f}")
print(f"dL/dW1 =\n{dL_dW1}")
print(f"dL/db1 = {dL_db1}")

Check that these match the manual computations above (they will, up to rounding).
Numerical Gradient Check
The ultimate verification: perturb each weight by ±ε (the code uses ε = 1e-5) and measure the change in loss:
def numerical_gradient(param_name, idx, eps=1e-5):
    """Compute the numerical gradient for a specific weight via central difference."""
    # b_out is a plain float, so copy only the array-valued parameters
    params = {"W1": W1.copy(), "b1": b1.copy(), "W2": W2.copy(), "b_out": b_out}

    def forward(p):
        z_h = x @ p["W1"] + p["b1"]
        h = np.maximum(0, z_h)
        z_o = h @ p["W2"] + p["b_out"]
        o = 1 / (1 + np.exp(-z_o))
        return 0.5 * (t - o) ** 2

    # Perturb +eps
    p_plus = {k: (v.copy() if isinstance(v, np.ndarray) else v) for k, v in params.items()}
    if isinstance(p_plus[param_name], np.ndarray):
        p_plus[param_name][idx] += eps   # idx may be an int (vector) or a tuple (matrix)
    else:
        p_plus[param_name] += eps        # scalar parameter (b_out)
    loss_plus = forward(p_plus)

    # Perturb -eps
    p_minus = {k: (v.copy() if isinstance(v, np.ndarray) else v) for k, v in params.items()}
    if isinstance(p_minus[param_name], np.ndarray):
        p_minus[param_name][idx] -= eps
    else:
        p_minus[param_name] -= eps
    loss_minus = forward(p_minus)

    return (loss_plus - loss_minus) / (2 * eps)
# Check all gradients
print("Gradient verification (analytical vs numerical):")
print(f"dL/dW1[0,0]: analytical={dL_dW1[0,0]:.6f}, numerical={numerical_gradient('W1', (0,0)):.6f}")
print(f"dL/dW1[1,1]: analytical={dL_dW1[1,1]:.6f}, numerical={numerical_gradient('W1', (1,1)):.6f}")
print(f"dL/dW2[0]: analytical={dL_dW2[0]:.6f}, numerical={numerical_gradient('W2', 0):.6f}")
print(f"dL/dW2[1]: analytical={dL_dW2[1]:.6f}, numerical={numerical_gradient('W2', 1):.6f}")

If analytical and numerical gradients agree to 5+ decimal places, your backprop is correct.
The Chain Rule Pattern
Every backward pass follows the same pattern:
δ_current = δ_upstream × local_gradient
Weight gradient = δ_current × input_to_this_layer
Bias gradient = δ_current
Input gradient = δ_current × weights (for passing to previous layer)
This is why backprop is efficient — it reuses intermediate computations. Computing all gradients is only ~2x the cost of a forward pass.
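The four-line pattern above can be packaged as a single function. This is an illustrative sketch, not code from the verification script, and the names (`layer_backward`, `delta_up`, `local_grad`) are hypothetical:

```python
import numpy as np

def layer_backward(delta_up, local_grad, x, W):
    """Backward step for one dense layer, following the pattern above.

    delta_up   -- gradient arriving from the layer after this one
    local_grad -- activation derivative at this layer's pre-activation
    x          -- input this layer received on the forward pass
    W          -- this layer's weight matrix (one column per neuron)
    """
    delta = delta_up * local_grad  # delta_current = delta_upstream * local gradient
    dW = np.outer(x, delta)        # weight gradient = delta_current x input
    db = delta                     # bias gradient = delta_current
    dx = W @ delta                 # input gradient, passed to the previous layer
    return dW, db, dx
```

For the hidden layer of this network, `delta_up` would be `dL_dh`, `local_grad` the ReLU mask, `x` the input vector, and `W` would be `W1`, reproducing `dL_dW1` and `dL_db1` from the verification code.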
One Update Step
lr = 0.5
W1_new = W1 - lr * dL_dW1
b1_new = b1 - lr * dL_db1
W2_new = W2 - lr * dL_dW2
b_out_new = b_out - lr * dL_db_out
# Verify loss decreased
z_h = x @ W1_new + b1_new
h = np.maximum(0, z_h)
z_o = h @ W2_new + b_out_new
o_new = 1 / (1 + np.exp(-z_o))
loss_new = 0.5 * (t - o_new) ** 2
print(f"Before: output={o:.4f}, loss={loss:.4f}")
print(f"After: output={o_new:.4f}, loss={loss_new:.4f}")
print(f"Loss decreased: {loss_new < loss}")

Exercises
- Dead ReLU: Change z1 to be negative (for example, by making b1 strongly negative). Trace the backward pass. What happens to the gradients of the weights feeding h1?
- Sigmoid everywhere: Replace ReLU with sigmoid in the hidden layer. Recompute all gradients. Are the hidden gradients larger or smaller? (This is the vanishing gradient problem.)
- Deeper network: Add a second hidden layer with 2 neurons. Trace the full backward pass. Count how many multiplications the gradient goes through from loss to the first layer.
- Autograd verification: Implement the same network in PyTorch with requires_grad=True. Compare tensor.grad with your manual results.
Next: 08 - Attention Mechanism from Scratch — the key building block of transformers.