Backpropagation Step by Step

Goal: Manually compute forward and backward passes through a tiny network with actual numbers. Then verify with code. No mystery, no magic.

Prerequisites: Backpropagation, Chain Rule, Partial Derivatives, Neurons and Activation Functions


The Network

A minimal network: 2 inputs, 2 hidden neurons (ReLU), 1 output (sigmoid), MSE loss.

x1 ──┐       ┌── h1 ──┐
     ├─ W1 ─┤         ├─ W2 ─── o ─── Loss
x2 ──┘       └── h2 ──┘

Initial Values

Inputs:  x1 = 0.5,  x2 = 0.8
Target:  t = 1.0

Weights W1 (input → hidden):
  w11 = 0.1,  w12 = 0.3
  w21 = 0.2,  w22 = 0.4

Biases b1: b1 = 0.1, b2 = 0.1

Weights W2 (hidden → output):
  v1 = 0.5,  v2 = 0.6

Output bias: b_out = 0.1

Forward Pass (by hand)

Hidden layer pre-activation

  z1 = x1·w11 + x2·w21 + b1 = 0.5·0.1 + 0.8·0.2 + 0.1 = 0.31
  z2 = x1·w12 + x2·w22 + b2 = 0.5·0.3 + 0.8·0.4 + 0.1 = 0.57

Hidden layer activation (ReLU)

  h1 = max(0, 0.31) = 0.31
  h2 = max(0, 0.57) = 0.57

Output pre-activation

  z_out = h1·v1 + h2·v2 + b_out = 0.31·0.5 + 0.57·0.6 + 0.1 = 0.597

Output activation (sigmoid)

  o = σ(z_out) = 1 / (1 + e^(−0.597)) ≈ 0.6449

Loss (MSE)

  L = ½(t − o)² = ½(1 − 0.6449)² ≈ 0.0631


Backward Pass (by hand)

Now we trace the gradient backward, layer by layer, using the chain rule.

Step 1: Loss → Output

  ∂L/∂o = −(t − o) = −(1 − 0.6449) = −0.3551

Step 2: Output activation → Pre-activation

Sigmoid derivative:

  ∂o/∂z_out = o(1 − o) = 0.6449 · 0.3551 ≈ 0.2290

Chain rule:

  ∂L/∂z_out = ∂L/∂o · ∂o/∂z_out = −0.3551 · 0.2290 ≈ −0.0813

Let’s call this δ_o = −0.0813.

Step 3: Output weights and bias

  ∂L/∂v1    = δ_o · h1 = −0.0813 · 0.31 ≈ −0.0252
  ∂L/∂v2    = δ_o · h2 = −0.0813 · 0.57 ≈ −0.0464
  ∂L/∂b_out = δ_o ≈ −0.0813

Step 4: Hidden layer

The error propagates backward through W2:

  ∂L/∂h1 = δ_o · v1 = −0.0813 · 0.5 ≈ −0.0407
  ∂L/∂h2 = δ_o · v2 = −0.0813 · 0.6 ≈ −0.0488

ReLU derivative: 1 if z > 0, else 0. Both z1 = 0.31 > 0 and z2 = 0.57 > 0, so the gradients pass through unchanged:

  δ_h1 = −0.0407
  δ_h2 = −0.0488
Step 5: Input weights and biases

  ∂L/∂w11 = δ_h1 · x1 = −0.0407 · 0.5 ≈ −0.0203
  ∂L/∂w12 = δ_h2 · x1 = −0.0488 · 0.5 ≈ −0.0244
  ∂L/∂w21 = δ_h1 · x2 = −0.0407 · 0.8 ≈ −0.0325
  ∂L/∂w22 = δ_h2 · x2 = −0.0488 · 0.8 ≈ −0.0390
  ∂L/∂b1  = [δ_h1, δ_h2] = [−0.0407, −0.0488]


Verify with Code

import numpy as np
 
# Setup
x = np.array([0.5, 0.8])
t = 1.0
 
W1 = np.array([[0.1, 0.3],
               [0.2, 0.4]])
b1 = np.array([0.1, 0.1])
W2 = np.array([0.5, 0.6])
b_out = 0.1
 
# Forward
z_hidden = x @ W1 + b1
h = np.maximum(0, z_hidden)  # ReLU
z_out = h @ W2 + b_out
o = 1 / (1 + np.exp(-z_out))  # sigmoid
loss = 0.5 * (t - o) ** 2
 
print(f"z_hidden = {z_hidden}")   # [0.31, 0.57]
print(f"h        = {h}")          # [0.31, 0.57]
print(f"z_out    = {z_out:.4f}")  # 0.597
print(f"output   = {o:.4f}")      # 0.6449
print(f"loss     = {loss:.4f}")   # 0.0631
 
# Backward
dL_do = -(t - o)
do_dz = o * (1 - o)
delta_o = dL_do * do_dz
 
dL_dW2 = delta_o * h
dL_db_out = delta_o
 
dL_dh = delta_o * W2
relu_mask = (z_hidden > 0).astype(float)
delta_h = dL_dh * relu_mask
 
dL_dW1 = np.outer(x, delta_h)
dL_db1 = delta_h
 
print(f"\ndL/dW2    = {dL_dW2}")
print(f"dL/db_out = {dL_db_out:.4f}")
print(f"dL/dW1    =\n{dL_dW1}")
print(f"dL/db1    = {dL_db1}")

Check that these match the manual computations above (they will, up to rounding).


Numerical Gradient Check

The ultimate verification: perturb each weight by a small ε (here ε = 1e-5) and measure the change in loss:

def numerical_gradient(param_name, idx=None, eps=1e-5):
    """Centered-difference gradient for a single parameter entry."""
    # b_out is stored as a 0-d array so every value supports .copy()
    params = {"W1": W1.copy(), "b1": b1.copy(),
              "W2": W2.copy(), "b_out": np.array(b_out)}
 
    def forward(p):
        z_h = x @ p["W1"] + p["b1"]
        h = np.maximum(0, z_h)
        z_o = h @ p["W2"] + p["b_out"]
        o = 1 / (1 + np.exp(-z_o))
        return float(0.5 * (t - o) ** 2)
 
    def loss_at(sign):
        p = {k: v.copy() for k, v in params.items()}
        if idx is None:
            p[param_name] += sign * eps       # scalar parameter (b_out)
        else:
            p[param_name][idx] += sign * eps  # int or tuple index, one entry only
        return forward(p)
 
    return (loss_at(+1) - loss_at(-1)) / (2 * eps)
 
# Check all gradients
print("Gradient verification (analytical vs numerical):")
print(f"dL/dW1[0,0]: analytical={dL_dW1[0,0]:.6f}, numerical={numerical_gradient('W1', (0,0)):.6f}")
print(f"dL/dW1[1,1]: analytical={dL_dW1[1,1]:.6f}, numerical={numerical_gradient('W1', (1,1)):.6f}")
print(f"dL/dW2[0]:   analytical={dL_dW2[0]:.6f}, numerical={numerical_gradient('W2', 0):.6f}")
print(f"dL/dW2[1]:   analytical={dL_dW2[1]:.6f}, numerical={numerical_gradient('W2', 1):.6f}")

If analytical and numerical gradients agree to 5+ decimal places, your backprop is correct.
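“Agree to 5+ decimal places” can be made precise with a relative-error check. The helper below is a sketch (the thresholds in the comment are common rules of thumb, not from this text):

```python
def relative_error(analytical, numerical, eps=1e-12):
    """Relative error between an analytical and a numerical gradient.

    With centered differences, values around 1e-7 or smaller suggest
    the analytical gradient is correct; 1e-2 or larger signals a bug.
    The eps term guards against division by zero when both are ~0.
    """
    a, n = float(analytical), float(numerical)
    return abs(a - n) / max(abs(a), abs(n), eps)

# Example: two estimates that agree to ~5 decimal places
print(relative_error(-0.020330, -0.020331))
```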


The Chain Rule Pattern

Every backward pass follows the same pattern:

δ_current = δ_upstream × local_gradient

Weight gradient  = δ_current × input_to_this_layer
Bias gradient    = δ_current
Input gradient   = δ_current × weights  (for passing to previous layer)

This is why backprop is efficient — it reuses intermediate computations. Computing all gradients is only ~2x the cost of a forward pass.
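The pattern can be packaged as one reusable helper. This is an illustrative sketch (the name `backward_layer` and its signature are assumptions, not from the text); as a sanity check, it reproduces the Step 4–5 gradients of the worked example:

```python
import numpy as np

def backward_layer(delta_upstream, local_grad, a_in, W):
    """Generic backprop step for a dense layer (x @ W + b convention).

    delta_upstream: error arriving from the layer above
    local_grad:     activation derivative at this layer's pre-activation
    a_in:           input this layer received on the forward pass
    W:              this layer's weights, shape (in_dim, out_dim)
    """
    delta = delta_upstream * local_grad  # delta_current
    dW = np.outer(a_in, delta)           # weight gradient
    db = delta                           # bias gradient
    d_a_in = W @ delta                   # gradient passed to the previous layer
    return dW, db, d_a_in

# Reproduce Steps 4-5: error reaching the hidden layer, then the W1 gradients
delta_o = -0.0813                       # from Step 2 (rounded)
dL_dh = delta_o * np.array([0.5, 0.6])  # delta_o * W2
relu_grad = np.array([1.0, 1.0])        # both z_hidden > 0
dW1, db1, _ = backward_layer(dL_dh, relu_grad,
                             np.array([0.5, 0.8]),                # x
                             np.array([[0.1, 0.3], [0.2, 0.4]]))  # W1
print(dW1)  # matches dL/dW1 from Step 5 up to rounding
```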


One Update Step

lr = 0.5
 
W1_new = W1 - lr * dL_dW1
b1_new = b1 - lr * dL_db1
W2_new = W2 - lr * dL_dW2
b_out_new = b_out - lr * dL_db_out
 
# Verify loss decreased
z_h = x @ W1_new + b1_new
h = np.maximum(0, z_h)
z_o = h @ W2_new + b_out_new
o_new = 1 / (1 + np.exp(-z_o))
loss_new = 0.5 * (t - o_new) ** 2
 
print(f"Before: output={o:.4f}, loss={loss:.4f}")
print(f"After:  output={o_new:.4f}, loss={loss_new:.4f}")
print(f"Loss decreased: {loss_new < loss}")
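Repeating the update in a loop shows gradient descent at work. A minimal sketch that re-runs the same forward and backward formulas for 20 steps, starting from the same network and initial values as above:

```python
import numpy as np

x = np.array([0.5, 0.8]); t = 1.0
W1 = np.array([[0.1, 0.3], [0.2, 0.4]]); b1 = np.array([0.1, 0.1])
W2 = np.array([0.5, 0.6]); b_out = 0.1
lr = 0.5

for step in range(20):
    # Forward pass
    z_h = x @ W1 + b1
    h = np.maximum(0, z_h)
    z_o = h @ W2 + b_out
    o = 1 / (1 + np.exp(-z_o))
    loss = 0.5 * (t - o) ** 2
    # Backward pass (same formulas as the verification code)
    delta_o = -(t - o) * o * (1 - o)
    delta_h = delta_o * W2 * (z_h > 0)
    # Gradient descent update
    W2 = W2 - lr * delta_o * h
    b_out = b_out - lr * delta_o
    W1 = W1 - lr * np.outer(x, delta_h)
    b1 = b1 - lr * delta_h

print(f"output = {o:.4f}, loss = {loss:.6f}")  # output climbs toward t = 1.0
```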

Exercises

  1. Dead ReLU: Make z1 negative (for example, set b1 = −0.5 so z1 = 0.21 − 0.5 < 0). Trace the backward pass. What happens to the gradients of w11, w21, and b1?

  2. Sigmoid everywhere: Replace ReLU with sigmoid in the hidden layer. Recompute all gradients. Are the hidden gradients larger or smaller? (This is the vanishing gradient problem.)

  3. Deeper network: Add a second hidden layer with 2 neurons. Trace the full backward pass. Count how many multiplications the gradient goes through from loss to the first layer.

  4. Autograd verification: Implement the same network in PyTorch with requires_grad=True. Compare tensor.grad with your manual results.


Next: 08 - Attention Mechanism from Scratch — the key building block of transformers.