Backpropagation Step by Step

Goal: Manually compute forward and backward passes through a tiny network with actual numbers. Then verify with code. No mystery, no magic.

Prerequisites: Backpropagation, Chain Rule, Partial Derivatives, Neurons and Activation Functions


The Network

A minimal network: 2 inputs, 2 hidden neurons (ReLU), 1 output (sigmoid), MSE loss.

x1 ──┐       ┌── h1 ──┐
     ├─ W1 ─┤         ├─ W2 ─── o ─── Loss
x2 ──┘       └── h2 ──┘

Initial Values

Inputs:  x1 = 0.5,  x2 = 0.8
Target:  t = 1.0

Weights W1 (input → hidden):
  w11 = 0.1,  w12 = 0.3
  w21 = 0.2,  w22 = 0.4

Biases b1: b1 = 0.1, b2 = 0.1

Weights W2 (hidden → output):
  v1 = 0.5,  v2 = 0.6

Output bias: b_out = 0.1

Forward Pass (by hand)

Hidden layer pre-activation

  z1 = x1·w11 + x2·w21 + b1 = 0.5·0.1 + 0.8·0.2 + 0.1 = 0.31
  z2 = x1·w12 + x2·w22 + b2 = 0.5·0.3 + 0.8·0.4 + 0.1 = 0.57

Hidden layer activation (ReLU)

  h1 = max(0, 0.31) = 0.31
  h2 = max(0, 0.57) = 0.57

Output pre-activation

  z_out = h1·v1 + h2·v2 + b_out = 0.31·0.5 + 0.57·0.6 + 0.1 = 0.597

Output activation (sigmoid)

  o = σ(z_out) = 1 / (1 + e^(−0.597)) ≈ 0.6449

Loss (MSE)

  L = ½(t − o)² = ½(1 − 0.6449)² ≈ 0.0631


Backward Pass (by hand)

Now we trace the gradient backward, layer by layer, using the chain rule.

Step 1: Loss → Output

  ∂L/∂o = −(t − o) = −(1 − 0.6449) = −0.3551

Step 2: Output activation → Pre-activation

Sigmoid derivative:

  ∂o/∂z_out = o(1 − o) = 0.6449 · 0.3551 ≈ 0.2290

Chain rule:

  ∂L/∂z_out = ∂L/∂o · ∂o/∂z_out = −0.3551 · 0.2290 ≈ −0.0813

Let’s call this δ_o = −0.0813.

Step 3: Output weights and bias

  ∂L/∂v1    = δ_o · h1 = −0.0813 · 0.31 ≈ −0.0252
  ∂L/∂v2    = δ_o · h2 = −0.0813 · 0.57 ≈ −0.0464
  ∂L/∂b_out = δ_o ≈ −0.0813

Step 4: Hidden layer

The error propagates backward through W2:

  ∂L/∂h1 = δ_o · v1 = −0.0813 · 0.5 ≈ −0.0407
  ∂L/∂h2 = δ_o · v2 = −0.0813 · 0.6 ≈ −0.0488

ReLU derivative: 1 if z > 0, else 0. Both z1 = 0.31 > 0 and z2 = 0.57 > 0, so the gradients pass through unchanged:

  δ_h1 = −0.0407
  δ_h2 = −0.0488
Step 5: Input weights and biases

  ∂L/∂w11 = δ_h1 · x1 = −0.0407 · 0.5 ≈ −0.0203
  ∂L/∂w12 = δ_h2 · x1 = −0.0488 · 0.5 ≈ −0.0244
  ∂L/∂w21 = δ_h1 · x2 = −0.0407 · 0.8 ≈ −0.0325
  ∂L/∂w22 = δ_h2 · x2 = −0.0488 · 0.8 ≈ −0.0390
  ∂L/∂b1  = [δ_h1, δ_h2] = [−0.0407, −0.0488]


Verify with Code

import numpy as np
 
# Setup
x = np.array([0.5, 0.8])
t = 1.0
 
W1 = np.array([[0.1, 0.3],
               [0.2, 0.4]])
b1 = np.array([0.1, 0.1])
W2 = np.array([0.5, 0.6])
b_out = 0.1
 
# Forward
z_hidden = x @ W1 + b1
h = np.maximum(0, z_hidden)  # ReLU
z_out = h @ W2 + b_out
o = 1 / (1 + np.exp(-z_out))  # sigmoid
loss = 0.5 * (t - o) ** 2
 
print(f"z_hidden = {z_hidden}")   # [0.31, 0.57]
print(f"h        = {h}")          # [0.31, 0.57]
print(f"z_out    = {z_out:.4f}")  # 0.597
print(f"output   = {o:.4f}")      # 0.6449
print(f"loss     = {loss:.4f}")   # 0.0631
 
# Backward
dL_do = -(t - o)
do_dz = o * (1 - o)
delta_o = dL_do * do_dz
 
dL_dW2 = delta_o * h
dL_db_out = delta_o
 
dL_dh = delta_o * W2
relu_mask = (z_hidden > 0).astype(float)
delta_h = dL_dh * relu_mask
 
dL_dW1 = np.outer(x, delta_h)
dL_db1 = delta_h
 
print(f"\ndL/dW2    = {dL_dW2}")
print(f"dL/db_out = {dL_db_out:.4f}")
print(f"dL/dW1    =\n{dL_dW1}")
print(f"dL/db1    = {dL_db1}")

Check that these match the manual computations above (they will, up to rounding).


Numerical Gradient Check

The ultimate verification: perturb each weight by a small ε (here ε = 1e-5) and measure the change in loss:

def numerical_gradient(param_name, idx=None, eps=1e-5):
    """Centered-difference gradient for a single parameter entry."""
    # b_out is stored as a 0-d array so every value supports .copy()
    params = {"W1": W1.copy(), "b1": b1.copy(),
              "W2": W2.copy(), "b_out": np.array(b_out)}
 
    def forward(p):
        z_h = x @ p["W1"] + p["b1"]
        h = np.maximum(0, z_h)
        z_o = h @ p["W2"] + p["b_out"]
        o = 1 / (1 + np.exp(-z_o))
        return float(0.5 * (t - o) ** 2)
 
    def loss_at(sign):
        p = {k: v.copy() for k, v in params.items()}
        if idx is None:
            p[param_name] += sign * eps       # scalar parameter (b_out)
        else:
            p[param_name][idx] += sign * eps  # int or tuple index, one entry only
        return forward(p)
 
    return (loss_at(+1) - loss_at(-1)) / (2 * eps)
 
# Check all gradients
print("Gradient verification (analytical vs numerical):")
print(f"dL/dW1[0,0]: analytical={dL_dW1[0,0]:.6f}, numerical={numerical_gradient('W1', (0,0)):.6f}")
print(f"dL/dW1[1,1]: analytical={dL_dW1[1,1]:.6f}, numerical={numerical_gradient('W1', (1,1)):.6f}")
print(f"dL/dW2[0]:   analytical={dL_dW2[0]:.6f}, numerical={numerical_gradient('W2', 0):.6f}")
print(f"dL/dW2[1]:   analytical={dL_dW2[1]:.6f}, numerical={numerical_gradient('W2', 1):.6f}")

If analytical and numerical gradients agree to 5+ decimal places, your backprop is correct.
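“Agree to 5+ decimal places” can be made precise with a relative-error check. The helper below is a sketch (the thresholds in the comment are common rules of thumb, not from this text):

```python
def relative_error(analytical, numerical, eps=1e-12):
    """Relative error between an analytical and a numerical gradient.

    With centered differences, values around 1e-7 or smaller suggest
    the analytical gradient is correct; 1e-2 or larger signals a bug.
    The eps term guards against division by zero when both are ~0.
    """
    a, n = float(analytical), float(numerical)
    return abs(a - n) / max(abs(a), abs(n), eps)

# Example: two estimates that agree to ~5 decimal places
print(relative_error(-0.020330, -0.020331))
```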


The Chain Rule Pattern

Every backward pass follows the same pattern:

δ_current = δ_upstream × local_gradient

Weight gradient  = δ_current × input_to_this_layer
Bias gradient    = δ_current
Input gradient   = δ_current × weights  (for passing to previous layer)

This is why backprop is efficient — it reuses intermediate computations. Computing all gradients is only ~2x the cost of a forward pass.
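The pattern can be packaged as one reusable helper. This is an illustrative sketch (the name `backward_layer` and its signature are assumptions, not from the text); as a sanity check, it reproduces the Step 4–5 gradients of the worked example:

```python
import numpy as np

def backward_layer(delta_upstream, local_grad, a_in, W):
    """Generic backprop step for a dense layer (x @ W + b convention).

    delta_upstream: error arriving from the layer above
    local_grad:     activation derivative at this layer's pre-activation
    a_in:           input this layer received on the forward pass
    W:              this layer's weights, shape (in_dim, out_dim)
    """
    delta = delta_upstream * local_grad  # delta_current
    dW = np.outer(a_in, delta)           # weight gradient
    db = delta                           # bias gradient
    d_a_in = W @ delta                   # gradient passed to the previous layer
    return dW, db, d_a_in

# Reproduce Steps 4-5: error reaching the hidden layer, then the W1 gradients
delta_o = -0.0813                       # from Step 2 (rounded)
dL_dh = delta_o * np.array([0.5, 0.6])  # delta_o * W2
relu_grad = np.array([1.0, 1.0])        # both z_hidden > 0
dW1, db1, _ = backward_layer(dL_dh, relu_grad,
                             np.array([0.5, 0.8]),                # x
                             np.array([[0.1, 0.3], [0.2, 0.4]]))  # W1
print(dW1)  # matches dL/dW1 from Step 5 up to rounding
```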


One Update Step

lr = 0.5
 
W1_new = W1 - lr * dL_dW1
b1_new = b1 - lr * dL_db1
W2_new = W2 - lr * dL_dW2
b_out_new = b_out - lr * dL_db_out
 
# Verify loss decreased
z_h = x @ W1_new + b1_new
h = np.maximum(0, z_h)
z_o = h @ W2_new + b_out_new
o_new = 1 / (1 + np.exp(-z_o))
loss_new = 0.5 * (t - o_new) ** 2
 
print(f"Before: output={o:.4f}, loss={loss:.4f}")
print(f"After:  output={o_new:.4f}, loss={loss_new:.4f}")
print(f"Loss decreased: {loss_new < loss}")
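Repeating the update in a loop shows gradient descent at work. A minimal sketch that re-runs the same forward and backward formulas for 20 steps, starting from the same network and initial values as above:

```python
import numpy as np

x = np.array([0.5, 0.8]); t = 1.0
W1 = np.array([[0.1, 0.3], [0.2, 0.4]]); b1 = np.array([0.1, 0.1])
W2 = np.array([0.5, 0.6]); b_out = 0.1
lr = 0.5

for step in range(20):
    # Forward pass
    z_h = x @ W1 + b1
    h = np.maximum(0, z_h)
    z_o = h @ W2 + b_out
    o = 1 / (1 + np.exp(-z_o))
    loss = 0.5 * (t - o) ** 2
    # Backward pass (same formulas as the verification code)
    delta_o = -(t - o) * o * (1 - o)
    delta_h = delta_o * W2 * (z_h > 0)
    # Gradient descent update
    W2 = W2 - lr * delta_o * h
    b_out = b_out - lr * delta_o
    W1 = W1 - lr * np.outer(x, delta_h)
    b1 = b1 - lr * delta_h

print(f"output = {o:.4f}, loss = {loss:.6f}")  # output climbs toward t = 1.0
```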

Exercises

  1. Dead ReLU: Make z1 negative (for example, set b1 = −0.5 so z1 = 0.21 − 0.5 < 0). Trace the backward pass. What happens to the gradients of w11, w21, and b1?

  2. Sigmoid everywhere: Replace ReLU with sigmoid in the hidden layer. Recompute all gradients. Are the hidden gradients larger or smaller? (This is the vanishing gradient problem.)

  3. Deeper network: Add a second hidden layer with 2 neurons. Trace the full backward pass. Count how many multiplications the gradient goes through from loss to the first layer.

  4. Autograd verification: Implement the same network in PyTorch with requires_grad=True. Compare tensor.grad with your manual results.


Next: 08 - Attention Mechanism from Scratch — the key building block of transformers.