Chain Rule

What

The derivative of a composition of functions: multiply the derivatives along the chain.

(f(g(x)))' = f'(g(x)) · g'(x)

If y depends on u, and u depends on x:

dy/dx = dy/du · du/dx
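A minimal sketch of the rule on a concrete composition (the function y = sin(x²) is an assumed example, not from the notes above): take u = x², y = sin(u), so dy/dx = cos(x²) · 2x, and check it against a numerical derivative.

```python
import math

def f(x):
    # y = sin(u) with u = x^2
    return math.sin(x ** 2)

def f_prime(x):
    # Chain rule: dy/dx = dy/du * du/dx = cos(x^2) * 2x
    return math.cos(x ** 2) * 2 * x

# Sanity check with a central finite difference.
x = 1.3
h = 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)
print(f_prime(x), numeric)  # the two values should agree closely
```

The finite-difference check is the standard way to convince yourself a hand-derived gradient is right; the same trick is used to test backprop implementations.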

Why it matters

Backpropagation IS the chain rule. A neural network is a chain of functions:

input → linear → relu → linear → relu → linear → loss

To compute how loss changes w.r.t. an early weight, multiply derivatives backwards through every layer. That’s it. That’s backprop.
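A toy version of that backward multiplication, on a one-weight chain linear → relu → loss (the specific functions and values here are assumed for illustration): each factor is one layer's local derivative, and their product is the gradient w.r.t. the weight.

```python
def forward(w, x, y):
    z = w * x             # linear layer
    a = max(z, 0.0)       # relu
    loss = (a - y) ** 2   # squared-error loss
    return z, a, loss

def grad_w(w, x, y):
    # Multiply local derivatives backwards through the chain:
    # dL/dw = dL/da * da/dz * dz/dw
    z, a, _ = forward(w, x, y)
    dL_da = 2 * (a - y)              # derivative of the loss
    da_dz = 1.0 if z > 0 else 0.0    # derivative of relu
    dz_dw = x                        # derivative of the linear layer
    return dL_da * da_dz * dz_dw

w, x, y = 0.7, 2.0, 1.0
h = 1e-6
numeric = (forward(w + h, x, y)[2] - forward(w - h, x, y)[2]) / (2 * h)
print(grad_w(w, x, y), numeric)  # analytic vs numerical gradient
```

Deeper networks just add more factors to that product; autodiff frameworks bookkeep the factors for you, but the arithmetic is exactly this.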

Key ideas

  • Chains can be arbitrarily long: dy/dx = dy/du₃ · du₃/du₂ · du₂/du₁ · du₁/dx
  • Deeper networks = longer chains = more multiplications
  • Vanishing gradients: multiplying many small numbers → gradient vanishes to ~0
  • Exploding gradients: multiplying many large numbers → gradient explodes
  • Solutions: ReLU activation, residual connections, gradient clipping, normalization
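The vanishing/exploding behavior is visible with nothing but a loop: treat each layer as contributing one local-derivative factor (the factor values 0.5, 1.5, 1.0 below are assumed toy numbers, not from a real network) and multiply 50 of them together.

```python
def chained_gradient(local_derivative, depth):
    # Gradient reaching the first layer: product of one factor per layer.
    g = 1.0
    for _ in range(depth):
        g *= local_derivative
    return g

print(chained_gradient(0.5, 50))  # vanishes toward 0
print(chained_gradient(1.5, 50))  # explodes
print(chained_gradient(1.0, 50))  # stays 1.0
```

The factor-of-1.0 case hints at why residual connections help: they give the gradient a path whose local derivative is close to 1, so the product neither shrinks nor blows up.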