Chain Rule
What
The derivative of a composition of functions: multiply the derivatives along the chain.
(f(g(x)))' = f'(g(x)) · g'(x)
If y depends on u, and u depends on x:
dy/dx = dy/du · du/dx
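A quick numeric check of the rule, using a made-up composition y = sin(u), u = x² (so dy/dx = cos(x²) · 2x), compared against a finite-difference estimate:

```python
import math

def f(u):   # outer function: y = sin(u)
    return math.sin(u)

def g(x):   # inner function: u = x**2
    return x * x

def chain_rule_derivative(x):
    # dy/du evaluated at u = g(x), times du/dx
    dy_du = math.cos(g(x))
    du_dx = 2 * x
    return dy_du * du_dx

def numerical_derivative(x, h=1e-6):
    # central difference on the full composition f(g(x))
    return (f(g(x + h)) - f(g(x - h))) / (2 * h)

x = 1.3
analytic = chain_rule_derivative(x)
numeric = numerical_derivative(x)
# the two agree to many decimal places
```

The analytic and numeric values match, which is exactly what the chain rule guarantees.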
Why it matters
Backpropagation IS the chain rule. A neural network is a chain of functions:
input → linear → relu → linear → relu → linear → loss
To compute how the loss changes w.r.t. an early weight, multiply the local derivatives backwards through every layer. That’s it. That’s backprop.
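A minimal backprop sketch on a scalar version of that chain (names w1, w2, x, t are illustrative, not from any library): forward pass saves intermediates, backward pass multiplies local derivatives:

```python
def forward_backward(x, w1, w2, t):
    # forward pass: x -> linear -> relu -> linear -> squared loss
    a = w1 * x               # first linear
    h = max(a, 0.0)          # relu
    y = w2 * h               # second linear
    loss = (y - t) ** 2

    # backward pass: chain rule, link by link
    dloss_dy = 2 * (y - t)
    dh_da = 1.0 if a > 0 else 0.0   # relu derivative

    grad_w2 = dloss_dy * h                    # dloss/dy · dy/dw2
    grad_w1 = dloss_dy * w2 * dh_da * x       # dloss/dy · dy/dh · dh/da · da/dw1
    return loss, grad_w1, grad_w2

loss, g1, g2 = forward_backward(x=2.0, w1=0.5, w2=-1.0, t=1.0)
# loss = 4.0, grad_w1 = 8.0, grad_w2 = -4.0
```

Note that grad_w1 is a product of four terms, one per link in the chain back to w1: that product is the whole story of backprop.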
Key ideas
- Chains can be arbitrarily long: dy/dx = dy/du₃ · du₃/du₂ · du₂/du₁ · du₁/dx
- Deeper networks = longer chains = more multiplications
- Vanishing gradients: multiplying many small numbers → gradient vanishes to ~0
- Exploding gradients: multiplying many large numbers → gradient explodes
- Solutions: ReLU activation, residual connections, gradient clipping, normalization
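The vanishing/exploding behavior above falls straight out of the chain rule: the end-to-end gradient is a product of per-layer derivatives. A sketch with made-up per-layer derivative values:

```python
def chain_gradient(layer_derivative, depth):
    # multiply one (identical, made-up) local derivative per layer
    grad = 1.0
    for _ in range(depth):
        grad *= layer_derivative
    return grad

print(chain_gradient(0.5, 50))   # 0.5**50 ≈ 8.9e-16: vanishes
print(chain_gradient(1.5, 50))   # 1.5**50 ≈ 6.4e8: explodes
```

Fifty layers of derivative 0.5 annihilate the gradient; fifty layers of 1.5 blow it up. The listed fixes all attack this product: ReLU keeps derivatives at 1 on the active path, residual connections add a derivative-1 shortcut, clipping caps the product's magnitude, and normalization keeps layer derivatives near 1.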