Vanishing and Exploding Gradients
What
In deep networks, gradients are multiplied through many layers (via the chain rule). This repeated multiplication can cause gradients to:
- Vanish (→ 0): early layers stop learning
- Explode (→ ∞): training becomes unstable, loss goes to NaN
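A quick numeric sketch of both failure modes: treat each layer as contributing one multiplicative factor to the gradient (the factor values 0.5 and 1.5 below are illustrative, not from any real network).

```python
# Repeated multiplication of per-layer gradient factors through a deep chain.
def chained_gradient(factor, depth):
    g = 1.0
    for _ in range(depth):
        g *= factor  # chain rule: one factor per layer
    return g

# Factors below 1 shrink the gradient toward 0; factors above 1 blow it up.
print(chained_gradient(0.5, 50))  # ~8.9e-16 — vanished
print(chained_gradient(1.5, 50))  # ~6.4e8  — exploded
```

Fifty layers is enough to push the gradient through roughly 24 orders of magnitude in either direction, which is why depth alone makes training fragile.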
Why it happens
- Sigmoid/tanh: derivatives are bounded below 1 (sigmoid's never exceeds 0.25) → multiplying many of them drives the gradient toward 0
- Large weights or poor initialization → per-layer factors above 1 → gradients compound exponentially
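To see the sigmoid bound concretely: its derivative is σ(x)·(1 − σ(x)), which peaks at x = 0. A short stdlib-only sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximal at x = 0

# The derivative never exceeds 0.25...
print(sigmoid_deriv(0.0))  # 0.25
# ...so even in the best case, 30 stacked sigmoid layers scale the
# gradient by at most 0.25**30:
print(0.25 ** 30)          # ~8.7e-19
```

Real activations are rarely exactly at the peak, so in practice the shrinkage per layer is even worse.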
Solutions
| Problem | Solution |
|---|---|
| Vanishing | ReLU activation (derivative = 1 for positive inputs, so it doesn't shrink gradients) |
| Vanishing | Residual connections (skip connections) — add input to output |
| Vanishing | Better initialization (He, Xavier) |
| Vanishing | Batch Normalization |
| Exploding | Gradient clipping: `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` |
| Exploding | Lower learning rate |
| Both | Proper weight initialization |
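Gradient clipping by global norm is simple enough to sketch in plain Python. This mirrors what `torch.nn.utils.clip_grad_norm_` does conceptually (compute the L2 norm of all gradients, rescale if it exceeds the threshold), but it is a hand-rolled illustration, not the library code:

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale grads in place so their global L2 norm is at most max_norm.
    (Plain-Python sketch of the idea behind clip_grad_norm_.)"""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for i in range(len(grads)):
            grads[i] *= scale
    return total_norm

grads = [3.0, 4.0]          # global L2 norm = 5.0
clip_grad_norm(grads, 1.0)
print(grads)                # [0.6, 0.8] — direction preserved, norm = 1.0
```

Note that clipping rescales the whole gradient vector, so the update direction is unchanged; only the step size is capped.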
Residual connections (the big fix)
`output = layer(input) + input` ← skip connection
Gradients can flow directly through the skip path: the derivative of `layer(x) + x` is the layer's derivative plus 1, so the gradient factor never collapses to 0. This is why ResNets (2015) enabled training networks with 100+ layers. Transformers use the same trick.
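The effect of the skip path can be shown with the same chained-factor sketch as before (the per-layer derivative 0.1 is an illustrative constant, not measured from a real network): a residual block contributes a factor of (d + 1) instead of d, because the identity path adds 1 to the local derivative.

```python
# Compare gradient flow through plain layers vs. residual blocks,
# assuming each layer has the same small local derivative d.
def chain_grad(local_deriv, depth, residual):
    g = 1.0
    for _ in range(depth):
        # residual block: d/dx [f(x) + x] = f'(x) + 1
        g *= (local_deriv + 1.0) if residual else local_deriv
    return g

print(chain_grad(0.1, 50, residual=False))  # 1e-50 — vanished
print(chain_grad(0.1, 50, residual=True))   # 1.1**50, still a usable gradient
```

With skip connections the gradient factor stays at or above 1 per block, so early layers keep receiving a learning signal no matter how deep the stack is.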