Vanishing and Exploding Gradients

What

In deep networks, backpropagation multiplies gradients through many layers via the chain rule. This repeated multiplication can cause gradients to:

  • Vanish (→ 0): early layers stop learning
  • Explode (→ ∞): training becomes unstable, loss goes to NaN

Why it happens

  • Sigmoid/tanh: derivatives are small (sigmoid's peaks at 0.25, tanh's at 1) → multiplying many of them drives the product toward 0
  • Large weights or poor initialization → each layer multiplies the gradient by a factor > 1, so it grows exponentially with depth
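A quick way to see the vanishing case numerically: even in the best case for a sigmoid (pre-activation exactly 0, where its derivative peaks at 0.25), the backpropagated gradient shrinks as 0.25^depth. A minimal sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

# Best case for the gradient: every pre-activation is 0, so each layer
# contributes a factor of 0.25. The gradient still decays exponentially.
for depth in (5, 10, 20):
    factor = sigmoid_deriv(0.0) ** depth
    print(f"{depth} sigmoid layers: gradient scaled by {factor:.2e}")
```

By 20 layers the scaling factor is below 1e-12, so the early layers receive essentially no learning signal.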

Solutions

  Problem     Solution
  ---------   --------------------------------------------------------------
  Vanishing   ReLU activation (derivative = 1 for positive inputs)
  Vanishing   Residual connections (skip connections): add input to output
  Vanishing   Better initialization (He, Xavier)
  Vanishing   Batch Normalization
  Exploding   Gradient clipping: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
  Exploding   Lower learning rate
  Both        Proper weight initialization
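Gradient clipping by global norm, as done by torch.nn.utils.clip_grad_norm_, rescales all gradients together when their combined L2 norm exceeds a threshold. A minimal sketch of the idea in plain Python (a simplified stand-in, not PyTorch's actual implementation):

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale a list of gradients so their global L2 norm is at most
    max_norm. Returns the (possibly rescaled) gradients and the norm
    measured before clipping."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# Gradients [3, 4] have norm 5; clipping to max_norm=1.0 rescales
# both by 1/5, preserving their direction.
clipped, norm_before = clip_grad_norm([3.0, 4.0], max_norm=1.0)
```

Because every gradient is scaled by the same factor, the update direction is unchanged; only the step size is capped.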

Residual connections (the big fix)

output = layer(input) + input   ← skip connection

Gradients can flow directly through the skip path. This is why ResNets (2015) enabled training 100+ layer networks. Also used in transformers.
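Why the skip path helps: with output = layer(input) + input, the derivative with respect to the input is layer'(input) + 1, so even when the layer's own gradient vanishes, each layer passes back a factor near 1. A sketch with a hypothetical per-layer local gradient of 0.1, chosen only to illustrate the effect:

```python
# Gradient reaching the first layer of a 20-layer stack, with and
# without skip connections. local_grad = 0.1 is a made-up value for
# a layer whose own gradient has nearly vanished.
depth = 20
local_grad = 0.1

plain_grad = local_grad ** depth             # chain rule: product of local gradients
residual_grad = (local_grad + 1.0) ** depth  # each skip path adds +1 to the factor

print(f"plain:    {plain_grad:.1e}")  # 1e-20: effectively zero
print(f"residual: {residual_grad:.2f}")
```

The plain stack's gradient is 10^-20 and the early layers cannot learn, while the residual stack keeps a healthy gradient thanks to the +1 term contributed by every skip connection.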