Vanishing and Exploding Gradients
What
In deep networks, gradients are multiplied through many layers (via the chain rule). This repeated multiplication can cause gradients to:
- Vanish (→ 0): early layers stop learning
- Explode (→ ∞): training becomes unstable, loss goes to NaN
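A quick numeric sketch of both failure modes: treat each layer as contributing one multiplicative factor to the gradient (the factor values 0.5 and 1.5 below are illustrative, not from any real network).

```python
# Repeated multiplication of per-layer gradient factors through a deep chain.
def chained_gradient(factor, depth):
    g = 1.0
    for _ in range(depth):
        g *= factor  # chain rule: one factor per layer
    return g

# Factors below 1 shrink the gradient toward 0; factors above 1 blow it up.
print(chained_gradient(0.5, 50))  # ~8.9e-16 — vanished
print(chained_gradient(1.5, 50))  # ~6.4e8  — exploded
```

Fifty layers is enough to push the gradient through roughly 24 orders of magnitude in either direction, which is why depth alone makes training fragile.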
Why it happens
- Sigmoid/tanh: derivatives are bounded below 1 (sigmoid's never exceeds 0.25) → multiplying many of them drives the gradient toward 0
- Large weights or poor initialization → per-layer factors above 1 → gradients compound exponentially
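To see the sigmoid bound concretely: its derivative is σ(x)·(1 − σ(x)), which peaks at x = 0. A short stdlib-only sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximal at x = 0

# The derivative never exceeds 0.25...
print(sigmoid_deriv(0.0))  # 0.25
# ...so even in the best case, 30 stacked sigmoid layers scale the
# gradient by at most 0.25**30:
print(0.25 ** 30)          # ~8.7e-19
```

Real activations are rarely exactly at the peak, so in practice the shrinkage per layer is even worse.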
Solutions
| Problem | Solution |
|---|---|
| Vanishing | ReLU activation (derivative = 1 for positive inputs, so it doesn't shrink gradients) |
| Vanishing | Residual connections (skip connections) — add input to output |
| Vanishing | Better initialization (He, Xavier) |
| Vanishing | Batch Normalization |
| Exploding | Gradient clipping: `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` |
| Exploding | Lower learning rate |
| Both | Proper weight initialization |
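Gradient clipping by global norm is simple enough to sketch in plain Python. This mirrors what `torch.nn.utils.clip_grad_norm_` does conceptually (compute the L2 norm of all gradients, rescale if it exceeds the threshold), but it is a hand-rolled illustration, not the library code:

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale grads in place so their global L2 norm is at most max_norm.
    (Plain-Python sketch of the idea behind clip_grad_norm_.)"""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for i in range(len(grads)):
            grads[i] *= scale
    return total_norm

grads = [3.0, 4.0]          # global L2 norm = 5.0
clip_grad_norm(grads, 1.0)
print(grads)                # [0.6, 0.8] — direction preserved, norm = 1.0
```

Note that clipping rescales the whole gradient vector, so the update direction is unchanged; only the step size is capped.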
Residual connections (the big fix)
`output = layer(input) + input` ← skip connection
Gradients can flow directly through the skip path: the derivative of `layer(x) + x` is the layer's derivative plus 1, so the gradient factor never collapses to 0. This is why ResNets (2015) enabled training networks with 100+ layers. Transformers use the same trick.
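The effect of the skip path can be shown with the same chained-factor sketch as before (the per-layer derivative 0.1 is an illustrative constant, not measured from a real network): a residual block contributes a factor of (d + 1) instead of d, because the identity path adds 1 to the local derivative.

```python
# Compare gradient flow through plain layers vs. residual blocks,
# assuming each layer has the same small local derivative d.
def chain_grad(local_deriv, depth, residual):
    g = 1.0
    for _ in range(depth):
        # residual block: d/dx [f(x) + x] = f'(x) + 1
        g *= (local_deriv + 1.0) if residual else local_deriv
    return g

print(chain_grad(0.1, 50, residual=False))  # 1e-50 — vanished
print(chain_grad(0.1, 50, residual=True))   # 1.1**50, still a usable gradient
```

With skip connections the gradient factor stays at or above 1 per block, so early layers keep receiving a learning signal no matter how deep the stack is.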