Weight Initialization
What
How you set the initial values of neural network weights before training. Sounds trivial, but bad initialization can make training fail completely.
Why it matters
- Too large: activations explode through layers, gradients blow up, loss goes to NaN
- Too small: activations shrink to zero through layers, gradients vanish, nothing learns
- All zeros: every neuron computes the same thing. Gradients are identical. They all update the same way. The network is effectively one neuron wide — this is the symmetry problem
The goal: keep activations and gradients at roughly the same scale across all layers.
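A quick way to see the scale problem numerically, sketched here with NumPy rather than a deep-learning framework (the width `n`, the depth, and the scale factors are illustrative choices, not canonical values):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256                      # layer width (illustrative)
x = rng.standard_normal(n)   # unit-variance input

def output_std(var_scale, depth=50):
    """Push x through `depth` linear layers with weights ~ N(0, var_scale / n)."""
    h = x.copy()
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * np.sqrt(var_scale / n)
        h = W @ h
    return h.std()

print(output_std(4.0))   # too large: std roughly doubles per layer -> explodes
print(output_std(0.25))  # too small: std roughly halves per layer -> vanishes
print(output_std(1.0))   # variance-preserving: stays near 1
```

With weight variance 1/n the signal comes out at roughly the scale it went in; a constant factor above or below 1 compounds exponentially with depth.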
Xavier / Glorot initialization
Designed for sigmoid and tanh activations. Weights drawn from:
Normal: W ~ N(0, 2/(n_in + n_out))
Uniform: W ~ U(-√(6/(n_in + n_out)), √(6/(n_in + n_out)))
Here n_in and n_out are the number of input and output neurons for the layer. Averaging the two fan sizes balances keeping activation variance constant on the forward pass against keeping gradient variance constant on the backward pass.
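The two formulas translate directly into code. A minimal NumPy sketch (the function names are mine; both distributions target the same variance 2/(n_in + n_out) implied by the uniform bound above):

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    # std chosen so Var(W) = 2 / (n_in + n_out)
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

def xavier_uniform(n_in, n_out, rng):
    # U(-limit, limit) has variance limit**2 / 3 = 2 / (n_in + n_out)
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

rng = np.random.default_rng(0)
W = xavier_uniform(400, 300, rng)
print(W.var())  # close to 2 / 700
```

The normal and uniform variants differ only in shape, not in variance.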
He / Kaiming initialization
Designed for ReLU activations. ReLU zeros out half the values, so you need to compensate with larger initial weights:
Normal: W ~ N(0, 2/n_in)
Uniform: W ~ U(-√(6/n_in), √(6/n_in))
The factor of 2 (instead of 1) accounts for ReLU killing half the activations.
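To see the factor of 2 at work, compare a deep ReLU stack initialized with weight variance 2/n against one with 1/n (NumPy sketch; width and depth are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 512                     # layer width (illustrative)
x = rng.standard_normal(n)  # input with RMS near 1

def relu_stack_rms(var_scale, depth=30):
    """RMS of the signal after `depth` ReLU layers, weights ~ N(0, var_scale / n)."""
    h = x.copy()
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * np.sqrt(var_scale / n)
        h = np.maximum(W @ h, 0.0)  # ReLU zeroes roughly half the pre-activations
    return np.sqrt(np.mean(h ** 2))

print(relu_stack_rms(2.0))  # He scaling: signal magnitude preserved
print(relu_stack_rms(1.0))  # 1/n scaling: mean square halves every layer
```

Without the factor of 2, the mean squared activation shrinks by half per layer and the signal is gone within a few dozen layers.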
The intuition
If a layer has n inputs, each output is a sum of n terms. To keep the variance of the output equal to the variance of the input, each weight should have variance ≈ 1/n. That’s it — the rest is details about which activation function you’re using.
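This claim is easy to check numerically: with unit-variance inputs and weights of variance 1/n, a single output neuron's variance comes out close to 1 (NumPy sketch; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000          # inputs per neuron
trials = 20000    # independent input samples

x = rng.standard_normal((trials, n))            # inputs, Var = 1
w = rng.normal(0.0, np.sqrt(1.0 / n), size=n)   # weights, Var = 1/n
y = x @ w                                       # one output per input sample

print(y.var())  # close to 1: the sum of n terms kept the input variance
```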
PyTorch defaults
import torch.nn as nn
# Linear layers use Kaiming uniform by default
layer = nn.Linear(256, 128)
# Manual initialization if needed
nn.init.kaiming_normal_(layer.weight, mode="fan_in", nonlinearity="relu")
nn.init.xavier_normal_(layer.weight) # for tanh/sigmoid
nn.init.zeros_(layer.bias) # bias init to zero is fine
Practical rule
Use the framework defaults. PyTorch and TensorFlow already match initialization to layer types. You almost never need to change this unless you’re building a custom architecture or debugging training instability.
| Activation | Use | PyTorch default |
|---|---|---|
| ReLU / variants | He / Kaiming | Yes (for Linear) |
| Sigmoid / Tanh | Xavier / Glorot | Need to set manually |
| None (output layer) | Xavier or small random | Depends on layer type |
Links
- Vanishing and Exploding Gradients — what happens with bad initialization
- Neurons and Activation Functions — initialization depends on activation choice
- Deep Learning Roadmap