Weight Initialization

What

How you set the initial values of neural network weights before training. Sounds trivial, but bad initialization can make training fail completely.

Why it matters

  • Too large: activations explode through layers, gradients blow up, loss goes to NaN
  • Too small: activations shrink to zero through layers, gradients vanish, nothing learns
  • All zeros: every neuron computes the same thing. Gradients are identical. They all update the same way. The network is effectively one neuron wide — this is the symmetry problem

The goal: keep activations and gradients at roughly the same scale across all layers.
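This is easy to see numerically. Below is a minimal numpy sketch (not from the original text; layer width, depth, and the three scales are illustrative choices): push a random vector through a stack of purely linear layers at three different weight scales and compare the output magnitudes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256  # width of every layer (illustrative)

def final_std(scale, depth=50):
    """Push a random vector through `depth` linear layers, each with
    i.i.d. N(0, scale^2) weights, and return the output's std."""
    x = rng.standard_normal(n)
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * scale
        x = W @ x
    return x.std()

too_small = final_std(0.01)            # activations shrink toward zero
balanced  = final_std(1 / np.sqrt(n))  # stays near the input scale
too_big   = final_std(0.2)             # blows up layer after layer
```

Each layer multiplies the activation scale by roughly √n · scale, so anything other than scale ≈ 1/√n compounds exponentially over 50 layers.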

Xavier / Glorot initialization

Designed for sigmoid and tanh activations. Weights drawn from:

Normal:  W ~ N(0, 2/(n_in + n_out))
Uniform: W ~ U(-√(6/(n_in + n_out)), √(6/(n_in + n_out)))

Where n_in and n_out are the number of input and output neurons for the layer. Keeps variance of activations constant across layers.
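A quick numpy sketch of both variants (layer sizes are illustrative, and this samples by hand rather than calling a framework): the uniform bound √(6/(n_in + n_out)) is chosen precisely so that both distributions have the same variance, since Var(U(−a, a)) = a²/3.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 512, 256          # illustrative layer sizes
var = 2.0 / (n_in + n_out)      # target variance for both variants

# Normal variant: zero mean, std = sqrt(var)
W_normal = rng.normal(0.0, np.sqrt(var), size=(n_out, n_in))

# Uniform variant: bound chosen so a^2 / 3 == var
bound = np.sqrt(6.0 / (n_in + n_out))
W_uniform = rng.uniform(-bound, bound, size=(n_out, n_in))
```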

He / Kaiming initialization

Designed for ReLU activations. ReLU zeros out half the values, so you need to compensate with larger initial weights:

Normal:  W ~ N(0, 2/n_in)
Uniform: W ~ U(-√(6/n_in), √(6/n_in))

The factor of 2 (instead of 1) accounts for ReLU killing half the activations.
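The factor of 2 is easy to verify numerically. A numpy sketch (width, depth, and seed are illustrative): with ReLU layers, He scaling keeps the activation variance roughly constant, while plain 1/n_in scaling halves it at every layer and the signal vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512  # layer width (illustrative)
x = rng.standard_normal(n)

def final_var(weight_var, depth=30):
    """Run x through `depth` ReLU layers with the given weight variance
    and return the variance of the final activations."""
    h = x
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * np.sqrt(weight_var)
        h = np.maximum(W @ h, 0.0)  # ReLU zeros the negative half
    return h.var()

he_var     = final_var(2.0 / n)  # roughly preserved across depth
xavier_var = final_var(1.0 / n)  # halves per layer, collapses to ~0
```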

The intuition

If a layer has n inputs, each output is a sum of n terms. To keep the variance of the output equal to the variance of the input, each weight should have variance ≈ 1/n. That’s it — the rest is details about which activation function you’re using.
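The claim above can be checked directly (numpy sketch, sizes are illustrative): for a single output y = Σᵢ wᵢxᵢ over n independent zero-mean inputs, Var(y) = n · Var(w) · Var(x), so Var(w) = 1/n makes Var(y) match Var(x).

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 5000  # illustrative sizes

x = rng.standard_normal((trials, n))               # Var(x) = 1
w = rng.standard_normal((trials, n)) / np.sqrt(n)  # Var(w) = 1/n
y = (w * x).sum(axis=1)                            # one output per trial
# y.var() should come out close to 1, matching Var(x)
```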

PyTorch defaults

import torch.nn as nn
 
# Linear layers use Kaiming uniform by default
layer = nn.Linear(256, 128)
 
# Manual initialization if needed
nn.init.kaiming_normal_(layer.weight, mode="fan_in", nonlinearity="relu")
nn.init.xavier_normal_(layer.weight)  # for tanh/sigmoid
nn.init.zeros_(layer.bias)            # bias init to zero is fine

Practical rule

Use the framework defaults. PyTorch and TensorFlow already match initialization to layer types. You almost never need to change this unless you’re building a custom architecture or debugging training instability.

Activation           Use                      PyTorch default
ReLU / variants      He / Kaiming             Yes (for Linear)
Sigmoid / Tanh       Xavier / Glorot          Need to set manually
None (output layer)  Xavier or small random   Depends on layer type