Weight Initialization
What
How you set the initial values of neural network weights before training. Sounds trivial, but bad initialization can make training fail completely.
Why it matters
- Too large: activations explode through layers, gradients blow up, loss goes to NaN
- Too small: activations shrink to zero through layers, gradients vanish, nothing learns
- All zeros: every neuron computes the same thing. Gradients are identical. They all update the same way. The network is effectively one neuron wide — this is the symmetry problem
The goal: keep activations and gradients at roughly the same scale across all layers.
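A quick way to see the scale problem numerically, sketched here with NumPy rather than a deep-learning framework (the width `n`, the depth, and the scale factors are illustrative choices, not canonical values):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256                      # layer width (illustrative)
x = rng.standard_normal(n)   # unit-variance input

def output_std(var_scale, depth=50):
    """Push x through `depth` linear layers with weights ~ N(0, var_scale / n)."""
    h = x.copy()
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * np.sqrt(var_scale / n)
        h = W @ h
    return h.std()

print(output_std(4.0))   # too large: std roughly doubles per layer -> explodes
print(output_std(0.25))  # too small: std roughly halves per layer -> vanishes
print(output_std(1.0))   # variance-preserving: stays near 1
```

With weight variance 1/n the signal comes out at roughly the scale it went in; a constant factor above or below 1 compounds exponentially with depth.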
Xavier / Glorot initialization
Designed for sigmoid and tanh activations. Weights drawn from:
Normal: W ~ N(0, 2/(n_in + n_out))
Uniform: W ~ U(-√(6/(n_in + n_out)), √(6/(n_in + n_out)))
Here n_in and n_out are the number of input and output neurons for the layer. Averaging the two fan sizes balances keeping activation variance constant on the forward pass against keeping gradient variance constant on the backward pass.
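The two formulas translate directly into code. A minimal NumPy sketch (the function names are mine; both distributions target the same variance 2/(n_in + n_out) implied by the uniform bound above):

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    # std chosen so Var(W) = 2 / (n_in + n_out)
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

def xavier_uniform(n_in, n_out, rng):
    # U(-limit, limit) has variance limit**2 / 3 = 2 / (n_in + n_out)
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

rng = np.random.default_rng(0)
W = xavier_uniform(400, 300, rng)
print(W.var())  # close to 2 / 700
```

The normal and uniform variants differ only in shape, not in variance.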
He / Kaiming initialization
Designed for ReLU activations. ReLU zeros out half the values, so you need to compensate with larger initial weights:
Normal: W ~ N(0, 2/n_in)
Uniform: W ~ U(-√(6/n_in), √(6/n_in))
The factor of 2 (instead of 1) accounts for ReLU killing half the activations.
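To see the factor of 2 at work, compare a deep ReLU stack initialized with weight variance 2/n against one with 1/n (NumPy sketch; width and depth are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 512                     # layer width (illustrative)
x = rng.standard_normal(n)  # input with RMS near 1

def relu_stack_rms(var_scale, depth=30):
    """RMS of the signal after `depth` ReLU layers, weights ~ N(0, var_scale / n)."""
    h = x.copy()
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * np.sqrt(var_scale / n)
        h = np.maximum(W @ h, 0.0)  # ReLU zeroes roughly half the pre-activations
    return np.sqrt(np.mean(h ** 2))

print(relu_stack_rms(2.0))  # He scaling: signal magnitude preserved
print(relu_stack_rms(1.0))  # 1/n scaling: mean square halves every layer
```

Without the factor of 2, the mean squared activation shrinks by half per layer and the signal is gone within a few dozen layers.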
The intuition
If a layer has n inputs, each output is a sum of n terms. To keep the variance of the output equal to the variance of the input, each weight should have variance ≈ 1/n. That’s it — the rest is details about which activation function you’re using.
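This claim is easy to check numerically: with unit-variance inputs and weights of variance 1/n, a single output neuron's variance comes out close to 1 (NumPy sketch; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000          # inputs per neuron
trials = 20000    # independent input samples

x = rng.standard_normal((trials, n))            # inputs, Var = 1
w = rng.normal(0.0, np.sqrt(1.0 / n), size=n)   # weights, Var = 1/n
y = x @ w                                       # one output per input sample

print(y.var())  # close to 1: the sum of n terms kept the input variance
```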
PyTorch defaults
import torch.nn as nn
# Linear layers use Kaiming uniform by default
layer = nn.Linear(256, 128)
# Manual initialization if needed
nn.init.kaiming_normal_(layer.weight, mode="fan_in", nonlinearity="relu")
nn.init.xavier_normal_(layer.weight) # for tanh/sigmoid
nn.init.zeros_(layer.bias) # bias init to zero is fine
Practical rule
Use the framework defaults. PyTorch and TensorFlow already match initialization to layer types. You almost never need to change this unless you’re building a custom architecture or debugging training instability.
| Activation | Use | PyTorch default |
|---|---|---|
| ReLU / variants | He / Kaiming | Yes (for Linear) |
| Sigmoid / Tanh | Xavier / Glorot | Need to set manually |
| None (output layer) | Xavier or small random | Depends on layer type |
Links
- Vanishing and Exploding Gradients — what happens with bad initialization
- Neurons and Activation Functions — initialization depends on activation choice
- Deep Learning Roadmap