Dropout

What

During training, randomly set a fraction p of neuron outputs to zero. Each forward pass uses a different random subset of the network. At inference time, use all neurons but scale weights down by the keep probability 1 − p (or equivalently, scale surviving activations up by 1/(1 − p) during training; this “inverted dropout” is what PyTorch uses).
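The scaling step can be sketched by hand (a minimal illustration; p and the tensor shape here are arbitrary):

```python
import torch

torch.manual_seed(0)
p = 0.5
x = torch.ones(8)

# Training: sample a mask, then scale the survivors by 1/(1-p)
# ("inverted dropout") so the expected activation is unchanged
mask = (torch.rand_like(x) > p).float()
y = x * mask / (1 - p)

# Inference: use x unchanged -- no mask, no rescaling
```

With p = 0.5, every surviving entry of y is exactly 2.0 and every dropped entry is 0.0, so the expectation of each entry stays 1.0.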

Why it works

  • Implicit ensemble: each training step uses a different sub-network. The final model is like an average of exponentially many thin networks
  • Reduces co-adaptation: neurons can’t rely on specific other neurons always being present, so they learn more robust features independently
  • Cheap regularization: one hyperparameter (drop rate), almost no computational overhead
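The sub-network sampling is easy to observe: two forward passes through the same dropout layer in training mode draw different masks (a small sketch):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
drop.train()                 # training mode: a fresh mask per forward pass
x = torch.ones(1000)

# Two forward passes sample two different sub-networks
y1, y2 = drop(x), drop(x)
assert not torch.equal(y1, y2)
```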

Typical rates

  • 0.1-0.3 for input layers and early layers
  • 0.3-0.5 for hidden layers (0.5 was the original paper’s recommendation)
  • Never on the output layer — you need all outputs for the prediction
  • Higher dropout = stronger regularization = more training epochs needed
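A sketch of how these rates might be placed in a small stack (the exact rates and layer sizes are illustrative, not prescriptive):

```python
import torch.nn as nn

# Illustrative placement: lighter dropout near the input,
# heavier in hidden layers, none on the output layer
model = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(512, 256), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(256, 10),          # output layer: no dropout
)
```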

PyTorch example

import torch
import torch.nn as nn
 
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 10)
        self.dropout = nn.Dropout(p=0.5)
 
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)         # applied AFTER activation
        return self.fc2(x)
 
# dropout is automatically disabled during model.eval()
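A quick sanity check of the train/eval behavior (a minimal sketch):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(100)

drop.train()                 # training mode: entries are zeroed or scaled to 2.0
y_train = drop(x)

drop.eval()                  # eval mode: dropout is the identity
y_eval = drop(x)
assert torch.equal(y_eval, x)
```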

Spatial dropout for CNNs

Standard dropout zeros individual values, but in CNNs adjacent pixels are correlated — zeroing one pixel barely matters. Spatial dropout (Dropout2d) drops entire feature maps (channels) instead, forcing the network to not rely on any single feature detector.

self.spatial_dropout = nn.Dropout2d(p=0.2)  # drops entire channels
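A small sketch showing that Dropout2d zeroes whole channels rather than individual pixels (the shapes and p here are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
sd = nn.Dropout2d(p=0.5)
sd.train()
x = torch.ones(1, 8, 4, 4)   # (batch, channels, height, width)
y = sd(x)

# Each channel is either entirely zeroed or entirely kept (scaled by 1/(1-p))
for c in range(8):
    assert torch.all(y[0, c] == 0) or torch.all(y[0, c] == 2.0)
```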

Comparison with other regularization

Method               Mechanism                  When to prefer
Dropout              Zero out neurons randomly  Dense layers, large networks
Weight decay (L2)    Penalize large weights     Always (often combined with dropout)
Batch Normalization  Normalize activations      CNNs (slight regularization as side effect)
Data augmentation    Expand training set        When you have image/text data
Early stopping       Stop before overfitting    Always as a safety net
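Since dropout and weight decay are often combined, a minimal sketch of using both in PyTorch (the model shape and hyperparameters are illustrative):

```python
import torch

# Dropout lives inside the model; L2 comes from the optimizer's weight_decay
model = torch.nn.Sequential(
    torch.nn.Linear(784, 256), torch.nn.ReLU(), torch.nn.Dropout(p=0.5),
    torch.nn.Linear(256, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
```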