Dropout
What
During training, randomly set a fraction of neuron outputs to zero. Each forward pass uses a different random subset of the network. At inference time, use all neurons but scale weights down (or equivalently, scale up during training — “inverted dropout”, which PyTorch uses).
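The train/inference bookkeeping above can be sketched in a few lines of NumPy (a toy illustration of inverted dropout, not how PyTorch implements it; names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(x, p, training=True):
    """Inverted dropout: scale kept activations by 1/(1-p) at train time,
    so inference is a plain identity pass-through."""
    if not training:
        return x
    mask = rng.random(x.shape) >= p      # keep each unit with probability 1 - p
    return x * mask / (1.0 - p)          # scale survivors so E[output] == input

x = np.ones(10_000)
y = inverted_dropout(x, p=0.5)
# survivors are scaled to 2.0, the rest are 0; the mean stays ~1.0
```

The `/ (1.0 - p)` is exactly the "scale up during training" trick: it keeps the expected activation unchanged, so no rescaling is needed at inference.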
Why it works
- Implicit ensemble: each training step uses a different sub-network. The final model is like an average of exponentially many thin networks
- Reduces co-adaptation: neurons can’t rely on specific other neurons always being present, so they learn more robust features independently
- Cheap regularization: one hyperparameter (drop rate), almost no computational overhead
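The "different sub-network each step" point is easy to see directly (a small PyTorch sketch):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()                    # dropout active
a, b = drop(x), drop(x)         # two forward passes, two random masks
print(torch.equal(a, b))        # almost surely False: different sub-networks

drop.eval()                     # dropout off at inference
print(torch.equal(drop(x), x))  # True: identity pass-through
```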
Typical rates
- 0.1-0.3 for input layers and early layers
- 0.3-0.5 for hidden layers (0.5 was the original paper’s recommendation)
- Never on the output layer — you need all outputs for the prediction
- Higher dropout = stronger regularization = more training epochs needed
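A quick empirical check of what a given rate means in PyTorch: roughly a fraction p of activations are zeroed, and survivors are scaled by 1/(1-p) because of inverted dropout:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.ones(100_000)
for p in (0.1, 0.3, 0.5):
    drop = nn.Dropout(p=p)
    drop.train()
    y = drop(x)
    zero_frac = (y == 0).float().mean().item()  # ~p of values are zeroed
    scale = y[y > 0][0].item()                  # survivors scaled by 1/(1-p)
    print(f"p={p}: ~{zero_frac:.2f} zeroed, survivors scaled to {scale:.2f}")
```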
PyTorch example
```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 10)
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)  # applied AFTER activation
        return self.fc2(x)

# dropout is automatically disabled during model.eval()
```

Spatial dropout for CNNs
Standard dropout zeros individual values, but in CNNs adjacent pixels are correlated — zeroing one pixel barely matters. Spatial dropout (Dropout2d) drops entire feature maps (channels) instead, forcing the network to not rely on any single feature detector.
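The channel-wise behaviour is easy to verify: with `Dropout2d`, every channel is dropped or kept as a whole, and kept channels are scaled by 1/(1-p):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop2d = nn.Dropout2d(p=0.5)
drop2d.train()

x = torch.ones(1, 8, 4, 4)  # (batch, channels, height, width)
y = drop2d(x)

# each channel is all-zero or all-kept (survivors scaled by 1/(1-0.5) = 2)
for c in range(8):
    ch = y[0, c]
    assert bool((ch == 0).all()) or bool((ch == 2.0).all())
```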
```python
self.spatial_dropout = nn.Dropout2d(p=0.2)  # drops entire channels
```

Comparison with other regularization
| Method | Mechanism | When to prefer |
|---|---|---|
| Dropout | Zero out neurons randomly | Dense layers, large networks |
| Weight decay (L2) | Penalize large weights | Always (often combined with dropout) |
| Batch Normalization | Normalize activations | CNNs (slight regularization as side effect) |
| Data augmentation | Expand training set | When you have image/text data |
| Early stopping | Stop before overfitting | Always as a safety net |
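As the table notes, dropout and weight decay are usually combined; a minimal sketch (layer sizes and hyperparameters below are arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(256, 10)
)
# weight_decay adds the L2 penalty; dropout lives inside the model itself
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

model.train()   # dropout active during training steps
# ... training loop ...
model.eval()    # dropout disabled for validation/inference
```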
Links
- Regularization — dropout is a regularization technique
- Neurons and Activation Functions — what gets dropped
- Batch Normalization — often used together, but interaction can be tricky
- Deep Learning Roadmap