Dropout

What

During training, randomly set a fraction p of neuron outputs to zero. Each forward pass uses a different random subset of the network. At inference time, use all neurons but scale weights down by the keep probability 1 − p (or equivalently, scale surviving activations up by 1/(1 − p) during training; this “inverted dropout” is what PyTorch uses).
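The scaling step can be sketched by hand (a minimal illustration; p and the tensor shape here are arbitrary):

```python
import torch

torch.manual_seed(0)
p = 0.5
x = torch.ones(8)

# Training: sample a mask, then scale the survivors by 1/(1-p)
# ("inverted dropout") so the expected activation is unchanged
mask = (torch.rand_like(x) > p).float()
y = x * mask / (1 - p)

# Inference: use x unchanged -- no mask, no rescaling
```

With p = 0.5, every surviving entry of y is exactly 2.0 and every dropped entry is 0.0, so the expectation of each entry stays 1.0.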

Why it works

  • Implicit ensemble: each training step uses a different sub-network. The final model is like an average of exponentially many thin networks
  • Reduces co-adaptation: neurons can’t rely on specific other neurons always being present, so they learn more robust features independently
  • Cheap regularization: one hyperparameter (drop rate), almost no computational overhead
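The sub-network sampling is easy to observe: two forward passes through the same dropout layer in training mode draw different masks (a small sketch):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
drop.train()                 # training mode: a fresh mask per forward pass
x = torch.ones(1000)

# Two forward passes sample two different sub-networks
y1, y2 = drop(x), drop(x)
assert not torch.equal(y1, y2)
```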

Typical rates

  • 0.1-0.3 for input layers and early layers
  • 0.3-0.5 for hidden layers (0.5 was the original paper’s recommendation)
  • Never on the output layer — you need all outputs for the prediction
  • Higher dropout = stronger regularization = more training epochs needed
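A sketch of how these rates might be placed in a small stack (the exact rates and layer sizes are illustrative, not prescriptive):

```python
import torch.nn as nn

# Illustrative placement: lighter dropout near the input,
# heavier in hidden layers, none on the output layer
model = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(512, 256), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(256, 10),          # output layer: no dropout
)
```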

PyTorch example

import torch
import torch.nn as nn
 
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 10)
        self.dropout = nn.Dropout(p=0.5)
 
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)         # applied AFTER activation
        return self.fc2(x)
 
# dropout is automatically disabled during model.eval()
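A quick sanity check of the train/eval behavior (a minimal sketch):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(100)

drop.train()                 # training mode: entries are zeroed or scaled to 2.0
y_train = drop(x)

drop.eval()                  # eval mode: dropout is the identity
y_eval = drop(x)
assert torch.equal(y_eval, x)
```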

Spatial dropout for CNNs

Standard dropout zeros individual values, but in CNNs adjacent pixels are correlated — zeroing one pixel barely matters. Spatial dropout (Dropout2d) drops entire feature maps (channels) instead, forcing the network to not rely on any single feature detector.

self.spatial_dropout = nn.Dropout2d(p=0.2)  # drops entire channels
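A small sketch showing that Dropout2d zeroes whole channels rather than individual pixels (the shapes and p here are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
sd = nn.Dropout2d(p=0.5)
sd.train()
x = torch.ones(1, 8, 4, 4)   # (batch, channels, height, width)
y = sd(x)

# Each channel is either entirely zeroed or entirely kept (scaled by 1/(1-p))
for c in range(8):
    assert torch.all(y[0, c] == 0) or torch.all(y[0, c] == 2.0)
```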

Comparison with other regularization

Method               Mechanism                  When to prefer
Dropout              Zero out neurons randomly  Dense layers, large networks
Weight decay (L2)    Penalize large weights     Always (often combined with dropout)
Batch Normalization  Normalize activations      CNNs (slight regularization as side effect)
Data augmentation    Expand training set        When you have image/text data
Early stopping       Stop before overfitting    Always as a safety net
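Since dropout and weight decay are often combined, a minimal sketch of using both in PyTorch (the model shape and hyperparameters are illustrative):

```python
import torch

# Dropout lives inside the model; L2 comes from the optimizer's weight_decay
model = torch.nn.Sequential(
    torch.nn.Linear(784, 256), torch.nn.ReLU(), torch.nn.Dropout(p=0.5),
    torch.nn.Linear(256, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
```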