Residual Networks
What
Neural networks with skip connections (shortcut connections) that let input bypass one or more layers. Instead of learning a full mapping H(x), the layers learn a residual F(x) = H(x) - x, and the output is y = F(x) + x.
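The reformulation can be sketched with a toy scalar mapping. This is a minimal, hypothetical example (the functions `H`, `F`, `block` are illustrative names, not anything from a framework): when the target mapping is close to the identity, the residual the layers must learn is close to zero.

```python
# Residual reformulation: instead of learning H(x) directly,
# the layers learn F(x) = H(x) - x and the block outputs y = F(x) + x.
# Toy scalar example where the target mapping is near the identity.

def H(x):            # hypothetical target mapping the block should represent
    return 1.05 * x + 0.1

def F(x):            # the residual the layers actually have to learn
    return H(x) - x  # = 0.05 * x + 0.1, a small perturbation of zero

def block(x):        # residual block output
    return F(x) + x

print(block(2.0))    # identical to H(2.0)
```

Learning `F` means learning a small correction around zero rather than reconstructing the whole input-output mapping.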
The problem ResNets solved
Before ResNets (2015), deeper plain networks performed worse than shallower ones — not because of overfitting, but because optimization degraded with depth. A plain 56-layer network had higher training error than a 20-layer one. That shouldn't happen if depth only adds capacity: the deeper network could at worst copy the shallower one and set its extra layers to the identity, yet plain networks fail to find that solution.
Why skip connections work
- Gradient highway: gradients flow directly through the skip connection, bypassing layers that might squash them. This mitigates Vanishing and Exploding Gradients
- Easy to learn identity: if a block isn't helpful, its weights can go to zero and the block passes its input through unchanged, so adding depth shouldn't hurt
- Residual is easier to learn: learning a small adjustment to the input is simpler than learning the full transformation from scratch
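The gradient-highway point can be checked with scalar arithmetic. In this toy sketch (scalar slopes stand in for layer Jacobians; the numbers are illustrative), a plain chain multiplies per-layer derivatives and vanishes, while each residual block contributes `1 + slope` because of the `+ x` term.

```python
# Gradient magnitude after backpropagating through L stacked blocks.
# Plain chain: derivative is the product of per-layer slopes.
# Residual chain: each block's derivative is 1 + slope, since
# d/dx [x + F(x)] = 1 + F'(x) — the skip adds the constant 1.
# A slope of 0.05 stands in for a layer that squashes its gradient.

L = 20
slope = 0.05

plain_grad = slope ** L            # shrinks geometrically: vanishes
residual_grad = (1 + slope) ** L   # stays on the order of 1

print(plain_grad, residual_grad)
```

The `+1` in each factor is the mathematical content of "gradient highway": no matter how small `F'(x)` gets, the gradient through the skip path survives.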
Input x ──┬──→ [Conv → BN → ReLU → Conv → BN] ──→ (+) ──→ ReLU ──→ Output
          │                                        ↑
          └──────────── skip connection ───────────┘
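The diagram translates into a few lines of NumPy. This is a simplified sketch, with dense layers standing in for the 3×3 convolutions and a crude per-feature standardization standing in for BatchNorm; all names are illustrative, not a real framework API.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # feature dimension
W1, W2 = rng.normal(0, 0.1, (2, d, d))  # weights of the two layers

def bn(h):
    # crude batch-norm stand-in: standardize each feature over the batch
    return (h - h.mean(0)) / (h.std(0) + 1e-5)

def relu(h):
    return np.maximum(h, 0.0)

def residual_block(x):
    f = bn(relu(bn(x @ W1)) @ W2)  # F(x): conv -> BN -> ReLU -> conv -> BN
    return relu(f + x)             # add the skip connection, then final ReLU

x = rng.normal(size=(4, d))        # a batch of 4 inputs
y = residual_block(x)
print(y.shape)                     # same shape as the input
```

Note the structural requirement the sketch makes visible: `f + x` only works because `F(x)` keeps the input's shape. Real ResNets use a 1×1 convolution on the skip path when the block changes the channel count or stride.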
ResNet architecture
| Variant | Layers | Parameters | Top-1 accuracy (ImageNet) |
|---|---|---|---|
| ResNet-18 | 18 | 11M | ~69% |
| ResNet-34 | 34 | 21M | ~73% |
| ResNet-50 | 50 | 25M | ~76% |
| ResNet-101 | 101 | 44M | ~77% |
| ResNet-152 | 152 | 60M | ~78% |
Bottleneck blocks
ResNet-50+ uses bottleneck blocks to reduce computation:
1. 1x1 conv (reduce channels: 256 → 64)
2. 3x3 conv (process at the reduced width)
3. 1x1 conv (expand channels: 64 → 256)
The 1x1 convolutions squeeze and expand the channel dimension, so the expensive 3x3 convolution works on fewer channels. This makes deeper networks practical.
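The saving is easy to verify by counting multiplies per spatial position (equivalently, weights per block; spatial size cancels out of the comparison). The channel numbers are the ones above; everything else is plain arithmetic.

```python
# Multiplies per spatial position: a plain block of two 3x3 convs at
# 256 channels vs a bottleneck (1x1 reduce to 64, 3x3 at 64, 1x1
# expand back to 256).

plain = 2 * (3 * 3 * 256 * 256)        # two 3x3 convs, 256 -> 256

bottleneck = (1 * 1 * 256 * 64         # 1x1 reduce: 256 -> 64
              + 3 * 3 * 64 * 64        # 3x3 at the reduced width
              + 1 * 1 * 64 * 256)      # 1x1 expand: 64 -> 256

print(plain, bottleneck, plain / bottleneck)  # ~17x fewer multiplies
```

The expensive 3×3 convolution runs on 64 channels instead of 256, which is where nearly all of the ~17× reduction comes from.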
Impact
- Enabled training networks with 100+ layers (up to 1000+ in experiments)
- Won ImageNet 2015 by a large margin
- Skip connections became standard in almost every modern architecture: DenseNet, U-Net, Transformers (residual connections around attention and FFN layers)
- Transfer Learning with pretrained ResNets is one of the most common starting points for vision tasks
Links
- Convolutional Neural Networks — the base architecture ResNets build on
- Vanishing and Exploding Gradients — the problem skip connections solve
- Batch Normalization — used in every residual block
- Transfer Learning — pretrained ResNets are the go-to starting point