Deep Residual Learning for Image Recognition
Kaiming He et al. (2015)
Why It Matters
Skip connections enable networks with 152+ layers to train. ResNet won ILSVRC 2015, and it remains a backbone in vision and multi-modal systems.
Key Ideas
- Very deep plain networks suffered from a degradation problem: adding layers made optimization worse even when overfitting was not the issue.
- Residual blocks solve this by learning a residual function F(x) and adding the input back through a skip connection, making identity mappings easy to preserve.
- Bottleneck blocks let the network go much deeper without exploding compute by using a 1x1 -> 3x3 -> 1x1 structure.
- Residual design changed deep vision backbones permanently; later CNNs and even Transformer variants reuse the same skip-connection idea.
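A minimal NumPy sketch of the core idea, y = F(x) + x: the weight matrices and the two-layer form of F are illustrative choices (the paper uses convolutions, and applies an activation after the addition, omitted here for clarity). When the weights are zero, F(x) = 0 and the block reduces exactly to the identity, which is why optimization can "fall back to identity."

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    # F(x) = W2 · relu(W1 · x); the skip connection adds the input back.
    f_x = w2 @ relu(w1 @ x)
    return f_x + x

d = 4
x = np.array([1.0, -2.0, 3.0, 0.5])

# Zero weights: F(x) = 0, so the block is exactly the identity mapping.
zero = np.zeros((d, d))
assert np.allclose(residual_block(x, zero, zero), x)
```

A plain (non-residual) block with zero weights would instead map everything to zero, so extra layers can only hurt until their weights are learned; the skip connection removes that failure mode.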
Notes
- ResNet won ILSVRC 2015 and showed that 152 layers could train reliably.
- The key claim is not just “deeper is better” but “deeper becomes trainable if optimization can fall back to identity.”
- Skip connections later became standard far beyond vision: sequence models, multimodal models, diffusion backbones, and modern MLP stacks all rely on residual pathways.