Deep Residual Learning for Image Recognition

Kaiming He et al. (2015)


Why It Matters

Skip connections enabled networks of 152+ layers to train reliably. ResNet won ILSVRC 2015, and it remains a standard backbone in vision and multi-modal systems.

Key Ideas

  1. Very deep plain networks suffered from a degradation problem: adding layers raised even the training error, so the failure was one of optimization rather than overfitting.
  2. Residual blocks solve this by learning a residual function F(x) and adding the input back through a skip connection, making identity mappings easy to preserve.
  3. Bottleneck blocks let the network go much deeper without exploding compute by using a 1x1 -> 3x3 -> 1x1 structure: the first 1x1 convolution reduces the channel width, the 3x3 operates at the reduced width, and the last 1x1 restores it.
  4. Residual design changed deep vision backbones permanently; later CNNs and even Transformer variants reuse the same skip-connection idea.
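The compute saving from the bottleneck design (idea 3) can be checked with a quick parameter count. A minimal sketch in plain Python, using the 256-channel stage widths from the paper (biases and batch norm ignored for simplicity):

```python
def conv_params(k, c_in, c_out):
    # Weight count of a k x k convolution from c_in to c_out channels.
    return k * k * c_in * c_out

# Plain block: two 3x3 convolutions at the full 256-channel width.
plain = 2 * conv_params(3, 256, 256)

# Bottleneck block: 1x1 reduces 256 -> 64, 3x3 runs at 64, 1x1 restores 64 -> 256.
bottleneck = (conv_params(1, 256, 64)
              + conv_params(3, 64, 64)
              + conv_params(1, 64, 256))

print(plain)       # 1179648
print(bottleneck)  # 69632
```

At the same input/output width, the bottleneck uses roughly 17x fewer parameters per block, which is what makes 100+ layer networks affordable.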

Notes

  • ResNet won ILSVRC 2015 and showed that 152 layers could train reliably.
  • The key claim is not just “deeper is better” but “deeper becomes trainable if optimization can fall back to identity.”
  • Skip connections later became standard far beyond vision: sequence models, multimodal models, diffusion backbones, and modern MLP stacks all rely on residual pathways.
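The "fall back to identity" point is easy to see concretely: when the residual branch's weights are zero, the block's output is exactly its input, so extra blocks can never make the mapping worse. A minimal numpy sketch (dense layers stand in for convolutions; shapes and names are illustrative, not from the paper's code):

```python
import numpy as np

def residual_block(x, w1, w2):
    # y = x + F(x), where F is a small two-layer branch with a ReLU in between.
    return x + w2 @ np.maximum(w1 @ x, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w1 = rng.standard_normal((8, 8))

# Zeroing the branch's last layer collapses F(x) to 0, so the block is
# the identity mapping regardless of w1 or the input.
w2 = np.zeros((8, 8))
assert np.allclose(residual_block(x, w1, w2), x)
```

A plain (non-residual) stack has no such easy identity solution, which is one way to read the degradation problem described in the notes above.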