Deep Residual Learning for Image Recognition

Kaiming He et al. (2015)


Why It Matters

Skip connections enabled networks of 152+ layers to train reliably. ResNet won ILSVRC 2015, and it remains a standard backbone in vision and multi-modal systems.

Key Ideas

  1. Very deep plain networks suffered from a degradation problem: adding layers raised even the training error, so the failure was one of optimization rather than overfitting.
  2. Residual blocks solve this by learning a residual function F(x) and adding the input back through a skip connection, making identity mappings easy to preserve.
  3. Bottleneck blocks let the network go much deeper without exploding compute by using a 1x1 -> 3x3 -> 1x1 structure: the first 1x1 convolution reduces the channel width, the 3x3 operates at the reduced width, and the last 1x1 restores it.
  4. Residual design changed deep vision backbones permanently; later CNNs and even Transformer variants reuse the same skip-connection idea.
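The compute saving from the bottleneck design (idea 3) can be checked with a quick parameter count. A minimal sketch in plain Python, using the 256-channel stage widths from the paper (biases and batch norm ignored for simplicity):

```python
def conv_params(k, c_in, c_out):
    # Weight count of a k x k convolution from c_in to c_out channels.
    return k * k * c_in * c_out

# Plain block: two 3x3 convolutions at the full 256-channel width.
plain = 2 * conv_params(3, 256, 256)

# Bottleneck block: 1x1 reduces 256 -> 64, 3x3 runs at 64, 1x1 restores 64 -> 256.
bottleneck = (conv_params(1, 256, 64)
              + conv_params(3, 64, 64)
              + conv_params(1, 64, 256))

print(plain)       # 1179648
print(bottleneck)  # 69632
```

At the same input/output width, the bottleneck uses roughly 17x fewer parameters per block, which is what makes 100+ layer networks affordable.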

Notes

  • ResNet won ILSVRC 2015 and showed that 152 layers could train reliably.
  • The key claim is not just “deeper is better” but “deeper becomes trainable if optimization can fall back to identity.”
  • Skip connections later became standard far beyond vision: sequence models, multimodal models, diffusion backbones, and modern MLP stacks all rely on residual pathways.
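The "fall back to identity" point is easy to see concretely: when the residual branch's weights are zero, the block's output is exactly its input, so extra blocks can never make the mapping worse. A minimal numpy sketch (dense layers stand in for convolutions; shapes and names are illustrative, not from the paper's code):

```python
import numpy as np

def residual_block(x, w1, w2):
    # y = x + F(x), where F is a small two-layer branch with a ReLU in between.
    return x + w2 @ np.maximum(w1 @ x, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w1 = rng.standard_normal((8, 8))

# Zeroing the branch's last layer collapses F(x) to 0, so the block is
# the identity mapping regardless of w1 or the input.
w2 = np.zeros((8, 8))
assert np.allclose(residual_block(x, w1, w2), x)
```

A plain (non-residual) stack has no such easy identity solution, which is one way to read the degradation problem described in the notes above.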