Key Papers
Read these to understand the foundations and frontiers of modern AI. Ordered by topic, not chronology.
Transformers & Attention
- Attention Is All You Need (Vaswani et al., 2017) — the transformer architecture · arXiv:1706.03762
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2019, NAACL) — encoder-only, masked LM · arXiv:1810.04805
- Language Models are Unsupervised Multitask Learners (Radford et al., 2019) — GPT-2, emergent zero-shot · OpenAI PDF
- Language Models are Few-Shot Learners (Brown et al., 2020, NeurIPS) — GPT-3, in-context learning emergence · arXiv:2005.14165
- The Llama 3 Herd of Models (Grattafiori et al., 2024) — 405B open-weight, matches GPT-4 Turbo · arXiv:2407.21783
- Llama 2: Open Foundation and Fine-Tuned Chat Models (Touvron et al., 2023) — open-source RLHF · arXiv:2307.09288
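The core operation shared by everything in this section is scaled dot-product attention from the Vaswani et al. paper: softmax(QK^T / √d_k)V. A minimal sketch in plain Python (toy dimensions, single head, no batching or masking):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of vectors (lists of floats); K and V have equal length.
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted average of the values.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# One query attending over two key/value pairs.
print(attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]))
```

Multi-head attention in the paper simply runs this in parallel over learned projections of Q, K, V and concatenates the results.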
Reasoning & Chain-of-Thought
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022) — CoT discovery · arXiv:2201.11903
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (Snell et al., 2024) — adaptive test-time compute; a smaller model given more inference compute can beat a larger one · arXiv:2408.03314
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-AI, 2025) — GRPO, pure RL emergent reasoning · arXiv:2501.12948
- Let’s Verify Step by Step (Lightman et al., 2023) — process reward models beat outcome models for reasoning · arXiv:2305.20050
Vision
- Deep Residual Learning for Image Recognition (He et al., 2016, CVPR) — ResNet, skip connections, deeper is better · arXiv:1512.03385
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al., 2021, ICLR) — ViT, transformers for images · arXiv:2010.11929
- Segment Anything (Kirillov et al., 2023, ICCV) — SAM, promptable segmentation · arXiv:2304.02643
- Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks (Wang et al., 2023, CVPR) — BEiT-3, unified vision-language · arXiv:2208.10442
- KAN: Kolmogorov-Arnold Networks (Liu et al., 2024) — learnable activations on edges, not nodes · arXiv:2404.19756
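ResNet's key idea fits in one line: learn a residual F(x) and add the input back, y = F(x) + x, so the identity map is easy to represent and gradients flow through the skip connection. A toy sketch (the inner function here is a stand-in for the block's conv layers):

```python
def residual_block(x, f):
    """y = F(x) + x: add the block's output back onto its input.

    x is a feature vector; f is any function of matching shape, standing in
    for the learned transformation (convolutions, norms, activations).
    """
    return [fi + xi for fi, xi in zip(f(x), x)]

# With a small residual, the output stays close to the input.
print(residual_block([1.0, 2.0], lambda v: [0.1 * vi for vi in v]))  # ~[1.1, 2.2]
```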
Object Detection
- Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (Ren et al., 2015, NeurIPS) — two-stage detector · arXiv:1506.01497
- You Only Look Once: Unified, Real-Time Object Detection (Redmon et al., 2016) — YOLO, one-stage detector · arXiv:1506.02640
- End-to-End Object Detection with Transformers (Carion et al., 2020, ECCV) — DETR, transformer-based detection · arXiv:2005.12872
Generative Models
- Generative Adversarial Nets (Goodfellow et al., 2014, NeurIPS) — GANs, adversarial training · arXiv:1406.2661
- Auto-Encoding Variational Bayes (Kingma & Welling, 2014, ICLR) — VAE, latent variable models · arXiv:1312.6114
- Denoising Diffusion Probabilistic Models (Ho et al., 2020, NeurIPS) — DDPM, modern generative baseline · arXiv:2006.11239
- High-Resolution Image Synthesis with Latent Diffusion Models (Rombach et al., 2022, CVPR) — Stable Diffusion, efficiency · arXiv:2112.10752
- Scalable Diffusion Models with Transformers (Peebles & Xie, 2023, ICCV) — DiT, transformer backbone for diffusion · arXiv:2212.09748
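The DDPM forward process has a closed form: x_t = √(ᾱ_t)·x_0 + √(1−ᾱ_t)·ε with ε ~ N(0, I), where ᾱ_t is the cumulative product of (1−β_s) under a noise schedule. A sketch using the linear schedule from Ho et al. (vectors as plain lists; schedule constants from the paper):

```python
import math
import random

def alpha_bar(t, T=1000, beta_start=1e-4, beta_end=0.02):
    # Cumulative product of (1 - beta_s) for s < t under a linear beta
    # schedule, as in Ho et al. (2020).
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (T - 1)
        prod *= 1.0 - beta
    return prod

def noise_sample(x0, t, rng=random):
    """Closed-form forward process: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    abar = alpha_bar(t)
    return [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * rng.gauss(0.0, 1.0)
            for x in x0]

x0 = [1.0, -1.0, 0.5]
print(noise_sample(x0, t=10))   # early step: mostly signal
print(noise_sample(x0, t=900))  # late step: mostly noise
```

Training then amounts to predicting ε from x_t and t; latent diffusion (Stable Diffusion) runs the same process in a compressed latent space rather than pixel space.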
Alignment & Training
- Training language models to follow instructions with human feedback (Ouyang et al., 2022, NeurIPS) — InstructGPT, RLHF pipeline · arXiv:2203.02155
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023, NeurIPS) — DPO, simpler than RLHF · arXiv:2305.18290
- Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022, Anthropic) — self-critique alignment · arXiv:2212.08073
- Scaling Instruction-Finetuned Language Models (Chung et al., 2022) — Flan-T5/PaLM, instruction tuning scaling · arXiv:2210.11416
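DPO's contribution is replacing the RLHF reward model and PPO loop with a single loss on preference pairs: L = −log σ(β[(log π_θ(y_w|x) − log π_ref(y_w|x)) − (log π_θ(y_l|x) − log π_ref(y_l|x))]). A toy sketch with scalar log-probabilities (the numbers below are made up for illustration):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one (chosen, rejected) pair.

    logp_* are total log-probabilities of the chosen (w) and rejected (l)
    responses under the policy; ref_logp_* are the same under the frozen
    reference model. beta controls deviation from the reference.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # -log(sigmoid(beta * margin)): small when the policy prefers y_w over
    # y_l more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already favors the chosen response relative to the reference: low loss.
print(dpo_loss(-12.0, -20.0, -15.0, -16.0))
# Policy favors the rejected response: higher loss.
print(dpo_loss(-20.0, -12.0, -16.0, -15.0))
```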
Efficiency & Scale
- LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2022, ICLR) — parameter-efficient fine-tuning · arXiv:2106.09685
- QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023, NeurIPS) — 4-bit + fine-tuning · arXiv:2305.14314
- Scaling Laws for Neural Language Models (Kaplan et al., 2020) — power-law scaling, compute-optimal · arXiv:2001.08361
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Shazeer et al., 2017, ICLR) — MoE, sparse activation · arXiv:1701.06538
- Mixtral of Experts (Jiang et al., 2024) — 8x7B sparse MoE, ~13B active params per token · arXiv:2401.04088
- DeepSeek-V3 Technical Report (DeepSeek-AI, 2024) — 671B MoE, $5.6M training cost, MLA attention · arXiv:2412.19437
- Distilling the Knowledge in a Neural Network (Hinton et al., 2015) — knowledge distillation · arXiv:1503.02531
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (Dao, 2023, ICLR 2024) — IO-aware attention, ~2x faster than FlashAttention · arXiv:2307.08691
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies (Hu et al., 2024) — 1B-3B params competitive with larger models · arXiv:2404.06395
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone (Abdin et al., 2024, Microsoft) — 3.8B params, capability via data quality · arXiv:2404.14219
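LoRA freezes the pretrained weight W and learns only a low-rank update ΔW = B·A, scaled by α/r, so the trainable parameter count drops from d_out·d_in to r·(d_in + d_out). A pure-Python sketch of the forward pass (toy shapes; real implementations apply this per attention/MLP projection):

```python
def matvec(M, x):
    # Multiply matrix M (list of rows) by vector x.
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=16, r=2):
    """h = W x + (alpha / r) * B (A x), with W frozen and only A, B trained.

    W: d_out x d_in (frozen), A: r x d_in, B: d_out x r. The update B @ A
    has the full d_out x d_in shape but rank at most r.
    """
    h = matvec(W, x)                  # frozen pretrained path
    delta = matvec(B, matvec(A, x))   # low-rank adapter path
    return [hi + (alpha / r) * di for hi, di in zip(h, delta)]

# Toy shapes: d_in = 3, d_out = 2, rank r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[0.1, 0.1, 0.1]]   # 1 x 3, trained
B = [[0.2], [0.0]]      # 2 x 1, trained (initialized to zero in the paper)
print(lora_forward(W, A, B, [1.0, 2.0, 3.0], alpha=2, r=1))
```

QLoRA keeps the same adapter structure but stores the frozen W in 4-bit precision, which is what makes fine-tuning large models on a single GPU feasible.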
Multimodal
- Learning Transferable Visual Models From Natural Language Supervision (Radford et al., 2021, ICML) — CLIP, contrastive vision-language · arXiv:2103.00020
- Gemini: A Family of Highly Capable Multimodal Models (Gemini Team, Google DeepMind, 2023) — natively multimodal across text, image, audio, and video · arXiv:2312.11805
- GPT-4V(ision) System Card (OpenAI, 2023) — vision-language GPT-4 · OpenAI PDF
- BLIP-2: Bootstrapping Language-Image Pre-training (Li et al., 2023, ICML) — frozen LLM + visual encoder · arXiv:2301.12597
Sequence Modeling (State Space Models)
- Efficiently Modeling Long Sequences with Structured State Spaces (Gu et al., 2022, ICLR) — S4, long-range dependencies · arXiv:2111.00396
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Gu & Dao, 2023) — selective SSM, closes gap with transformers · arXiv:2312.00752
- Transformers are SSMs (Dao & Gu, 2024, ICML) — Mamba-2, state space duality, 8x faster · arXiv:2405.21060
- RWKV: Reinventing RNNs for the Transformer Era (Peng et al., 2023, EMNLP) — linear attention, O(1) inference · arXiv:2305.13048
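The papers in this section share one recurrence: a discretized linear state-space model h_t = Ā·h_{t−1} + B̄·x_t, y_t = C·h_t, which runs in constant memory per step at inference. A scalar sketch (S4/Mamba use vector states and matrix A, B, C, and Mamba makes them input-dependent, but the recurrence has this shape):

```python
def ssm_scan(xs, a=0.9, b=0.1, c=1.0):
    """Run the discrete linear SSM  h_t = a*h_{t-1} + b*x_t,  y_t = c*h_t.

    Scalar state for clarity; a < 1 gives exponentially decaying memory
    of past inputs, and each step costs O(1) time and memory.
    """
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x   # state update: fold the new input into the memory
        ys.append(c * h)    # readout
    return ys

# Impulse response: a unit input whose effect decays geometrically.
print(ssm_scan([1.0, 0.0, 0.0, 0.0]))
```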
Agents & Tool Use
- ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022) — reasoning + acting loop · arXiv:2210.03629
- Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., 2023) — verbal reflection improves agents · arXiv:2303.11366
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models (Yao et al., 2023) — exploration over reasoning trees · arXiv:2305.10601
- Tool Learning with Foundation Models (Qin et al., 2023) — when and how models should use tools · arXiv:2304.08354
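The ReAct pattern interleaves Thought → Action → Observation steps until the model emits a final answer. A deliberately toy sketch (the step format, tool registry, and scripted "LLM" below are all made up for illustration; real agents parse model output far more robustly):

```python
def react_loop(question, llm, tools, max_steps=5):
    """Toy ReAct loop: alternate reasoning and tool calls until an answer.

    llm maps the transcript so far to the next step string, e.g.
    "Thought: ...", "Action: calc[2+2]", or "Answer: 4".
    tools maps action names to callables taking the bracketed argument.
    """
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        if step.startswith("Answer:"):
            return step[len("Answer:"):].strip()
        if step.startswith("Action:"):
            name, _, arg = step[len("Action:"):].strip().partition("[")
            obs = tools[name](arg.rstrip("]"))   # run the tool
            transcript += f"Observation: {obs}\n"
    return None

# Scripted stand-in for a model, plus a calculator tool (eval is fine for a toy).
steps = iter(["Thought: I need arithmetic.", "Action: calc[2+2]", "Answer: 4"])
print(react_loop("What is 2+2?", lambda t: next(steps),
                 {"calc": lambda expr: eval(expr)}))
```

Reflexion and Tree of Thoughts build on the same loop: the former feeds verbal critiques of failed episodes back into the transcript, the latter branches it into a search tree.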
Speech & Audio
- HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units (Hsu et al., 2021, NeurIPS) — self-supervised speech · arXiv:2106.07447
- Robust Speech Recognition via Large-Scale Weak Supervision (Radford et al., 2023, ICML) — Whisper, OpenAI multilingual ASR · arXiv:2212.04356
Graph Neural Networks
- Semi-Supervised Classification with Graph Convolutional Networks (Kipf & Welling, 2017, ICLR) — GCN · arXiv:1609.02907
- Inductive Representation Learning on Large Graphs (Hamilton et al., 2017, NeurIPS) — GraphSAGE · arXiv:1706.02216
How to Read a Paper
- Abstract + Conclusion first — get the main contribution and results
- Figures and tables — often contain the key insights in accessible form
- Introduction — motivation and problem statement
- Method — focus on the key insight, not every equation
- Results — what was compared, what improved, by how much?
- Related work — where does this fit in the landscape?
- Implementation details — only if you want to reproduce
Reading Roadmap
Week 1 — Foundations: Attention (1706.03762) → BERT (1810.04805) → GPT-2 (OpenAI PDF) → GPT-3 (2005.14165)
Week 2 — Alignment: InstructGPT (2203.02155) → DPO (2305.18290) → Constitutional AI (2212.08073)
Week 3 — Scaling & Efficiency: Scaling Laws (2001.08361) → LoRA (2106.09685) → QLoRA (2305.14314) → FlashAttention (2307.08691)
Week 4 — Generative Models: DDPM (2006.11239) → Stable Diffusion (2112.10752) → DiT (2212.09748)
Week 5 — Reasoning: Chain-of-Thought (2201.11903) → Test-Time Compute (2408.03314) → DeepSeek-R1 (2501.12948)
Week 6 — Modern Architectures: ViT (2010.11929) → CLIP (2103.00020) → Mamba (2312.00752) → Mixtral (2401.04088)