Scaling Laws
What
Model performance improves predictably with more compute, data, and parameters following power-law relationships. This means you can forecast how good a model will be before training it — and decide how to allocate your budget.
Power-law relationships
Kaplan et al. (2020) at OpenAI showed that loss decreases as a smooth power law in three variables:
L(N) ~ N^(-0.076) # N = number of parameters
L(D) ~ D^(-0.095) # D = dataset size (tokens)
L(C) ~ C^(-0.050) # C = compute budget (FLOPs)
Double the parameters? Loss drops predictably. Ten-x the compute? You can estimate how much better the model gets before you train. These relationships hold over many orders of magnitude, from millions to hundreds of billions of parameters.
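These fits can be used directly. A minimal sketch of Kaplan-style extrapolation in parameter count; the constant N_c ≈ 8.8e13 and exponent 0.076 are fitted values reported by Kaplan et al., plugged in here purely for illustration:

```python
def loss_from_params(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Predicted cross-entropy loss (nats/token) as a power law in parameter count."""
    return (n_c / n_params) ** alpha

# Doubling the parameter count multiplies the loss by 2**(-0.076) ≈ 0.949,
# i.e. a predictable ~5% drop, regardless of the starting size.
ratio = loss_from_params(2e9) / loss_from_params(1e9)
print(f"loss ratio after doubling params: {ratio:.3f}")
```

Because this is a pure power law, the relative improvement from doubling N is the same at every scale, which is exactly why extrapolation to untrained model sizes works.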
Chinchilla: the wake-up call
Hoffmann et al. (2022) at DeepMind trained 400+ models and found that most large models were seriously undertrained. The compute-optimal ratio:
optimal tokens ~ 20 * parameters
# GPT-3 (175B params) trained on 300B tokens → undertrained (the rule suggests ~3.5T)
# Chinchilla (70B params) trained on 1.4T tokens → outperformed GPT-3 and the 4x-larger Gopher
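A quick sanity check of the 20-tokens-per-parameter heuristic against the two models above (the helper name is my own):

```python
# The ~20x ratio is a rule of thumb from Hoffmann et al., not an exact constant.
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Roughly compute-optimal token count for a given parameter count."""
    return tokens_per_param * n_params

print(chinchilla_optimal_tokens(175e9) / 1e12)  # GPT-3 "wanted" ~3.5T tokens, got 0.3T
print(chinchilla_optimal_tokens(70e9) / 1e12)   # Chinchilla: ~1.4T tokens, which it got
```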
This shifted the field: instead of making models bigger and bigger with limited data, train smaller models on much more data. LLaMA, Mistral, and most modern open models follow this insight — smaller but trained on trillions of tokens.
Compute-optimal training
Given a fixed compute budget C, there’s an optimal split between model size N and data D:
N_opt ~ C^0.5 # scale params with sqrt of compute
D_opt ~ C^0.5 # scale data equally
In other words, scale parameters and tokens together: each grows with the square root of compute, rather than compute being split between them. Before Chinchilla, the field scaled model size much faster than data.
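The two rules connect through the standard training-cost estimate C ≈ 6·N·D FLOPs. A sketch combining that estimate with the ~20 tokens/param ratio (both are approximations, and the function name is my own):

```python
import math

def compute_optimal_split(flops: float, tokens_per_param: float = 20.0):
    """Split a fixed FLOP budget into model size and token count.

    With C = 6*N*D and D = 20*N, we get C = 120*N^2, so N = sqrt(C/120).
    """
    n_opt = math.sqrt(flops / (6 * tokens_per_param))
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt

# Chinchilla's budget was roughly 5.8e23 FLOPs:
n, d = compute_optimal_split(5.8e23)
print(f"params ~ {n/1e9:.0f}B, tokens ~ {d/1e12:.1f}T")  # params ~ 70B, tokens ~ 1.4T
```

Note both N_opt and D_opt come out proportional to sqrt(C), matching the exponents above.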
Emergence
Some capabilities appear to “turn on” suddenly at certain scales: models below a threshold score near zero on a task, then jump to high accuracy. Examples: multi-step arithmetic, chain-of-thought reasoning, code generation.
But this is debated. Schaeffer et al. (2023) argued that emergence is partly an artifact of how we measure — switch from accuracy (discontinuous) to log-likelihood (continuous) and the “sudden” jump smooths out. The reality is probably somewhere in between: capabilities improve gradually, but usability thresholds are real.
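Schaeffer et al.'s measurement argument is easy to reproduce with a toy model: assume per-token accuracy improves smoothly with scale, and a task counts as solved only if all k answer tokens are correct. The smooth p(scale) curve below is invented purely for illustration:

```python
import math

k = 10                                            # answer length in tokens
scales = [10 ** s for s in range(6)]              # hypothetical model "sizes"
p = [1 - 0.5 * size ** -0.1 for size in scales]   # smooth per-token improvement

exact_match = [pi ** k for pi in p]               # looks like a sudden jump
log_like = [k * math.log(pi) for pi in p]         # improves smoothly throughout

for size, acc, ll in zip(scales, exact_match, log_like):
    print(f"scale={size:>7} exact-match={acc:.3f} log-likelihood={ll:.2f}")
```

On the exact-match metric the model sits near zero for most of the range, then climbs; the log-likelihood of the same answers improves steadily the whole time.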
Landmark models
| Model | Year | Params | Tokens | Key insight |
|---|---|---|---|---|
| GPT-2 | 2019 | 1.5B | 40B | Scale unlocks generation quality |
| GPT-3 | 2020 | 175B | 300B | Few-shot learning emerges at scale |
| Chinchilla | 2022 | 70B | 1.4T | Compute-optimal = more data, fewer params |
| LLaMA | 2023 | 7-65B | 1-1.4T | Open model following Chinchilla ratios |
| Mistral 7B | 2023 | 7B | undisclosed | Punches above its weight with good data |
| LLaMA 3 | 2024 | 8-405B | 15T+ | Massive data scaling on efficient models |
Why it matters
Scaling laws let you plan: estimate the compute needed for a target performance, decide model size vs data tradeoffs, and predict whether scaling further is worth the cost. They’re the closest thing ML has to engineering equations.
Key papers
- Scaling Laws for Neural Language Models (Kaplan et al., 2020)
- Training Compute-Optimal Large Language Models (Hoffmann et al., 2022) — Chinchilla
Links
- Transformers — the architecture these laws were measured on
- Language Models — what’s being scaled
- Key Papers
- Modern AI Techniques — section roadmap