Scaling Laws

What

Model performance improves predictably with more compute, data, and parameters following power-law relationships. This means you can forecast how good a model will be before training it — and decide how to allocate your budget.

Power-law relationships

Kaplan et al. (2020) at OpenAI showed that loss decreases as a smooth power law in three variables:

L(N) ~ N^(-0.076)    # N = number of parameters
L(D) ~ D^(-0.095)    # D = dataset size (tokens)
L(C) ~ C^(-0.050)    # C = compute budget (FLOPs)
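A minimal sketch of how these exponents are used in practice. The multiplicative constants are omitted on purpose: power laws let you predict *ratios* of losses without knowing them. Exponents are the Kaplan et al. values quoted above; the helper function and numbers printed are illustrative.

```python
ALPHA_N = 0.076  # parameter exponent (Kaplan et al., 2020)
ALPHA_D = 0.095  # data exponent
ALPHA_C = 0.050  # compute exponent

def loss_ratio(scale_factor: float, alpha: float) -> float:
    """Relative loss after scaling one variable by `scale_factor`,
    assuming the other variables are not the bottleneck."""
    return scale_factor ** (-alpha)

# Doubling parameters shrinks loss to ~95% of its previous value...
print(f"2x params  : loss x {loss_ratio(2, ALPHA_N):.3f}")
# ...while 10x compute shrinks it to ~89%.
print(f"10x compute: loss x {loss_ratio(10, ALPHA_C):.3f}")
```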

Double the parameters? Loss drops predictably. Multiply compute by ten? You can estimate in advance roughly how much better the model gets. These relationships hold over many orders of magnitude, from millions to hundreds of billions of parameters.

Chinchilla: the wake-up call

Hoffmann et al. (2022) at DeepMind trained 400+ models and found that most large models were seriously undertrained. The compute-optimal ratio:

optimal tokens ~ 20 * parameters
# GPT-3 (175B params) trained on 300B tokens → undertrained
# Chinchilla (70B params) trained on 1.4T tokens → matched GPT-3
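The 20:1 rule of thumb above can be checked directly against the two models it was framed around. A quick sketch (the exact ratio varies with the fitted scaling coefficients; 20 is the commonly quoted round number):

```python
def optimal_tokens(params: float) -> float:
    """Compute-optimal training tokens per the ~20 tokens/parameter rule."""
    return 20 * params

gpt3_params, gpt3_tokens = 175e9, 300e9
chinchilla_params, chinchilla_tokens = 70e9, 1.4e12

# GPT-3 would have needed ~3.5T tokens by this rule but saw only 300B.
print(f"GPT-3 optimal tokens     : {optimal_tokens(gpt3_params):.2e}")
# Chinchilla: 70B * 20 = 1.4T, which is exactly its training budget.
print(f"Chinchilla optimal tokens: {optimal_tokens(chinchilla_params):.2e}")
```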

This shifted the field: instead of making models bigger and bigger with limited data, train smaller models on much more data. LLaMA, Mistral, and most modern open models follow this insight — smaller but trained on trillions of tokens.

Compute-optimal training

Given a fixed compute budget C, there’s an optimal split between model size N and data D:

N_opt ~ C^0.5    # scale params with sqrt of compute
D_opt ~ C^0.5    # scale data equally

Scale model size and dataset size in equal proportion: each doubling of compute buys roughly a √2-times larger model trained on √2-times more data. Before Chinchilla, the field spent disproportionately on model size.

Emergence

Some capabilities appear to “turn on” suddenly at certain scales: models below a threshold score near zero on a task, then jump to high accuracy. Examples: multi-step arithmetic, chain-of-thought reasoning, code generation.

But this is debated. Schaeffer et al. (2023) argued that emergence is partly an artifact of how we measure — switch from accuracy (discontinuous) to log-likelihood (continuous) and the “sudden” jump smooths out. The reality is probably somewhere in between: capabilities improve gradually, but usability thresholds are real.
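A toy illustration of the Schaeffer et al. argument: a smoothly improving per-token success probability can look “emergent” under a discontinuous metric like exact match on a multi-token answer. All numbers here are made up for illustration.

```python
scales = [1, 2, 4, 8, 16, 32]                      # model scale (arbitrary units)
per_token = [0.50, 0.62, 0.74, 0.84, 0.92, 0.97]   # smooth, gradual improvement

ANSWER_LEN = 10  # exact match requires all 10 answer tokens to be correct

for s, p in zip(scales, per_token):
    exact_match = p ** ANSWER_LEN  # the discontinuous-looking metric
    print(f"scale {s:>2}: per-token {p:.2f} -> exact match {exact_match:.3f}")

# Per-token accuracy climbs steadily, but exact match stays near zero until
# the largest scales, then "jumps" -- no true discontinuity is needed.
```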

Landmark models

Model         Year   Params    Tokens        Key insight
GPT-2         2019   1.5B      40B           Scale unlocks generation quality
GPT-3         2020   175B      300B          Few-shot learning emerges at scale
Chinchilla    2022   70B       1.4T          Compute-optimal = more data, fewer params
LLaMA         2023   7-65B     1-1.4T        Open models following Chinchilla ratios
Mistral 7B    2023   7B        undisclosed   Punches above its weight with good data
LLaMA 3       2024   8-405B    15T+          Massive data scaling on efficient models

Why it matters

Scaling laws let you plan: estimate the compute needed for a target performance, decide model size vs data tradeoffs, and predict whether scaling further is worth the cost. They’re the closest thing ML has to engineering equations.

Key papers

  • Scaling Laws for Neural Language Models (Kaplan et al., 2020)
  • Training Compute-Optimal Large Language Models (Hoffmann et al., 2022) — Chinchilla