Scaling Laws
What
Model performance improves predictably with more compute, data, and parameters following power-law relationships. This means you can forecast how good a model will be before training it — and decide how to allocate your budget.
Power-law relationships
Kaplan et al. (2020) at OpenAI showed that loss decreases as a smooth power law in three variables:
L(N) ~ N^(-0.076) # N = number of parameters
L(D) ~ D^(-0.095) # D = dataset size (tokens)
L(C) ~ C^(-0.050) # C = compute budget (FLOPs)
Double the parameters? Loss drops predictably. Ten-x the compute? You can estimate how much better the model gets before you train. These relationships hold over many orders of magnitude, from millions to hundreds of billions of parameters.
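These fits can be used directly. A minimal sketch of Kaplan-style extrapolation in parameter count; the constant N_c ≈ 8.8e13 and exponent 0.076 are fitted values reported by Kaplan et al., plugged in here purely for illustration:

```python
def loss_from_params(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Predicted cross-entropy loss (nats/token) as a power law in parameter count."""
    return (n_c / n_params) ** alpha

# Doubling the parameter count multiplies the loss by 2**(-0.076) ≈ 0.949,
# i.e. a predictable ~5% drop, regardless of the starting size.
ratio = loss_from_params(2e9) / loss_from_params(1e9)
print(f"loss ratio after doubling params: {ratio:.3f}")
```

Because this is a pure power law, the relative improvement from doubling N is the same at every scale, which is exactly why extrapolation to untrained model sizes works.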
Chinchilla: the wake-up call
Hoffmann et al. (2022) at DeepMind trained 400+ models and found that most large models were seriously undertrained. The compute-optimal ratio:
optimal tokens ~ 20 * parameters
# GPT-3 (175B params) trained on 300B tokens → undertrained (the rule suggests ~3.5T)
# Chinchilla (70B params) trained on 1.4T tokens → outperformed GPT-3 and the 4x-larger Gopher
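A quick sanity check of the 20-tokens-per-parameter heuristic against the two models above (the helper name is my own):

```python
# The ~20x ratio is a rule of thumb from Hoffmann et al., not an exact constant.
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Roughly compute-optimal token count for a given parameter count."""
    return tokens_per_param * n_params

print(chinchilla_optimal_tokens(175e9) / 1e12)  # GPT-3 "wanted" ~3.5T tokens, got 0.3T
print(chinchilla_optimal_tokens(70e9) / 1e12)   # Chinchilla: ~1.4T tokens, which it got
```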
This shifted the field: instead of making models bigger and bigger with limited data, train smaller models on much more data. LLaMA, Mistral, and most modern open models follow this insight — smaller but trained on trillions of tokens.
Compute-optimal training
Given a fixed compute budget C, there’s an optimal split between model size N and data D:
N_opt ~ C^0.5 # scale params with sqrt of compute
D_opt ~ C^0.5 # scale data equally
In other words, scale parameters and tokens together: each grows with the square root of compute, rather than compute being split between them. Before Chinchilla, the field scaled model size much faster than data.
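The two rules connect through the standard training-cost estimate C ≈ 6·N·D FLOPs. A sketch combining that estimate with the ~20 tokens/param ratio (both are approximations, and the function name is my own):

```python
import math

def compute_optimal_split(flops: float, tokens_per_param: float = 20.0):
    """Split a fixed FLOP budget into model size and token count.

    With C = 6*N*D and D = 20*N, we get C = 120*N^2, so N = sqrt(C/120).
    """
    n_opt = math.sqrt(flops / (6 * tokens_per_param))
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt

# Chinchilla's budget was roughly 5.8e23 FLOPs:
n, d = compute_optimal_split(5.8e23)
print(f"params ~ {n/1e9:.0f}B, tokens ~ {d/1e12:.1f}T")  # params ~ 70B, tokens ~ 1.4T
```

Note both N_opt and D_opt come out proportional to sqrt(C), matching the exponents above.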
Emergence
Some capabilities appear to “turn on” suddenly at certain scales: models below a threshold score near zero on a task, then jump to high accuracy. Examples: multi-step arithmetic, chain-of-thought reasoning, code generation.
But this is debated. Schaeffer et al. (2023) argued that emergence is partly an artifact of how we measure — switch from accuracy (discontinuous) to log-likelihood (continuous) and the “sudden” jump smooths out. The reality is probably somewhere in between: capabilities improve gradually, but usability thresholds are real.
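Schaeffer et al.'s measurement argument is easy to reproduce with a toy model: assume per-token accuracy improves smoothly with scale, and a task counts as solved only if all k answer tokens are correct. The smooth p(scale) curve below is invented purely for illustration:

```python
import math

k = 10                                            # answer length in tokens
scales = [10 ** s for s in range(6)]              # hypothetical model "sizes"
p = [1 - 0.5 * size ** -0.1 for size in scales]   # smooth per-token improvement

exact_match = [pi ** k for pi in p]               # looks like a sudden jump
log_like = [k * math.log(pi) for pi in p]         # improves smoothly throughout

for size, acc, ll in zip(scales, exact_match, log_like):
    print(f"scale={size:>7} exact-match={acc:.3f} log-likelihood={ll:.2f}")
```

On the exact-match metric the model sits near zero for most of the range, then climbs; the log-likelihood of the same answers improves steadily the whole time.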
Landmark models
| Model | Year | Params | Tokens | Key insight |
|---|---|---|---|---|
| GPT-2 | 2019 | 1.5B | 40B | Scale unlocks generation quality |
| GPT-3 | 2020 | 175B | 300B | Few-shot learning emerges at scale |
| Chinchilla | 2022 | 70B | 1.4T | Compute-optimal = more data, fewer params |
| LLaMA | 2023 | 7-65B | 1-1.4T | Open model following Chinchilla ratios |
| Mistral 7B | 2023 | 7B | undisclosed | Punches above its weight with good data |
| LLaMA 3 | 2024 | 8-405B | 15T+ | Massive data scaling on efficient models |
Why it matters
Scaling laws let you plan: estimate the compute needed for a target performance, decide model size vs data tradeoffs, and predict whether scaling further is worth the cost. They’re the closest thing ML has to engineering equations.
Key papers
- Scaling Laws for Neural Language Models (Kaplan et al., 2020)
- Training Compute-Optimal Large Language Models (Hoffmann et al., 2022) — Chinchilla
Links
- Transformers — the architecture these laws were measured on
- Language Models — what’s being scaled
- Key Papers
- Modern AI Techniques — section roadmap