Scaling Laws for Neural Language Models
Kaplan et al. (2020)
Why It Matters
Loss scales as a power law in model size, dataset size, and training compute, giving empirical justification for the scaling paradigm that drives modern LLM development.
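The model-size form of the law can be sketched directly. The exponent and scale constant below are approximate values reported in the paper (alpha_N ~ 0.076, N_c ~ 8.8e13 non-embedding parameters); treat them as illustrative rather than exact:

```python
# Kaplan-style power-law loss in model size, holding data and compute
# unconstrained. Constants are approximate values from the paper.
ALPHA_N = 0.076     # power-law exponent for model size (approx.)
N_C = 8.8e13        # scale constant in non-embedding parameters (approx.)

def loss_from_params(n_params: float) -> float:
    """Predicted cross-entropy loss (nats/token) for a model with
    n_params non-embedding parameters."""
    return (N_C / n_params) ** ALPHA_N

# Loss falls smoothly as parameter count grows by orders of magnitude.
for n in (1e6, 1e8, 1e10):
    print(f"{n:.0e} params -> predicted loss {loss_from_params(n):.2f}")
```

Note that by construction the predicted loss reaches 1.0 exactly when the model size equals N_C.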
Key Ideas
- Language model loss follows predictable power-law trends as model size, dataset size, and compute grow.
- Larger models can be more sample-efficient, so scale changes training efficiency as well as final capability.
- Compute allocation should be based on measured scaling relationships rather than intuition alone.
- The paper makes frontier model planning feel more like curve-fitting engineering than guesswork.
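The "curve-fitting engineering" idea above can be made concrete: measure loss at several small model sizes, fit a power law by linear regression in log-log space, and extrapolate to larger scales. The (size, loss) points below are synthetic numbers chosen for illustration, not measurements from the paper:

```python
import math

# Hypothetical loss measurements from small training runs (synthetic).
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0, 4.2, 3.5, 2.9]

# Fit loss = a * size^(-alpha) by least squares in log-log space:
# log(loss) = log(a) - alpha * log(size).
xs = [math.log(s) for s in sizes]
ys = [math.log(l) for l in losses]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
alpha = -slope
log_a = my - slope * mx

def predict(size: float) -> float:
    """Extrapolated loss at a given model size from the fitted law."""
    return math.exp(log_a) * size ** (-alpha)

# Forecast one order of magnitude beyond the largest measured run.
print(f"alpha ~= {alpha:.3f}, predicted loss at 1e10 params: {predict(1e10):.2f}")
```

The same log-log regression works for the dataset-size and compute axes; in practice one fits all three and plans budgets from the resulting curves.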
Notes
- Later work (notably Hoffmann et al., 2022, the "Chinchilla" paper) revised the exact compute-optimal balance toward more data per parameter, but this work established the central fact that scaling is regular and forecastable.
- It is foundational for capability planning and budget decisions in large-model training.
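For the budget-planning point above, a minimal sketch of compute-optimal allocation under the later revised balance: the rough rule of thumb of about 20 training tokens per parameter, combined with the common approximation that training compute is C ~ 6*N*D FLOPs. Both constants are approximations, and the 1e23 FLOP budget is an arbitrary example value:

```python
# Compute-optimal split of a FLOP budget into model size N and
# training tokens D, assuming D = 20 * N (rule of thumb) and
# C = 6 * N * D (common FLOP approximation).
TOKENS_PER_PARAM = 20.0
FLOPS_PER_PARAM_TOKEN = 6.0

def allocate(compute_flops: float) -> tuple[float, float]:
    """Return (parameters N, training tokens D) for a FLOP budget,
    solving C = 6 * N * (20 * N) for N."""
    n = (compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM)) ** 0.5
    return n, TOKENS_PER_PARAM * n

n, d = allocate(1e23)  # example budget, not from either paper
print(f"N ~ {n:.2e} params, D ~ {d:.2e} tokens")
```

Under these assumptions, both N and D grow as the square root of compute, which is the qualitative shift the later work introduced relative to this paper's more model-heavy recommendation.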