Language Models
What Is a Language Model
A language model is a probability distribution over sequences of tokens. Given a sequence of tokens, it assigns a probability to each possible next token:
P("mat" | "The cat sat on the") = 0.71
P("table" | "The cat sat on the") = 0.23
P("floor" | "The cat sat on the") = 0.04
This seemingly simple objective — predicting the next token — is extraordinarily powerful. To predict well, the model must understand grammar, semantics, world facts, reasoning patterns, and context. These emergent capabilities are what make large language models useful.
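The next-token distribution above can be estimated directly from counts. A minimal sketch using a toy corpus and a bigram (2-gram) model, the simplest member of the n-gram family from the table below (real LMs train on billions of tokens):

```python
from collections import Counter, defaultdict

# Toy corpus; a real LM is trained on billions of tokens.
corpus = "the cat sat on the mat . the cat sat on the table .".split()

# Count bigrams to estimate P(next | previous): a 2-gram language model.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_probs(prev):
    """Return the estimated distribution over the next token."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

print(next_token_probs("the"))  # {'cat': 0.5, 'mat': 0.25, 'table': 0.25}
```

Neural language models replace these raw counts with learned representations, which is what lets them generalize to contexts never seen verbatim in training.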
The Evolution of Language Modeling
| Era | Approach | Scale | Key Insight |
|---|---|---|---|
| 1950s-90s | N-gram models (count word sequences) | Small corpora | Simple statistics work |
| 2003 | Neural language models (Bengio et al.) | 10M params | First neural LM, learned representations |
| 2013 | Word2Vec (Mikolov et al.) | ~1B words | Dense embeddings capture semantics |
| 2017 | Attention is All You Need (Transformer) | 65M params | Parallelization + attention beats RNNs |
| 2018 | BERT (Devlin et al.) | 110M-340M params | Bidirectional = better understanding |
| 2019 | GPT-2 (Radford et al.) | 1.5B params | Scale + autoregressive = emergent abilities |
| 2020 | GPT-3 (Brown et al.) | 175B params | In-context learning emerges at scale |
| 2022+ | Instruction tuning + RLHF | Billions | Helpful, harmless, honest |
| 2024+ | Reasoning models (o1, R1) | Test-time compute | Scaling chain-of-thought at inference improves reasoning |
Key paper: Bengio et al. (2003) — “A Neural Probabilistic Language Model” — http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
Two Architectures
Masked Language Models (BERT-style)
Mask random tokens and predict them from both left and right context. The model sees the full sequence simultaneously.
Input: "The cat [MASK] on the mat"
Output: Predict "sat" using full context
Good for: Classification, named entity recognition, sentence embeddings, fill-in-the-blank tasks.
Limitations: Cannot generate long sequences naturally — it’s designed for understanding tasks.
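The masking step itself is a concrete procedure. A sketch of BERT-style corruption, assuming the rates reported in the BERT paper: ~15% of tokens become prediction targets, and of those, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged:

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Corrupt a token sequence for masked-LM pre-training."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            targets.append(tok)          # the model must predict this token
            r = rng.random()
            if r < 0.8:
                corrupted.append("[MASK]")
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # random replacement
            else:
                corrupted.append(tok)    # unchanged, but still a target
        else:
            targets.append(None)         # not a prediction target
            corrupted.append(tok)
    return corrupted, targets

tokens = "the cat sat on the mat".split()
corrupted, targets = mask_tokens(tokens, vocab=tokens)
```

Keeping 10% of targets unchanged forces the model to produce useful representations for every position, not only for visibly masked ones.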
Autoregressive Language Models (GPT-style)
Predict the next token using only left context. Generate sequences token by token.
Input: "The cat sat"
Output: Predict "on" given "The cat sat"
Predict "the" given "The cat sat on"
Predict "mat" given "The cat sat on the"
Good for: Text generation, chat, code synthesis, any generation task.
Limitations: Can only look backward in the sequence (causal attention).
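The token-by-token loop above can be sketched with greedy decoding, using a hypothetical lookup-table "model" in place of a trained network (the loop is the same either way):

```python
# Hypothetical toy "model": next-token probabilities given the last token.
# A real LM would condition on the full left context via causal attention.
model = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 0.9, "ran": 0.1},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
}

def generate(prompt, max_tokens=5):
    """Greedy autoregressive decoding: append the argmax token, repeat."""
    tokens = prompt.split()
    for _ in range(max_tokens):
        dist = model.get(tokens[-1])
        if dist is None:                        # no known continuation: stop
            break
        tokens.append(max(dist, key=dist.get))  # greedy: pick the argmax
    return " ".join(tokens)

print(generate("the"))  # → "the cat sat on the cat"
```

Sampling-based decoding replaces the argmax with a draw from the distribution, trading determinism for diversity.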
Encoder-Decoder (T5, BART)
The encoder reads the full input bidirectionally (like a masked LM); the decoder then generates a variable-length output token by token (like an autoregressive LM).
Input: "Translate to French: The cat sat on the mat"
Output: "Le chat était assis sur le tapis"
Good for: Translation, summarization, question answering with long inputs.
Scaling Laws
Kaplan et al. (2020) established the empirical scaling laws:
Loss falls as a smooth power law in model size N, dataset size D, and compute C (each law measured with the other factors unconstrained):
L(N) ≈ (Nc/N)^0.076, L(D) ≈ (Dc/D)^0.095, L(C) ≈ (Cc/C)^0.050
Key findings:
- Loss improves smoothly and predictably as parameters, data, and compute grow, with diminishing returns (small power-law exponents)
- Larger models are more sample-efficient: at fixed compute, it is better to train a bigger model than to train a smaller one to convergence
- Performance depends far more strongly on scale than on architectural details such as depth vs. width
- These smooth trends describe loss; task-level abilities (in-context learning, chain-of-thought) can still appear abruptly at certain scales
Key paper: Kaplan et al. (2020) — “Scaling Laws for Neural Language Models” — https://arxiv.org/abs/2001.08361
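The parameter law is easy to evaluate numerically. A sketch using the paper's approximate fitted constants (the data and compute laws have the same power-law form with their own constants):

```python
# Kaplan et al.'s parameter scaling law, L(N) = (N_c / N)**alpha_N,
# with approximate fitted constants from the paper.
ALPHA_N = 0.076
N_C = 8.8e13  # fitted constant, in non-embedding parameters

def loss_from_params(n):
    """Predicted loss for a model with n non-embedding parameters."""
    return (N_C / n) ** ALPHA_N

# Each 10x in parameters multiplies loss by 10**-0.076 ≈ 0.84:
# steady relative gains, diminishing absolute ones.
for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {loss_from_params(n):.3f}")
```

The small exponent is the "diminishing returns" in the findings above: a 10x bigger model buys only about a 16% relative reduction in loss.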
Emergent Capabilities
At sufficient scale, language models develop capabilities not present in smaller models:
| Capability | Approximate Scale | Example |
|---|---|---|
| Word-in-context | Any | Use a word correctly after seeing its definition |
| Chain-of-thought | ~10B+ | Multi-step reasoning visible in output |
| In-context learning | ~10B+ | Learn task from examples in prompt |
| Arithmetic | ~100B+ | Multi-digit addition, word problems |
| Code generation | ~1B+ | Write syntactically correct Python |
| Translation | ~1B+ | Cross-lingual transfer |
Important: apparent emergence is often an artifact of discontinuous metrics such as exact match; under smooth metrics, the same capabilities tend to improve more gradually.
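The metric artifact is easy to see with a toy calculation: suppose per-token accuracy p improves smoothly with scale, but the task is scored by exact match over a 10-token answer (every token must be right):

```python
# Smooth per-token accuracy vs. discontinuous-looking exact match.
# If each of the answer's tokens is independently right with prob p,
# the whole answer is right with prob p ** answer_len.
answer_len = 10
points = {p: p ** answer_len for p in (0.5, 0.7, 0.9, 0.95, 0.99)}
for p, em in points.items():
    print(f"per-token acc {p:.2f} -> exact match {em:.3f}")
# Exact match stays near zero, then shoots up as p nears 1, even though
# the underlying per-token accuracy improved gradually all along.
```

Plotted against scale, the exact-match curve looks like a sudden "emergent" jump; the per-token curve underneath it does not.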
Modern LLMs
What Makes Current LLMs Different
GPT-3 (2020) was the first to show that scale alone (175B params) produced emergent in-context learning — the ability to perform new tasks from examples in the prompt without fine-tuning.
Modern LLMs (GPT-4, Claude, LLaMA, Gemini) add:
- Instruction tuning — Fine-tune on human-written instruction-response pairs
- RLHF/DPO — Align with human preferences (see RLHF and Alignment)
- Chain-of-thought prompting — Models learn to reason step by step
- Multimodal — Process images, audio, video alongside text
- Long context — 128K+ token windows (Gemini 1.5 Pro: 1M tokens)
The Assistant Paradigm
The standard format for chat-capable LLMs:
System: "You are a helpful assistant."
User: "What is 2+2?"
Model: "2+2 equals 4."
The system prompt sets the model’s persona and behavior. The model then responds to each user message while maintaining conversation context.
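This conversation structure is typically represented as a list of role-tagged messages, then flattened into a single token sequence the model continues autoregressively. A sketch with a hypothetical template (the actual special tokens vary by model):

```python
# The assistant paradigm as structured messages, in the role/content
# shape used by most chat APIs.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
]

def to_prompt(messages):
    """Flatten messages into one prompt string using a hypothetical
    <|role|> template; real chat templates differ per model."""
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
    # End with an open assistant turn for the model to complete.
    return "\n".join(parts) + "\n<|assistant|>\n"

print(to_prompt(messages))
```

The trailing open assistant tag is what turns "chat" back into plain next-token prediction: the model simply continues the sequence from there.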
Key Papers
- Bengio et al. (2003) — Neural LM foundation — http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
- Mikolov et al. (2013) — Word2Vec — https://arxiv.org/abs/1301.3781
- Vaswani et al. (2017) — Attention/Transformer — https://arxiv.org/abs/1706.03762
- Devlin et al. (2019) — BERT — https://arxiv.org/abs/1810.04805
- Radford et al. (2019) — GPT-2 — https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- Brown et al. (2020) — GPT-3/in-context learning — https://arxiv.org/abs/2005.14165
- Kaplan et al. (2020) — Scaling laws — https://arxiv.org/abs/2001.08361
Links
- Transformers — The architecture underlying modern language models
- BERT and Masked Language Models — The encoder-only approach
- Text Generation — How autoregressive models produce output
- Fine-Tuning LLMs — Adapting base models to tasks
- RLHF and Alignment — Making models helpful and safe
- Prompt Engineering — Getting the best outputs
- In-Context Learning — The surprising ability to learn from examples