Language Models

What Is a Language Model

A probability distribution over sequences of tokens. Given a sequence of words (or tokens), a language model assigns a probability to the next word:

P("mat" | "The cat sat on the") = 0.71
P("table" | "The cat sat on the") = 0.23
P("mat" | "The cat sat on the table") = 0.89

This seemingly simple objective — predicting the next token — is extraordinarily powerful. To predict well, the model must understand grammar, semantics, world facts, reasoning patterns, and context. These emergent capabilities are what make large language models useful.

The Evolution of Language Modeling

Era       | Approach                                 | Scale            | Key Insight
1950s-90s | N-gram models (count word sequences)     | Small corpora    | Simple statistics work
2003      | Neural language models (Bengio et al.)   | 10M params       | First neural LM, learned representations
2013      | Word2Vec (Mikolov et al.)                | ~1B words        | Dense embeddings capture semantics
2017      | Attention Is All You Need (Transformer)  | 65M params       | Parallelization + attention beats RNNs
2018      | BERT (Devlin et al.)                     | 110M-340M params | Bidirectional = better understanding
2019      | GPT-2 (Radford et al.)                   | 1.5B params      | Scale + autoregressive = emergent abilities
2020      | GPT-3 (Brown et al.)                     | 175B params      | In-context learning emerges at scale
2022+     | Instruction tuning + RLHF                | Billions         | Helpful, harmless, honest
2024+     | Reasoning models (o1, R1)                | -                | Chain-of-thought at inference; test-time compute scaling
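The count-based approach in the table's first row fits in a few lines; a toy bigram sketch (not any particular historical system):

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Estimate P(next | prev) by counting adjacent word pairs."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    # Normalize counts into conditional probabilities
    return {
        prev: {w: c / sum(nxts.values()) for w, c in nxts.items()}
        for prev, nxts in counts.items()
    }

model = train_bigram(["the cat sat on the mat", "the cat sat on the table"])
print(model["the"])  # {'cat': 0.5, 'mat': 0.25, 'table': 0.25}
```

Real n-gram systems condition on longer histories and add smoothing for unseen sequences, but the core idea is exactly this counting.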

Key paper: Bengio et al. (2003) — “A Neural Probabilistic Language Model” — http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

Two Architectures

Masked Language Models (BERT-style)

Mask random tokens and predict them from both left and right context. The model sees the full sequence simultaneously.

Input:  "The cat [MASK] on the mat"
Output: Predict "sat" using full context

Good for: Classification, named entity recognition, sentence embeddings, fill-in-the-blank tasks.

Limitations: Cannot generate long sequences naturally — it’s designed for understanding tasks.
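The masking procedure itself is simple to sketch. Real BERT training also sometimes leaves masked positions unchanged or substitutes random tokens; that refinement is omitted here:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random subset of tokens with [MASK]; return inputs and targets."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append("[MASK]")
            targets.append(tok)   # the model must predict this from both sides
        else:
            inputs.append(tok)
            targets.append(None)  # not scored in the loss
    return inputs, targets

inputs, targets = mask_tokens("The cat sat on the mat".split(), mask_rate=0.3)
print(inputs)  # ['The', 'cat', 'sat', '[MASK]', 'the', 'mat']
```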

Autoregressive Language Models (GPT-style)

Predict the next token using only left context. Generate sequences token by token.

Input:  "The cat sat"
Output: Predict "on" given "The cat sat"
        Predict "the" given "The cat sat on"
        Predict "mat" given "The cat sat on the"

Good for: Text generation, chat, code synthesis, any generation task.

Limitations: Can only look backward in the sequence (causal attention).
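Generation is just this prediction step in a loop. A greedy-decoding sketch, with a toy lookup table standing in for the model (a real LM conditions on the full left context, not only the last word):

```python
# Toy "model": maps the last word to a next-word distribution.
NEXT = {
    "sat": {"on": 0.9, "quietly": 0.1},
    "on":  {"the": 0.95, "a": 0.05},
    "the": {"mat": 0.7, "table": 0.3},
}

def generate(prompt, max_new_tokens=3):
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        dist = NEXT.get(tokens[-1])
        if dist is None:  # no known continuation: stop
            break
        tokens.append(max(dist, key=dist.get))  # greedy: pick the argmax token
    return " ".join(tokens)

print(generate("The cat sat"))  # The cat sat on the mat
```

Sampling from the distribution instead of taking the argmax gives more varied output; greedy decoding is shown for simplicity.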

Encoder-Decoder (T5, BART)

Encode the input, then decode the output. The encoder reads the full input bidirectionally (as in a masked LM), while the decoder generates a variable-length output autoregressively.

Input:  "Translate to French: The cat sat on the mat"
Output: "Le chat était assis sur le tapis"

Good for: Translation, summarization, question answering with long inputs.

Scaling Laws

Kaplan et al. (2020) established empirical power laws for language-model loss:

L(N) ∝ N^(-0.076)   (parameters)
L(D) ∝ D^(-0.095)   (dataset size)
L(C) ∝ C^(-0.050)   (training compute)

Key findings:

  • Loss falls as a smooth power law in each of parameters, data, and compute; more of any resource helps, with diminishing returns
  • Per order of magnitude, data reduces loss slightly faster than parameters (exponent 0.095 vs. 0.076)
  • Total training compute is the cleanest single predictor of loss, since it subsumes both model size and data
  • Separately from the smooth loss trend, some abilities (in-context learning, chain-of-thought) appear unpredictably at certain scales
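Kaplan et al.'s parameter exponent (about 0.076 on loss) makes the diminishing returns concrete; a quick calculation (constant factors omitted, so only loss ratios are meaningful):

```python
def relative_loss(n_params, alpha=0.076):
    """Loss up to a constant factor: L(N) proportional to N^(-alpha)."""
    return n_params ** (-alpha)

# A 10x jump in parameters (1B -> 10B) cuts loss by only about 16%:
improvement = 1 - relative_loss(10e9) / relative_loss(1e9)
print(f"{improvement:.1%}")
```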

Key paper: Kaplan et al. (2020) — “Scaling Laws for Neural Language Models” — https://arxiv.org/abs/2001.08361

Emergent Capabilities

At sufficient scale, language models develop capabilities not present in smaller models:

Capability          | Approximate Scale | Example
Word-in-context     | Any               | Use a word correctly after seeing definition
Chain-of-thought    | ~10B+             | Multi-step reasoning visible in output
In-context learning | ~10B+             | Learn task from examples in prompt
Arithmetic          | ~100B+            | Multi-digit addition, word problems
Code generation     | ~1B+              | Write syntactically correct Python
Translation         | ~1B+              | Cross-lingual transfer

Important: apparent emergence is often an artifact of discontinuous metrics (e.g. exact match). Under smooth metrics such as per-token accuracy, the same capabilities tend to improve gradually.
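A toy calculation shows the metric effect: if per-token accuracy p improves smoothly, an all-or-nothing metric over a k-token answer, p**k, still looks like a sudden jump:

```python
def exact_match(p, k=10):
    """All-or-nothing metric: every one of k tokens must be correct."""
    return p ** k

# Smooth improvement in per-token accuracy...
for p in [0.5, 0.7, 0.9, 0.95, 0.99]:
    # ...produces an abrupt-looking rise in the exact-match metric
    print(f"per-token acc {p:.2f} -> 10-token exact match {exact_match(p):.3f}")
```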

Modern LLMs

What Makes Current LLMs Different

GPT-3 (2020) was the first to show that scale alone (175B params) produced emergent in-context learning — the ability to perform new tasks from examples in the prompt without fine-tuning.

Modern LLMs (GPT-4, Claude, LLaMA, Gemini) add:

  1. Instruction tuning — Fine-tune on human-written instruction-response pairs
  2. RLHF/DPO — Align with human preferences (see RLHF and Alignment)
  3. Chain-of-thought prompting — Models learn to reason step by step
  4. Multimodal — Process images, audio, video alongside text
  5. Long context — 128K+ token windows (Gemini 1.5 Pro: 1M tokens)

The Assistant Paradigm

The standard format for chat-capable LLMs:

System: "You are a helpful assistant."
User:   "What is 2+2?"
Model:  "2+2 equals 4."

The system prompt sets the model’s persona and behavior. The model then responds to each user message while maintaining conversation context.
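In chat APIs this exchange is typically serialized as a list of role-tagged messages. The dict format below follows the widely used OpenAI-style convention; exact field names vary by provider:

```python
messages = [
    {"role": "system",    "content": "You are a helpful assistant."},
    {"role": "user",      "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
    # Each new user turn is appended here; the full list is resent on every
    # call, which is how the model "maintains" conversation context.
]

print([m["role"] for m in messages])  # ['system', 'user', 'assistant']
```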

Key Papers