Language Models
What Is a Language Model
A language model is a probability distribution over sequences of tokens. Given a sequence of tokens, it assigns a probability to each possible next token:
P("mat" | "The cat sat on the") = 0.71
P("table" | "The cat sat on the") = 0.23
P("floor" | "The cat sat on the") = 0.04
This seemingly simple objective — predicting the next token — is extraordinarily powerful. To predict well, the model must understand grammar, semantics, world facts, reasoning patterns, and context. These emergent capabilities are what make large language models useful.
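The next-token distribution above can be estimated directly from counts. A minimal sketch using a toy corpus and a bigram (2-gram) model, the simplest member of the n-gram family from the table below (real LMs train on billions of tokens):

```python
from collections import Counter, defaultdict

# Toy corpus; a real LM is trained on billions of tokens.
corpus = "the cat sat on the mat . the cat sat on the table .".split()

# Count bigrams to estimate P(next | previous): a 2-gram language model.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_probs(prev):
    """Return the estimated distribution over the next token."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

print(next_token_probs("the"))  # {'cat': 0.5, 'mat': 0.25, 'table': 0.25}
```

Neural language models replace these raw counts with learned representations, which is what lets them generalize to contexts never seen verbatim in training.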
The Evolution of Language Modeling
| Era | Approach | Scale | Key Insight |
|---|---|---|---|
| 1950s-90s | N-gram models (count word sequences) | Small corpora | Simple statistics work |
| 2003 | Neural language models (Bengio et al.) | 10M params | First neural LM, learned representations |
| 2013 | Word2Vec (Mikolov et al.) | ~1B words | Dense embeddings capture semantics |
| 2017 | Attention is All You Need (Transformer) | 65M params | Parallelization + attention beats RNNs |
| 2018 | BERT (Devlin et al.) | 110M-340M params | Bidirectional = better understanding |
| 2019 | GPT-2 (Radford et al.) | 1.5B params | Scale + autoregressive = emergent abilities |
| 2020 | GPT-3 (Brown et al.) | 175B params | In-context learning emerges at scale |
| 2022+ | Instruction tuning + RLHF | Billions | Helpful, harmless, honest |
| 2024+ | Reasoning models (o1, R1) | Test-time compute | Scaling chain-of-thought at inference improves reasoning |
Key paper: Bengio et al. (2003) — “A Neural Probabilistic Language Model” — http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
Two Architectures
Masked Language Models (BERT-style)
Mask random tokens and predict them from both left and right context. The model sees the full sequence simultaneously.
Input: "The cat [MASK] on the mat"
Output: Predict "sat" using full context
Good for: Classification, named entity recognition, sentence embeddings, fill-in-the-blank tasks.
Limitations: Cannot generate long sequences naturally — it’s designed for understanding tasks.
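The masking step itself is a concrete procedure. A sketch of BERT-style corruption, assuming the rates reported in the BERT paper: ~15% of tokens become prediction targets, and of those, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged:

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Corrupt a token sequence for masked-LM pre-training."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            targets.append(tok)          # the model must predict this token
            r = rng.random()
            if r < 0.8:
                corrupted.append("[MASK]")
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # random replacement
            else:
                corrupted.append(tok)    # unchanged, but still a target
        else:
            targets.append(None)         # not a prediction target
            corrupted.append(tok)
    return corrupted, targets

tokens = "the cat sat on the mat".split()
corrupted, targets = mask_tokens(tokens, vocab=tokens)
```

Keeping 10% of targets unchanged forces the model to produce useful representations for every position, not only for visibly masked ones.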
Autoregressive Language Models (GPT-style)
Predict the next token using only left context. Generate sequences token by token.
Input: "The cat sat"
Output: Predict "on" given "The cat sat"
Predict "the" given "The cat sat on"
Predict "mat" given "The cat sat on the"
Good for: Text generation, chat, code synthesis, any generation task.
Limitations: Can only look backward in the sequence (causal attention).
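The token-by-token loop above can be sketched with greedy decoding, using a hypothetical lookup-table "model" in place of a trained network (the loop is the same either way):

```python
# Hypothetical toy "model": next-token probabilities given the last token.
# A real LM would condition on the full left context via causal attention.
model = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 0.9, "ran": 0.1},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
}

def generate(prompt, max_tokens=5):
    """Greedy autoregressive decoding: append the argmax token, repeat."""
    tokens = prompt.split()
    for _ in range(max_tokens):
        dist = model.get(tokens[-1])
        if dist is None:                        # no known continuation: stop
            break
        tokens.append(max(dist, key=dist.get))  # greedy: pick the argmax
    return " ".join(tokens)

print(generate("the"))  # → "the cat sat on the cat"
```

Sampling-based decoding replaces the argmax with a draw from the distribution, trading determinism for diversity.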
Encoder-Decoder (T5, BART)
The encoder reads the full input bidirectionally (like a masked LM); the decoder then generates a variable-length output token by token (like an autoregressive LM).
Input: "Translate to French: The cat sat on the mat"
Output: "Le chat était assis sur le tapis"
Good for: Translation, summarization, question answering with long inputs.
Scaling Laws
Kaplan et al. (2020) established the empirical scaling laws:
Loss falls as a smooth power law in model size N, dataset size D, and compute C (each law measured with the other factors unconstrained):
L(N) ≈ (Nc/N)^0.076, L(D) ≈ (Dc/D)^0.095, L(C) ≈ (Cc/C)^0.050
Key findings:
- Loss improves smoothly and predictably as parameters, data, and compute grow, with diminishing returns (small power-law exponents)
- Larger models are more sample-efficient: at fixed compute, it is better to train a bigger model than to train a smaller one to convergence
- Performance depends far more strongly on scale than on architectural details such as depth vs. width
- These smooth trends describe loss; task-level abilities (in-context learning, chain-of-thought) can still appear abruptly at certain scales
Key paper: Kaplan et al. (2020) — “Scaling Laws for Neural Language Models” — https://arxiv.org/abs/2001.08361
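The parameter law is easy to evaluate numerically. A sketch using the paper's approximate fitted constants (the data and compute laws have the same power-law form with their own constants):

```python
# Kaplan et al.'s parameter scaling law, L(N) = (N_c / N)**alpha_N,
# with approximate fitted constants from the paper.
ALPHA_N = 0.076
N_C = 8.8e13  # fitted constant, in non-embedding parameters

def loss_from_params(n):
    """Predicted loss for a model with n non-embedding parameters."""
    return (N_C / n) ** ALPHA_N

# Each 10x in parameters multiplies loss by 10**-0.076 ≈ 0.84:
# steady relative gains, diminishing absolute ones.
for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {loss_from_params(n):.3f}")
```

The small exponent is the "diminishing returns" in the findings above: a 10x bigger model buys only about a 16% relative reduction in loss.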
Emergent Capabilities
At sufficient scale, language models develop capabilities not present in smaller models:
| Capability | Approximate Scale | Example |
|---|---|---|
| Word-in-context | Any | Use a word correctly after seeing its definition |
| Chain-of-thought | ~10B+ | Multi-step reasoning visible in output |
| In-context learning | ~10B+ | Learn task from examples in prompt |
| Arithmetic | ~100B+ | Multi-digit addition, word problems |
| Code generation | ~1B+ | Write syntactically correct Python |
| Translation | ~1B+ | Cross-lingual transfer |
Important: apparent emergence is often an artifact of discontinuous metrics such as exact match; under smooth metrics, the same capabilities tend to improve more gradually.
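The metric artifact is easy to see with a toy calculation: suppose per-token accuracy p improves smoothly with scale, but the task is scored by exact match over a 10-token answer (every token must be right):

```python
# Smooth per-token accuracy vs. discontinuous-looking exact match.
# If each of the answer's tokens is independently right with prob p,
# the whole answer is right with prob p ** answer_len.
answer_len = 10
points = {p: p ** answer_len for p in (0.5, 0.7, 0.9, 0.95, 0.99)}
for p, em in points.items():
    print(f"per-token acc {p:.2f} -> exact match {em:.3f}")
# Exact match stays near zero, then shoots up as p nears 1, even though
# the underlying per-token accuracy improved gradually all along.
```

Plotted against scale, the exact-match curve looks like a sudden "emergent" jump; the per-token curve underneath it does not.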
Modern LLMs
What Makes Current LLMs Different
GPT-3 (2020) was the first to show that scale alone (175B params) produced emergent in-context learning — the ability to perform new tasks from examples in the prompt without fine-tuning.
Modern LLMs (GPT-4, Claude, LLaMA, Gemini) add:
- Instruction tuning — Fine-tune on human-written instruction-response pairs
- RLHF/DPO — Align with human preferences (see RLHF and Alignment)
- Chain-of-thought prompting — Models learn to reason step by step
- Multimodal — Process images, audio, video alongside text
- Long context — 128K+ token windows (Gemini 1.5 Pro: 1M tokens)
The Assistant Paradigm
The standard format for chat-capable LLMs:
System: "You are a helpful assistant."
User: "What is 2+2?"
Model: "2+2 equals 4."
The system prompt sets the model’s persona and behavior. The model then responds to each user message while maintaining conversation context.
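This conversation structure is typically represented as a list of role-tagged messages, then flattened into a single token sequence the model continues autoregressively. A sketch with a hypothetical template (the actual special tokens vary by model):

```python
# The assistant paradigm as structured messages, in the role/content
# shape used by most chat APIs.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
]

def to_prompt(messages):
    """Flatten messages into one prompt string using a hypothetical
    <|role|> template; real chat templates differ per model."""
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
    # End with an open assistant turn for the model to complete.
    return "\n".join(parts) + "\n<|assistant|>\n"

print(to_prompt(messages))
```

The trailing open assistant tag is what turns "chat" back into plain next-token prediction: the model simply continues the sequence from there.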
Key Papers
- Bengio et al. (2003) — Neural LM foundation — http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
- Mikolov et al. (2013) — Word2Vec — https://arxiv.org/abs/1301.3781
- Vaswani et al. (2017) — Attention/Transformer — https://arxiv.org/abs/1706.03762
- Devlin et al. (2019) — BERT — https://arxiv.org/abs/1810.04805
- Radford et al. (2019) — GPT-2 — https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- Brown et al. (2020) — GPT-3/in-context learning — https://arxiv.org/abs/2005.14165
- Kaplan et al. (2020) — Scaling laws — https://arxiv.org/abs/2001.08361
Links
- Transformers — The architecture underlying modern language models
- BERT and Masked Language Models — The encoder-only approach
- Text Generation — How autoregressive models produce output
- Fine-Tuning LLMs — Adapting base models to tasks
- RLHF and Alignment — Making models helpful and safe
- Prompt Engineering — Getting the best outputs
- In-Context Learning — The surprising ability to learn from examples