Transformers
What
The Transformer architecture, introduced in “Attention Is All You Need” (Vaswani et al., 2017), replaced recurrent networks as the dominant architecture for sequence modeling. Its core innovation is the self-attention mechanism — every element of a sequence attends directly to every other element, regardless of distance.
The Transformer is not one architecture but a family:
| Variant | Processes Input | Generates Output | Examples |
|---|---|---|---|
| Encoder-only | Bidirectional (sees full context) | Per-token contextual representations (no generation) | BERT, RoBERTa, DeBERTa |
| Decoder-only | Causal (left-to-right only) | Autoregressive tokens | GPT-2/3/4, LLaMA, Claude |
| Encoder-decoder | Bidirectional | Autoregressive tokens | T5, BART, UL2 |
Architecture
Encoder Block
Each encoder layer has two sublayers:
Input → [Self-Attention] → [Add & Norm] → [Feed-Forward] → [Add & Norm] → Output
1. Multi-Head Self-Attention:
The attention operation maps queries (Q), keys (K), and values (V) to outputs:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Where:
- Q, K, V are matrices of queries, keys, values (each token produces one)
- d_k is the key dimension (scales the dot product)
- The softmax produces a weighted average of value vectors
Multi-head means running h attention operations in parallel, each with its own learned projections, so different heads can capture different relationships:
MultiHead = Concat(head_1, ..., head_h) W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
Typical (original paper): h=8 heads, d_model=512 → each head works in d_k=64-dimensional space.
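The two formulas above can be sketched directly in numpy. This is a minimal, unbatched illustration (no masking, no dropout), with weight matrices passed in explicitly rather than learned:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(x, Wq, Wk, Wv, Wo, h):
    # Project, split d_model into h heads, attend per head, concat, project.
    n, d_model = x.shape
    d_k = d_model // h
    Q = (x @ Wq).reshape(n, h, d_k).transpose(1, 0, 2)  # (h, n, d_k)
    K = (x @ Wk).reshape(n, h, d_k).transpose(1, 0, 2)
    V = (x @ Wv).reshape(n, h, d_k).transpose(1, 0, 2)
    out = attention(Q, K, V)                            # (h, n, d_k)
    out = out.transpose(1, 0, 2).reshape(n, d_model)    # concat heads
    return out @ Wo
```

Each row of the softmaxed score matrix sums to 1, so every output position is a convex combination of the value vectors.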
2. Feed-Forward Network:
Two linear transformations with a ReLU in between:
FFN(x) = max(0, xW_1 + b_1) W_2 + b_2
Typically: d_model=512 → d_ff=2048 → d_model=512 (4x expansion)
The FFN is applied identically to each position — it’s position-wise. This is where the “thinking” happens — attention aggregates information, FFN transforms it.
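The position-wise property falls out of plain matrix multiplication: the same W_1, W_2 act on every row (position) of x. A minimal numpy sketch:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Position-wise feed-forward: max(0, xW1 + b1) W2 + b2.
    # x is (n_positions, d_model); identical weights at every position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2
```

With the typical 4x expansion, W1 is (512, 2048) and W2 is (2048, 512), so the FFN holds the majority of a layer's parameters.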
3. Residual Connections + LayerNorm:
Each sublayer is wrapped with a residual connection and layer normalization:
LayerNorm(x + Sublayer(x))
The skip connection lets gradients flow directly, enabling deeper networks.
Decoder Block
The decoder has three sublayers per block:
Input → [Masked Self-Attention] → [Add & Norm] →
[Cross-Attention (attends to encoder)] → [Add & Norm] →
[Feed-Forward] → [Add & Norm] → Output
Masked self-attention: Prevents attending to future tokens during training (triangular mask). This is what makes autoregressive generation possible — the model must predict each token without seeing the answer.
Cross-attention: Queries come from the decoder; keys and values come from the encoder output. This is how the decoder accesses the source context for translation/summarization.
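The triangular mask is usually implemented additively: set disallowed (future) positions to -inf in the score matrix before the softmax, which drives their attention weights to exactly zero. A sketch:

```python
import numpy as np

def causal_mask(n):
    # 0 where attending is allowed (j <= i), -inf above the diagonal.
    upper = np.triu(np.ones((n, n)), k=1)   # 1s strictly above the diagonal
    return np.where(upper == 1, -np.inf, 0.0)
```

Usage: add `causal_mask(n)` to the `QK^T / sqrt(d_k)` scores before the softmax; row i then distributes its attention only over positions 0..i.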
Positional Encoding
Attention is inherently position-agnostic — “the” at position 1 and “the” at position 5 are treated identically. Position information must be injected.
Original (Vaswani et al.): Sinusoidal encoding:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Each position gets a unique d_model-dimensional vector. The sinusoidal encoding was chosen because it makes relative positions easy to learn: PE(pos + k) is a linear function of PE(pos), since sin and cos of a sum expand into products of sines and cosines of the parts.
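The two formulas translate directly to code — even-numbered dimensions get sines, odd-numbered dimensions get cosines, with the frequency decreasing geometrically across dimension pairs:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dims: 0, 2, 4, ...
    angle = pos / (10000 ** (i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe
```

These vectors are simply added to the token embeddings at the model input.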
Modern alternatives:
- RoPE (Rotary Position Embedding) — LLaMA, PaLM: Encodes position as rotation in 2D space. Better for length extrapolation.
- ALiBi (Attention with Linear Biases) — Scales attention scores by distance. No learned position embeddings.
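Of the two, ALiBi is the simplest to sketch: a penalty linear in query-key distance is added to the attention scores before the softmax. This is a single-head illustration; real ALiBi uses a geometric sequence of slopes across heads (e.g. 1/2, 1/4, ..., an assumption worth checking against the paper for a given head count):

```python
import numpy as np

def alibi_bias(n, slope):
    # Additive bias for causal attention: penalize key j by how far
    # it lies behind query i, and forbid attending to the future.
    pos = np.arange(n)
    dist = pos[:, None] - pos[None, :]          # i - j
    bias = -slope * np.maximum(dist, 0).astype(float)
    bias[dist < 0] = -np.inf                    # causal mask built in
    return bias
```

Because the bias depends only on distance, it applies unchanged to sequences longer than any seen in training — the source of ALiBi's extrapolation behavior.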
Training
Pre-Training Objectives
| Model Type | Pre-training Objective | Task |
|---|---|---|
| BERT | Masked Language Modeling (MLM) | Fill in masked tokens (15% masked, 80% of those replaced with [MASK]) |
| GPT-2/3 | Next Token Prediction | Standard autoregressive LM |
| T5 | Span corruption | Replace random spans with sentinel tokens, predict the spans |
| BART | Denoising | Corrupt text various ways, reconstruct original |
The BERT-specific details:
- 15% of tokens masked
- 80% replaced with [MASK]
- 10% replaced with random token
- 10% unchanged
- Forces model to learn from context even when token is replaced
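The 15% / 80-10-10 procedure above is a per-token coin flip; a toy sketch (the vocabulary here is an illustrative stand-in, not BERT's real WordPiece vocab):

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["cat", "dog", "runs", "the", "fast"]  # hypothetical toy vocab

def mlm_corrupt(tokens, p_mask=0.15, seed=0):
    # Return (corrupted tokens, indices the MLM loss is computed on).
    rng = random.Random(seed)
    out, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < p_mask:
            targets.append(i)
            r = rng.random()
            if r < 0.8:
                out[i] = MASK                   # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = rng.choice(TOY_VOCAB)  # 10%: random token
            # else: 10% keep the token unchanged (loss still applies here)
    return out, targets
```

Note the loss is computed on all selected positions, including the 10% left unchanged — that is what forces the model to use context rather than trusting surface tokens.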
Critical Training Details
Learning rate schedule: Warm up for the first few thousand steps (4,000 in the original paper), increasing linearly from 0 to the peak LR, then decay. The warmup prevents instability early in training, when gradients are noisy and LayerNorm statistics are still settling.
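The schedule from the original paper combines linear warmup with inverse-square-root decay in one formula:

```python
def transformer_lr(step, d_model=512, warmup=4000):
    # lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    # Linear ramp for step < warmup, then 1/sqrt(step) decay; the two
    # branches meet exactly at step == warmup (the peak LR).
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

Many modern LLM recipes swap the decay for a cosine schedule but keep the warmup phase.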
Label smoothing: CrossEntropyLoss with epsilon=0.1 — instead of hard labels (0 or 1), use soft labels (0.9 for correct class, 0.1 spread across others). Prevents overconfidence.
Mixed precision: FP16/BF16 training cuts memory roughly in half and typically speeds up training on modern accelerators. Modern training favors BF16 (bfloat16), which has a wider dynamic range than FP16 and avoids FP16's loss-scaling workarounds.
Key Innovations Post-2017
Pre-LN Transformer
Move LayerNorm inside the residual branch:
LayerNorm(x + Sublayer(x)) → x + Sublayer(LayerNorm(x))
This is now standard — more stable training, easier to optimize. Original paper used Post-LN.
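The two placements differ by one line. A minimal sketch, using a bare LayerNorm without the learned gain/bias for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean, unit variance
    # (learned scale/shift omitted for brevity).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    return layer_norm(x + sublayer(x))   # original 2017 placement

def pre_ln_block(x, sublayer):
    return x + sublayer(layer_norm(x))   # now-standard placement
```

In the Pre-LN form, the identity path from input to output is never normalized away, which is the intuition behind its more stable gradients in deep stacks.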
Flash Attention (Dao et al., 2022-2023)
Standard attention is O(N²) memory — the N×N attention matrix for N tokens is the bottleneck. Flash Attention computes attention in tiles that fit in fast on-chip GPU SRAM, never materializing the full matrix in slower HBM.
Result: 2-4x speedup, 10-20x memory reduction. Enabled training on much longer sequences.
Key paper: Dao (2023) — “FlashAttention-2” — https://arxiv.org/abs/2307.08691
Mixture of Experts (MoE)
Instead of activating all parameters for every token, route each token to a subset of “expert” FFN networks:
- Sparse MoE: Only top-k experts activated per token (e.g., top-2 of 8)
- Example: Mixtral 8x7B = 8 experts, top-2 active per token → roughly 13B active parameters per token out of ~47B total
- Enables massive parameter counts without proportional compute cost
Key paper: Shazeer et al. (2017) — “Sparsely-Gated MoE” — https://arxiv.org/abs/1701.06538
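Top-k routing itself is small: a linear router scores the experts, the top k are selected, and their outputs are mixed by renormalized gate weights. A single-token sketch (real implementations batch this and add load-balancing losses):

```python
import numpy as np

def _softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x, experts, router_W, k=2):
    # experts: list of callables (the expert FFNs); router_W: (n_experts, d).
    logits = router_W @ x
    topk = np.argsort(logits)[-k:]       # indices of the k highest-scoring experts
    gates = _softmax(logits[topk])       # renormalize gates over the chosen k
    # Only the k selected experts are evaluated — the sparsity that saves compute.
    return sum(g * experts[i](x) for g, i in zip(gates, topk))
```

The per-token compute is that of k experts plus the router, regardless of the total expert count.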
Why Transformers Won
vs RNNs (LSTM, GRU)
| Aspect | RNN | Transformer |
|---|---|---|
| Long-range dependencies | Gradient decay, O(N) sequential | Direct attention, O(1) paths to any token |
| Parallelization | Must process sequentially | All tokens in parallel |
| Training | Hard to optimize deep stacks | Residual connections help |
| Inference | Fast (O(1) per token) | Slower (O(N) per token — attends over all previous tokens, even with a KV cache) |
The Transformer won primarily because of parallelization during training. Even though inference is O(N) per token, the ability to train on massive corpora in parallel dominated.
vs CNNs
CNNs grow their receptive field by a constant per layer — a stack of 3×3 convs needs O(N) layers (or O(log N) with pooling/dilation) before distant positions can interact. Transformers have O(1) attention to any position from layer 1. For long-range dependencies, this matters.
Variants
Encoder-Only (BERT Family)
- BERT (Devlin et al., 2018) — 110M-340M params, MLM pre-training
- RoBERTa — BERT with better training (more data, longer, no next sentence prediction)
- DeBERTa — Disentangled attention + enhanced mask decoder
- ALBERT — Parameter sharing across layers (smaller, not faster)
Decoder-Only (GPT Family)
- GPT-2 (Radford et al., 2019) — 1.5B params, demonstrated strong zero-shot task transfer
- GPT-3 (Brown et al., 2020) — 175B params, in-context learning
- LLaMA (Touvron et al., 2023) — Open weights, efficient training
- LLaMA 2/3 — instruction-tuned and RLHF-aligned chat variants at increasing scale
- Mistral (dense) / Mixtral (sparse MoE) — efficient open-weight variants
Encoder-Decoder
- T5 (Raffel et al., 2020) — Text-to-text unified framework
- BART (Lewis et al., 2020) — Denoising pre-training
- UL2 — Mixture of denoising objectives
Key Papers
- Vaswani et al. (2017) — “Attention Is All You Need” — https://arxiv.org/abs/1706.03762
- Devlin et al. (2019) — “BERT: Pre-training…” — https://arxiv.org/abs/1810.04805
- Radford et al. (2019) — “Language Models are Unsupervised Multitask Learners” (GPT-2) — https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- Brown et al. (2020) — “Language Models are Few-Shot Learners” (GPT-3) — https://arxiv.org/abs/2005.14165
- Dao (2023) — “FlashAttention-2” — https://arxiv.org/abs/2307.08691
- Shazeer et al. (2017) — “Sparsely-Gated MoE” — https://arxiv.org/abs/1701.06538
Links
- Attention Mechanism — The core operation inside transformers
- Embeddings — Input representation (including positional encoding)
- Language Models — How decoder-only transformers are trained
- BERT and Masked Language Models — Encoder-only approach
- RLHF and Alignment — How transformers are fine-tuned to be helpful
- Scaling Laws — How transformer performance scales with size
- Recurrent Neural Networks — What transformers replaced