Transformers

What

The Transformer architecture, introduced in “Attention Is All You Need” (Vaswani et al., 2017), replaced recurrent networks as the dominant architecture for sequence modeling. Its core innovation is the self-attention mechanism — every element of a sequence attends directly to every other element, regardless of distance.

The Transformer is not one architecture but a family:

Variant         | Processes Input                   | Generates Output          | Examples
Encoder-only    | Bidirectional (sees full context) | Fixed-size representation | BERT, RoBERTa, DeBERTa
Decoder-only    | Causal (left-to-right only)       | Autoregressive tokens     | GPT-2/3/4, LLaMA, Claude
Encoder-decoder | Bidirectional                     | Autoregressive tokens     | T5, BART, UL2

Architecture

Encoder Block

Each encoder layer has two sublayers:

Input → [Self-Attention] → [Add & Norm] → [Feed-Forward] → [Add & Norm] → Output

1. Multi-Head Self-Attention:

The attention operation maps queries (Q), keys (K), and values (V) to outputs:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Where:

  • Q, K, V are matrices of queries, keys, values (each token produces one)
  • d_k is the key dimension (scales the dot product)
  • The softmax produces a weighted average of value vectors

Multi-head attention runs several such attention operations in parallel, each head learning to attend to different aspects of the input:

MultiHead = Concat(head_1, ..., head_h) W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

Typical (original paper): h=8 heads, d_model=512 → each head works in a d_k=64-dimensional subspace.
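A minimal NumPy sketch of a single attention head, with random matrices standing in for the learned Q/K/V projections:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, for one head.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) dot-product similarities
    weights = softmax(scores)            # each query's weights sum to 1
    return weights @ V                   # weighted average of value rows

rng = np.random.default_rng(0)
n_tokens, d_k = 4, 64
Q = rng.standard_normal((n_tokens, d_k))
K = rng.standard_normal((n_tokens, d_k))
V = rng.standard_normal((n_tokens, d_k))
out = attention(Q, K, V)                 # (4, 64): one output row per query
```

Multi-head attention would run h such heads on projected inputs and concatenate the results.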

2. Feed-Forward Network:

Two linear transforms with ReLU:

FFN(x) = max(0, xW_1 + b_1) W_2 + b_2

Typically: d_model=512 → d_ff=2048 → d_model=512 (4x expansion)

The FFN is applied identically to each position — it’s position-wise. This is where the “thinking” happens — attention aggregates information, FFN transforms it.
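A minimal NumPy sketch of the position-wise FFN with the 512 → 2048 → 512 shapes above (random weights stand in for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ff = 512, 2048
W1 = rng.standard_normal((d_model, d_ff)) * 0.02   # random stand-ins for learned weights
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

def ffn(x):
    # x: (seq_len, d_model). The same weights are applied at every position,
    # so each row of x is transformed independently.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2  # ReLU between the two linears

x = rng.standard_normal((10, d_model))
y = ffn(x)                                         # (10, 512)
```

The position-wise property means `ffn(x)[i]` depends only on `x[i]`, never on neighboring positions.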

3. Residual Connections + LayerNorm:

Each sublayer is wrapped in a residual connection followed by layer normalization:

LayerNorm(x + Sublayer(x))

The skip connection lets gradients flow directly, enabling deeper networks.

Decoder Block

The decoder has three sublayers per block:

Input → [Masked Self-Attention] → [Add & Norm] → 
        [Cross-Attention (attends to encoder)] → [Add & Norm] →
        [Feed-Forward] → [Add & Norm] → Output

Masked self-attention: Prevents attending to future tokens during training (triangular mask). This is what makes autoregressive generation possible — the model must predict each token without seeing the answer.
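The triangular mask can be built in NumPy by setting future-position scores to -inf before the softmax (random scores stand in for real attention logits):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 5
rng = np.random.default_rng(0)
scores = rng.standard_normal((n, n))           # raw attention logits
# Strictly-upper-triangular entries correspond to future positions.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf                         # exp(-inf) = 0: zero attention weight
weights = softmax(scores)                      # row i attends only to positions 0..i
```

Row 0 can only attend to itself, so its entire weight lands on position 0; each later row spreads weight over its prefix.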

Cross-attention: Queries come from the decoder; keys and values come from the encoder output. This is how the decoder accesses the source context for translation/summarization.

Positional Encoding

Attention is inherently position-agnostic — “the” at position 1 and “the” at position 5 are treated identically. Position information must be injected.

Original (Vaswani et al.): Sinusoidal encoding:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Each position gets a unique d_model-dimensional vector. The sinusoidal form was chosen because it lets the model attend to relative positions: PE(pos+k) is a linear function of PE(pos), via the angle-addition identities for sin and cos.
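A direct NumPy transcription of the two formulas:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    # pos varies down the rows, dimension index i across the columns.
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]           # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)    # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                     # even dims get sin
    pe[:, 1::2] = np.cos(angle)                     # odd dims get cos
    return pe

pe = sinusoidal_pe(50, 512)                         # one row per position
```

Every entry lies in [-1, 1], so the encoding can simply be added to the token embeddings.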

Modern alternatives:

  • RoPE (Rotary Position Embedding) — LLaMA, PaLM: Encodes position as rotation in 2D space. Better for length extrapolation.
  • ALiBi (Attention with Linear Biases) — Scales attention scores by distance. No learned position embeddings.

Training

Pre-Training Objectives

Model Type | Pre-training Objective         | Task
BERT       | Masked Language Modeling (MLM) | Fill in masked tokens (15% masked, 80% of those replaced with [MASK])
GPT-2/3    | Next Token Prediction          | Standard autoregressive LM
T5         | Span corruption                | Replace random spans with sentinel tokens, predict the spans
BART       | Denoising                      | Corrupt text in various ways, reconstruct the original

The BERT-specific details:

  • 15% of tokens masked
  • 80% replaced with [MASK]
  • 10% replaced with random token
  • 10% unchanged
  • Forces model to learn from context even when token is replaced
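The selection-and-corruption recipe can be sketched as follows (`bert_mask` and its signature are illustrative, not taken from any BERT codebase):

```python
import random

def bert_mask(tokens, vocab, mask_token="[MASK]", p_select=0.15, seed=0):
    # Illustrative sketch of BERT's corruption recipe.
    # Each token is selected with probability p_select; a selected token
    # becomes [MASK] 80% of the time, a random vocab token 10% of the time,
    # and is left unchanged the remaining 10%.
    rng = random.Random(seed)
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < p_select:
            targets[i] = tok              # the model must recover the original
            r = rng.random()
            if r < 0.8:
                out[i] = mask_token
            elif r < 0.9:
                out[i] = rng.choice(vocab)
            # else: leave the token unchanged (last 10%)
    return out, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, targets = bert_mask(tokens, vocab=sorted(set(tokens)))
```

The loss is computed only at the positions recorded in `targets`.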

Critical Training Details

Learning rate schedule: Warm up over the first few thousand steps (linearly increase from 0 to the peak LR), then decay. Warmup prevents instability early in training, while gradient statistics are still noisy.
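Schedules vary between codebases; as one concrete instance, the original paper's inverse-square-root schedule with linear warmup (warmup_steps=4000 in Vaswani et al.) looks like:

```python
def transformer_lr(step, d_model=512, warmup=4000):
    # Linear warmup to the peak LR, then inverse-square-root decay.
    # The two terms cross exactly at step == warmup.
    step = max(step, 1)                  # avoid step=0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

peak = transformer_lr(4000)              # the maximum of the schedule
```

Modern LLM training usually swaps the decay for a cosine schedule, but keeps the warmup phase.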

Label smoothing: Cross-entropy with epsilon=0.1: instead of a hard one-hot target, use soft targets (0.9 for the correct class, the remaining 0.1 spread across the other classes). Prevents overconfident predictions.
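A sketch of the smoothing itself (conventions differ: some implementations spread epsilon over all classes including the correct one; this follows the 0.9 / 0.1-over-the-rest description above):

```python
import numpy as np

def smooth_labels(target_idx, num_classes, eps=0.1):
    # Correct class gets 1 - eps; the remaining eps is spread
    # uniformly over the other num_classes - 1 classes.
    y = np.full(num_classes, eps / (num_classes - 1))
    y[target_idx] = 1.0 - eps
    return y

y = smooth_labels(target_idx=3, num_classes=10)   # [0.0111..., ..., 0.9, ...]
```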

Mixed precision: FP16/BF16 training roughly halves activation memory and is typically faster on modern accelerators, not slower. Modern training favors BF16 (brain float), which keeps FP32's exponent range and so avoids the loss-scaling tricks FP16 needs.

Key Innovations Post-2017

Pre-LN Transformer

Move the LayerNorm from after the residual addition to the input of each sublayer, inside the residual branch:

LayerNorm(x + Sublayer(x)) → x + Sublayer(LayerNorm(x))

This is now standard — more stable training, easier to optimize. Original paper used Post-LN.
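The difference is easiest to see side by side; a NumPy sketch, with a toy scaling function standing in for the attention/FFN sublayer:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's features to zero mean, (near-)unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln(x, sublayer):
    return layer_norm(x + sublayer(x))   # original 2017 placement

def pre_ln(x, sublayer):
    return x + sublayer(layer_norm(x))   # modern placement: residual path is never normalized

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))
f = lambda z: 0.5 * z                    # toy stand-in for attention/FFN
```

In Pre-LN, the identity path from input to output is never touched by normalization, which is why gradients flow more stably through deep stacks.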

Flash Attention (Dao et al., 2022-2023)

Standard attention is O(N²) memory — the attention matrix (N×N) for N tokens is the bottleneck. Flash Attention computes attention in tiles that fit in GPU SRAM, streaming through HBM.

Result: 2-4x speedup, 10-20x memory reduction. Enabled training on much longer sequences.

Key paper: Dao (2023) — “FlashAttention-2” — https://arxiv.org/abs/2307.08691

Mixture of Experts (MoE)

Instead of activating all parameters for every token, route each token to a subset of “expert” FFN networks:

  • Sparse MoE: Only top-k experts activated per token (e.g., top-2 of 8)
  • Example: Mixtral 8x7B = 8 experts, 2 active per token → ~13B active parameters out of ~47B total
  • Enables massive parameter counts without proportional compute cost
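A toy sketch of top-k routing for a single token (function and variable names are illustrative; real MoE layers batch tokens across experts and add load-balancing losses):

```python
import numpy as np

def moe_layer(x, experts, gate_W, k=2):
    # Sparse MoE routing for one token x of shape (d_model,): pick the
    # top-k experts by gate score and mix their outputs with
    # softmax-renormalized gate weights. Only k experts run.
    logits = gate_W @ x                          # (n_experts,) gate scores
    top = np.argsort(logits)[-k:]                # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                 # renormalize over chosen experts
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

# Toy demo: each "expert" is a scaling function standing in for an FFN.
rng = np.random.default_rng(0)
d_model, n_experts = 16, 8
experts = [lambda x, s=s: s * x for s in range(1, n_experts + 1)]
gate_W = rng.standard_normal((n_experts, d_model))
token = rng.standard_normal(d_model)
out = moe_layer(token, experts, gate_W, k=2)     # only 2 of 8 experts execute
```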

Key paper: Shazeer et al. (2017) — “Sparsely-Gated MoE” — https://arxiv.org/abs/1701.06538

Why Transformers Won

vs RNNs (LSTM, GRU)

Aspect                  | RNN                              | Transformer
Long-range dependencies | Gradient decay, O(N) sequential  | Direct attention, O(1) path to any token
Parallelization         | Must process tokens sequentially | All tokens in parallel
Training                | Deep stacks hard to optimize     | Residual connections help
Inference               | Fast (O(1) per token)            | Slower (attends over all previous tokens)

The Transformer won primarily because of parallelization during training. Even though inference is O(N) per token, the ability to train on massive corpora in parallel dominated.

vs CNNs

CNN receptive fields grow by a constant amount per layer: a stack of 3×3 convolutions needs O(N) layers to cover an N-wide input (O(log N) with dilation or striding). In a Transformer, any position can attend to any other from layer 1. For long-range dependencies, this matters.

Variants

Encoder-Only (BERT Family)

  • BERT (Devlin et al., 2018) — 110M-340M params, MLM pre-training
  • RoBERTa — BERT with better training (more data, longer, no next sentence prediction)
  • DeBERTa — Disentangled attention + enhanced mask decoder
  • ALBERT — Parameter sharing across layers (smaller, not faster)

Decoder-Only (GPT Family)

  • GPT-2 (Radford et al., 2019) — 1.5B params, demonstrated zero-shot task transfer at scale
  • GPT-3 (Brown et al., 2020) — 175B params, in-context learning
  • LLaMA (Touvron et al., 2023) — Open weights, efficient training
  • LLaMA 2/3 — RLHF-aligned chat variants (Llama 2 up to 70B)
  • Mistral/Mixtral — Mixture of experts variants

Encoder-Decoder

  • T5 (Raffel et al., 2020) — Text-to-text unified framework
  • BART (Lewis et al., 2020) — Denoising pre-training
  • UL2 — Mixture of denoising objectives

Key Papers