Transformers

What

The Transformer architecture, introduced in “Attention Is All You Need” (Vaswani et al., 2017), replaced recurrent networks as the dominant architecture for sequence modeling. Its core innovation is the self-attention mechanism — every element of a sequence attends directly to every other element, regardless of distance.

The Transformer is not one architecture but a family:

Variant         | Processes Input                   | Generates Output          | Examples
Encoder-only    | Bidirectional (sees full context) | Fixed-size representation | BERT, RoBERTa, DeBERTa
Decoder-only    | Causal (left-to-right only)       | Autoregressive tokens     | GPT-2/3/4, LLaMA, Claude
Encoder-decoder | Bidirectional                     | Autoregressive tokens     | T5, BART, UL2

Architecture

Encoder Block

Each encoder layer has two sublayers:

Input → [Self-Attention] → [Add & Norm] → [Feed-Forward] → [Add & Norm] → Output

1. Multi-Head Self-Attention:

The attention operation maps queries (Q), keys (K), and values (V) to outputs:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Where:

  • Q, K, V are matrices of queries, keys, values (each token produces one)
  • d_k is the key dimension (scales the dot product)
  • The softmax produces a weighted average of value vectors

Multi-head attention runs several such attention operations in parallel, each head learning to attend to different aspects of the input:

MultiHead = Concat(head_1, ..., head_h) W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

Typical (original paper): h=8 heads, d_model=512 → each head works in a d_k=64-dimensional subspace.
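A minimal NumPy sketch of a single attention head, with random matrices standing in for the learned Q/K/V projections:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, for one head.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) dot-product similarities
    weights = softmax(scores)            # each query's weights sum to 1
    return weights @ V                   # weighted average of value rows

rng = np.random.default_rng(0)
n_tokens, d_k = 4, 64
Q = rng.standard_normal((n_tokens, d_k))
K = rng.standard_normal((n_tokens, d_k))
V = rng.standard_normal((n_tokens, d_k))
out = attention(Q, K, V)                 # (4, 64): one output row per query
```

Multi-head attention would run h such heads on projected inputs and concatenate the results.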

2. Feed-Forward Network:

Two linear transforms with ReLU:

FFN(x) = max(0, xW_1 + b_1) W_2 + b_2

Typically: d_model=512 → d_ff=2048 → d_model=512 (4x expansion)

The FFN is applied identically to each position — it’s position-wise. This is where the “thinking” happens — attention aggregates information, FFN transforms it.
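A minimal NumPy sketch of the position-wise FFN with the 512 → 2048 → 512 shapes above (random weights stand in for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ff = 512, 2048
W1 = rng.standard_normal((d_model, d_ff)) * 0.02   # random stand-ins for learned weights
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

def ffn(x):
    # x: (seq_len, d_model). The same weights are applied at every position,
    # so each row of x is transformed independently.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2  # ReLU between the two linears

x = rng.standard_normal((10, d_model))
y = ffn(x)                                         # (10, 512)
```

The position-wise property means `ffn(x)[i]` depends only on `x[i]`, never on neighboring positions.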

3. Residual Connections + LayerNorm:

Each sublayer is wrapped in a residual connection followed by layer normalization:

LayerNorm(x + Sublayer(x))

The skip connection lets gradients flow directly, enabling deeper networks.

Decoder Block

The decoder has three sublayers per block:

Input → [Masked Self-Attention] → [Add & Norm] → 
        [Cross-Attention (attends to encoder)] → [Add & Norm] →
        [Feed-Forward] → [Add & Norm] → Output

Masked self-attention: Prevents attending to future tokens during training (triangular mask). This is what makes autoregressive generation possible — the model must predict each token without seeing the answer.
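The triangular mask can be built in NumPy by setting future-position scores to -inf before the softmax (random scores stand in for real attention logits):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 5
rng = np.random.default_rng(0)
scores = rng.standard_normal((n, n))           # raw attention logits
# Strictly-upper-triangular entries correspond to future positions.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf                         # exp(-inf) = 0: zero attention weight
weights = softmax(scores)                      # row i attends only to positions 0..i
```

Row 0 can only attend to itself, so its entire weight lands on position 0; each later row spreads weight over its prefix.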

Cross-attention: Queries come from the decoder; keys and values come from the encoder output. This is how the decoder accesses the source context for translation/summarization.

Positional Encoding

Attention is inherently position-agnostic — “the” at position 1 and “the” at position 5 are treated identically. Position information must be injected.

Original (Vaswani et al.): Sinusoidal encoding:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Each position gets a unique d_model-dimensional vector. The sinusoidal form was chosen because it lets the model attend to relative positions: PE(pos+k) is a linear function of PE(pos), via the angle-addition identities for sin and cos.
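A direct NumPy transcription of the two formulas:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    # pos varies down the rows, dimension index i across the columns.
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]           # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)    # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                     # even dims get sin
    pe[:, 1::2] = np.cos(angle)                     # odd dims get cos
    return pe

pe = sinusoidal_pe(50, 512)                         # one row per position
```

Every entry lies in [-1, 1], so the encoding can simply be added to the token embeddings.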

Modern alternatives:

  • RoPE (Rotary Position Embedding) — LLaMA, PaLM: Encodes position as rotation in 2D space. Better for length extrapolation.
  • ALiBi (Attention with Linear Biases) — Scales attention scores by distance. No learned position embeddings.

Training

Pre-Training Objectives

Model Type | Pre-training Objective         | Task
BERT       | Masked Language Modeling (MLM) | Fill in masked tokens (15% masked, 80% of those replaced with [MASK])
GPT-2/3    | Next Token Prediction          | Standard autoregressive LM
T5         | Span corruption                | Replace random spans with sentinel tokens, predict the spans
BART       | Denoising                      | Corrupt text in various ways, reconstruct the original

The BERT-specific details:

  • 15% of tokens masked
  • 80% replaced with [MASK]
  • 10% replaced with random token
  • 10% unchanged
  • Forces model to learn from context even when token is replaced
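The selection-and-corruption recipe can be sketched as follows (`bert_mask` and its signature are illustrative, not taken from any BERT codebase):

```python
import random

def bert_mask(tokens, vocab, mask_token="[MASK]", p_select=0.15, seed=0):
    # Illustrative sketch of BERT's corruption recipe.
    # Each token is selected with probability p_select; a selected token
    # becomes [MASK] 80% of the time, a random vocab token 10% of the time,
    # and is left unchanged the remaining 10%.
    rng = random.Random(seed)
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < p_select:
            targets[i] = tok              # the model must recover the original
            r = rng.random()
            if r < 0.8:
                out[i] = mask_token
            elif r < 0.9:
                out[i] = rng.choice(vocab)
            # else: leave the token unchanged (last 10%)
    return out, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, targets = bert_mask(tokens, vocab=sorted(set(tokens)))
```

The loss is computed only at the positions recorded in `targets`.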

Critical Training Details

Learning rate schedule: Warm up over the first few thousand steps (linearly increase from 0 to the peak LR), then decay. Warmup prevents instability early in training, while gradient statistics are still noisy.
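Schedules vary between codebases; as one concrete instance, the original paper's inverse-square-root schedule with linear warmup (warmup_steps=4000 in Vaswani et al.) looks like:

```python
def transformer_lr(step, d_model=512, warmup=4000):
    # Linear warmup to the peak LR, then inverse-square-root decay.
    # The two terms cross exactly at step == warmup.
    step = max(step, 1)                  # avoid step=0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

peak = transformer_lr(4000)              # the maximum of the schedule
```

Modern LLM training usually swaps the decay for a cosine schedule, but keeps the warmup phase.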

Label smoothing: Cross-entropy with epsilon=0.1: instead of a hard one-hot target, use soft targets (0.9 for the correct class, the remaining 0.1 spread across the other classes). Prevents overconfident predictions.
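A sketch of the smoothing itself (conventions differ: some implementations spread epsilon over all classes including the correct one; this follows the 0.9 / 0.1-over-the-rest description above):

```python
import numpy as np

def smooth_labels(target_idx, num_classes, eps=0.1):
    # Correct class gets 1 - eps; the remaining eps is spread
    # uniformly over the other num_classes - 1 classes.
    y = np.full(num_classes, eps / (num_classes - 1))
    y[target_idx] = 1.0 - eps
    return y

y = smooth_labels(target_idx=3, num_classes=10)   # [0.0111..., ..., 0.9, ...]
```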

Mixed precision: FP16/BF16 training roughly halves activation memory and is typically faster on modern accelerators, not slower. Modern training favors BF16 (brain float), which keeps FP32's exponent range and so avoids the loss-scaling tricks FP16 needs.

Key Innovations Post-2017

Pre-LN Transformer

Move the LayerNorm from after the residual addition to the input of each sublayer, inside the residual branch:

LayerNorm(x + Sublayer(x)) → x + Sublayer(LayerNorm(x))

This is now standard — more stable training, easier to optimize. Original paper used Post-LN.
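The difference is easiest to see side by side; a NumPy sketch, with a toy scaling function standing in for the attention/FFN sublayer:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's features to zero mean, (near-)unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln(x, sublayer):
    return layer_norm(x + sublayer(x))   # original 2017 placement

def pre_ln(x, sublayer):
    return x + sublayer(layer_norm(x))   # modern placement: residual path is never normalized

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))
f = lambda z: 0.5 * z                    # toy stand-in for attention/FFN
```

In Pre-LN, the identity path from input to output is never touched by normalization, which is why gradients flow more stably through deep stacks.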

Flash Attention (Dao et al., 2022-2023)

Standard attention is O(N²) memory — the attention matrix (N×N) for N tokens is the bottleneck. Flash Attention computes attention in tiles that fit in GPU SRAM, streaming through HBM.

Result: 2-4x speedup, 10-20x memory reduction. Enabled training on much longer sequences.

Key paper: Dao (2023) — “FlashAttention-2” — https://arxiv.org/abs/2307.08691

Mixture of Experts (MoE)

Instead of activating all parameters for every token, route each token to a subset of “expert” FFN networks:

  • Sparse MoE: Only top-k experts activated per token (e.g., top-2 of 8)
  • Example: Mixtral 8x7B = 8 experts, 2 active per token → ~13B active parameters out of ~47B total
  • Enables massive parameter counts without proportional compute cost
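A toy sketch of top-k routing for a single token (function and variable names are illustrative; real MoE layers batch tokens across experts and add load-balancing losses):

```python
import numpy as np

def moe_layer(x, experts, gate_W, k=2):
    # Sparse MoE routing for one token x of shape (d_model,): pick the
    # top-k experts by gate score and mix their outputs with
    # softmax-renormalized gate weights. Only k experts run.
    logits = gate_W @ x                          # (n_experts,) gate scores
    top = np.argsort(logits)[-k:]                # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                 # renormalize over chosen experts
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

# Toy demo: each "expert" is a scaling function standing in for an FFN.
rng = np.random.default_rng(0)
d_model, n_experts = 16, 8
experts = [lambda x, s=s: s * x for s in range(1, n_experts + 1)]
gate_W = rng.standard_normal((n_experts, d_model))
token = rng.standard_normal(d_model)
out = moe_layer(token, experts, gate_W, k=2)     # only 2 of 8 experts execute
```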

Key paper: Shazeer et al. (2017) — “Sparsely-Gated MoE” — https://arxiv.org/abs/1701.06538

Why Transformers Won

vs RNNs (LSTM, GRU)

Aspect                  | RNN                              | Transformer
Long-range dependencies | Gradient decay, O(N) sequential  | Direct attention, O(1) path to any token
Parallelization         | Must process tokens sequentially | All tokens in parallel
Training                | Deep stacks hard to optimize     | Residual connections help
Inference               | Fast (O(1) per token)            | Slower (attends over all previous tokens)

The Transformer won primarily because of parallelization during training. Even though inference is O(N) per token, the ability to train on massive corpora in parallel dominated.

vs CNNs

CNN receptive fields grow by a constant amount per layer: a stack of 3×3 convolutions needs O(N) layers to cover an N-wide input (O(log N) with dilation or striding). In a Transformer, any position can attend to any other from layer 1. For long-range dependencies, this matters.

Variants

Encoder-Only (BERT Family)

  • BERT (Devlin et al., 2018) — 110M-340M params, MLM pre-training
  • RoBERTa — BERT with better training (more data, longer, no next sentence prediction)
  • DeBERTa — Disentangled attention + enhanced mask decoder
  • ALBERT — Parameter sharing across layers (smaller, not faster)

Decoder-Only (GPT Family)

  • GPT-2 (Radford et al., 2019) — 1.5B params, demonstrated zero-shot task transfer at scale
  • GPT-3 (Brown et al., 2020) — 175B params, in-context learning
  • LLaMA (Touvron et al., 2023) — Open weights, efficient training
  • LLaMA 2/3 — RLHF-aligned chat variants (Llama 2 up to 70B)
  • Mistral/Mixtral — Mixture of experts variants

Encoder-Decoder

  • T5 (Raffel et al., 2020) — Text-to-text unified framework
  • BART (Lewis et al., 2020) — Denoising pre-training
  • UL2 — Mixture of denoising objectives

Key Papers