Attention Is All You Need
Vaswani et al. (2017)
Why It Matters
Introduced the Transformer architecture. Foundation of all modern LLMs. 173K+ citations.
Key Ideas
- Replace recurrence with self-attention so every token can attend to every other token directly. This shortens the path length between distant dependencies and parallelizes training much better than RNNs.
- Use multi-head attention so the model can learn different relation types in parallel instead of forcing one representation to capture everything.
- Add positional encodings because attention alone is permutation-invariant. Sequence order has to be injected explicitly.
- Combine attention with residual connections, layer normalization, and feed-forward blocks into a stackable architecture that scales cleanly.
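The core mechanism behind these ideas can be sketched as scaled dot-product attention. This is a minimal NumPy sketch, not the paper's reference implementation; shapes, names, and the single-head setup are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq, d_k); V: (seq, d_v).
    d_k = Q.shape[-1]
    # Every token scores every other token: (seq, seq) matrix,
    # scaled by 1/sqrt(d_k) as in the paper.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

# Toy inputs: 4 tokens, d_k = d_v = 8 (hypothetical sizes).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
```

Multi-head attention amounts to running several such maps in parallel on learned linear projections of Q, K, V and concatenating the results.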
Notes
- The paper introduced both the encoder-decoder Transformer for sequence transduction and the scaled dot-product attention formulation.
- Scaling by 1/sqrt(d_k) keeps large dot products from saturating the softmax when key/query dimensionality grows.
- The original architecture still has quadratic attention cost in sequence length, which later work tried to reduce. Even so, the paper changed the field because the parallelism and modeling power were so much better.
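The effect of the 1/sqrt(d_k) factor can be seen numerically: dot products of independent unit-variance vectors have variance that grows with d_k, so unscaled logits become large and push the softmax into saturation. A small sketch under those assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512  # illustrative key/query dimensionality
q = rng.normal(size=(100, d_k))
k = rng.normal(size=(100, d_k))

raw = (q * k).sum(axis=-1)       # unscaled dot products, std roughly sqrt(d_k)
scaled = raw / np.sqrt(d_k)      # std roughly 1 after scaling

print(raw.std(), scaled.std())
```

With logits of standard deviation ~22 instead of ~1, one softmax entry dominates and gradients through the others vanish, which is the saturation the scaling avoids.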