Attention Mechanism
What
A way for a model to focus on the most relevant parts of the input when producing each part of the output. “Pay attention to what matters.”
Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(Q × Kᵀ / √d_k) × V
where d_k is the key dimension; dividing by √d_k keeps the dot products from growing large and saturating the softmax.
- Query (Q): “what am I looking for?”
- Key (K): “what do I contain?”
- Value (V): “what information do I provide?”
- Q × Kᵀ: how relevant is each key to each query (Dot Product as similarity)
- softmax: normalize to attention weights (sum to 1)
- × V: weighted sum of values based on attention weights
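The steps above can be sketched directly in PyTorch. A minimal single-head version, without batching or masking, just to make the formula concrete:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) — single head, no batch dim, for illustration
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5  # relevance of each key to each query
    weights = F.softmax(scores, dim=-1)          # each row sums to 1
    return weights @ V, weights                  # weighted sum of values

Q, K, V = torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 8)
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of `w` is a probability distribution over the keys, so the output for each query is a convex combination of the value vectors.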
Self-attention
When Q, K, V all come from the same sequence → the sequence attends to itself. Each token looks at all other tokens to decide what’s important.
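In code, self-attention just means deriving Q, K, and V from the same input via learned projections. A sketch (the projection matrices here are random stand-ins for learned weights):

```python
import torch

x = torch.randn(5, 16)  # one sequence: 5 tokens, embedding dim 16
# Q, K, V are all projections of the same sequence x
Wq, Wk, Wv = (torch.randn(16, 16) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv
weights = torch.softmax(Q @ K.T / 16**0.5, dim=-1)
out = weights @ V  # each token becomes a weighted mix of all tokens
```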
Multi-head attention
Run attention multiple times with different learned projections. Each “head” can attend to different types of relationships (syntax, semantics, position).
import torch.nn as nn
# expects (seq_len, batch, embed_dim) by default; pass batch_first=True for batch-first inputs
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8)
output, weights = attn(query, key, value)  # weights are averaged over heads by default
Why attention is revolutionary
- Parallel: all positions computed at once (unlike RNNs)
- Global: every token can attend to every other token directly
- Interpretable: attention weights show what the model focuses on
Links
- Transformers — built entirely on attention
- Dot Product — the core similarity operation
- Recurrent Neural Networks — what attention replaced