Attention Mechanism

What

A way for a model to focus on the most relevant parts of the input when producing each part of the output. “Pay attention to what matters.”

Scaled Dot-Product Attention

Attention(Q, K, V) = softmax(Q × Kᵀ / √d_k) × V
  • Query (Q): “what am I looking for?”
  • Key (K): “what do I contain?”
  • Value (V): “what information do I provide?”
  • Q × Kᵀ: how relevant each key is to each query (dot product as a similarity score)
  • / √d_k: scaling keeps the dot products from growing with the key dimension, which would otherwise push softmax into a saturated, low-gradient regime
  • softmax: normalizes each query's scores into attention weights (sum to 1 across keys)
  • × V: weighted sum of the values using those attention weights
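The formula above can be sketched directly in PyTorch. This is a minimal illustration, not the optimized library kernel; the function name and tensor shapes are made up for the example.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k) — toy shapes for illustration
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (batch, seq_q, seq_k)
    weights = F.softmax(scores, dim=-1)            # each query row sums to 1
    return weights @ v, weights

q = torch.randn(1, 4, 8)
k = torch.randn(1, 4, 8)
v = torch.randn(1, 4, 8)
out, w = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 4, 8])
```

Recent PyTorch versions also ship this as a built-in, `torch.nn.functional.scaled_dot_product_attention`.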

Self-attention

When Q, K, V all come from the same sequence → the sequence attends to itself. Each token looks at all other tokens to decide what’s important.
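A sketch of self-attention under the definitions above: Q, K, and V are all projections of the same sequence x. The random projection matrices stand in for learned weights.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 5, 16)   # (batch, seq_len, d_model): one sequence of 5 tokens
w_q = torch.randn(16, 16)   # random stand-ins for learned projection weights
w_k = torch.randn(16, 16)
w_v = torch.randn(16, 16)

q, k, v = x @ w_q, x @ w_k, x @ w_v            # all derived from the same x
scores = q @ k.transpose(-2, -1) / 16 ** 0.5
weights = F.softmax(scores, dim=-1)            # (1, 5, 5): token-to-token weights
out = weights @ v                              # same shape as x: (1, 5, 16)
```

The (5, 5) weight matrix is the "attends to itself" part: row i says how much token i draws from each token in the sequence, including itself.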

Multi-head attention

Run attention multiple times with different learned projections. Each “head” can attend to different types of relationships (syntax, semantics, position).

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 512)      # (batch, seq_len, embed_dim)
output, weights = attn(x, x, x)  # self-attention: same tensor as Q, K, V
# output: (2, 10, 512); weights: (2, 10, 10), averaged over the 8 heads

Why attention is revolutionary

  • Parallel: all positions are computed at once (unlike RNNs, which process tokens sequentially)
  • Global: every token can attend to every other token directly, regardless of distance
  • Interpretable: attention weights give a rough signal of what the model focuses on