Attention Mechanism

What

A way for a model to focus on the most relevant parts of the input when producing each part of the output. “Pay attention to what matters.”

Scaled Dot-Product Attention

Attention(Q, K, V) = softmax(Q × Kᵀ / √d_k) × V
  • Query (Q): “what am I looking for?”
  • Key (K): “what do I contain?”
  • Value (V): “what information do I provide?”
  • Q × Kᵀ: how relevant each key is to each query (dot product as a similarity score)
  • / √d_k: scaling keeps the dot products from growing with the key dimension, which would otherwise push softmax into a saturated, low-gradient regime
  • softmax: normalizes each query's scores into attention weights (sum to 1 across keys)
  • × V: weighted sum of the values using those attention weights
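The formula above can be sketched directly in PyTorch. This is a minimal illustration, not the optimized library kernel; the function name and tensor shapes are made up for the example.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k) — toy shapes for illustration
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (batch, seq_q, seq_k)
    weights = F.softmax(scores, dim=-1)            # each query row sums to 1
    return weights @ v, weights

q = torch.randn(1, 4, 8)
k = torch.randn(1, 4, 8)
v = torch.randn(1, 4, 8)
out, w = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 4, 8])
```

Recent PyTorch versions also ship this as a built-in, `torch.nn.functional.scaled_dot_product_attention`.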

Self-attention

When Q, K, V all come from the same sequence → the sequence attends to itself. Each token looks at all other tokens to decide what’s important.
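A sketch of self-attention under the definitions above: Q, K, and V are all projections of the same sequence x. The random projection matrices stand in for learned weights.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 5, 16)   # (batch, seq_len, d_model): one sequence of 5 tokens
w_q = torch.randn(16, 16)   # random stand-ins for learned projection weights
w_k = torch.randn(16, 16)
w_v = torch.randn(16, 16)

q, k, v = x @ w_q, x @ w_k, x @ w_v            # all derived from the same x
scores = q @ k.transpose(-2, -1) / 16 ** 0.5
weights = F.softmax(scores, dim=-1)            # (1, 5, 5): token-to-token weights
out = weights @ v                              # same shape as x: (1, 5, 16)
```

The (5, 5) weight matrix is the "attends to itself" part: row i says how much token i draws from each token in the sequence, including itself.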

Multi-head attention

Run attention multiple times with different learned projections. Each “head” can attend to different types of relationships (syntax, semantics, position).

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 512)      # (batch, seq_len, embed_dim)
output, weights = attn(x, x, x)  # self-attention: same tensor as Q, K, V
# output: (2, 10, 512); weights: (2, 10, 10), averaged over the 8 heads

Why attention is revolutionary

  • Parallel: all positions are computed at once (unlike RNNs, which process tokens sequentially)
  • Global: every token can attend to every other token directly, regardless of distance
  • Interpretable: attention weights give a rough signal of what the model focuses on