Attention Mechanism from Scratch
Goal: Build scaled dot-product attention, multi-head attention, and a minimal self-attention layer from scratch. Understand why it works.
Prerequisites: Attention Mechanism, Transformers, Dot Product, Matrix Multiplication
Why Attention?
RNNs process sequences one token at a time — information from early tokens gets diluted. Attention lets every token look at every other token directly. It is also massively parallelizable: all pairwise interactions are computed at once as matrix multiplications.
Scaled Dot-Product Attention
- Q (Query): “What am I looking for?”
- K (Key): “What do I contain?”
- V (Value): “What information do I provide?”
import numpy as np
import matplotlib.pyplot as plt
def softmax(x, axis=-1):
e = np.exp(x - x.max(axis=axis, keepdims=True))
return e / e.sum(axis=axis, keepdims=True)
def scaled_dot_product_attention(Q, K, V, mask=None):
"""
Q: (seq_len_q, d_k)
K: (seq_len_k, d_k)
V: (seq_len_k, d_v)
Returns: (seq_len_q, d_v), attention_weights (seq_len_q, seq_len_k)
"""
d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k) # (seq_len_q, seq_len_k)
if mask is not None:
scores = np.where(mask, scores, -1e9)
weights = softmax(scores) # (seq_len_q, seq_len_k)
output = weights @ V # (seq_len_q, d_v)
    return output, weights

Why scale by √d_k?
Without scaling, the variance of the dot products grows with d_k, so their typical magnitude grows with √d_k, pushing softmax into extreme values (near 0 or 1) where gradients vanish. Dividing by √d_k keeps the scores at unit scale and the gradients healthy.
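A quick standalone check (standard-normal queries and keys, a toy setup separate from the example below) makes the growth visible:

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in [4, 64, 1024]:
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    dots = (q * k).sum(axis=1)  # 10,000 independent dot products
    print(f"d_k={d_k:4d}  std(q·k)={dots.std():6.1f}  "
          f"std(q·k/sqrt(d_k))={(dots / np.sqrt(d_k)).std():.2f}")
```

The unscaled standard deviation tracks √d_k (roughly 2, 8, 32), while the scaled version stays near 1 regardless of dimension.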
Worked Example
Imagine 4 words: [“The”, “cat”, “sat”, “down”]. Each has an 8-dimensional embedding.
# Fake embeddings for 4 tokens, d_model=8
np.random.seed(42)
seq_len, d_model = 4, 8
X = np.random.randn(seq_len, d_model)
# In self-attention, Q/K/V come from the same input via linear projections
d_k = d_v = 4
W_Q = np.random.randn(d_model, d_k) * 0.5
W_K = np.random.randn(d_model, d_k) * 0.5
W_V = np.random.randn(d_model, d_v) * 0.5
Q = X @ W_Q # (4, 4)
K = X @ W_K # (4, 4)
V = X @ W_V # (4, 4)
output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Input shape: {X.shape}")
print(f"Output shape: {output.shape}")
print(f"Weights shape: {weights.shape}")
print(f"\nAttention weights (each row sums to 1):")
print(weights.round(3))

Visualize Attention Weights
tokens = ["The", "cat", "sat", "down"]
plt.figure(figsize=(6, 5))
plt.imshow(weights, cmap="Blues")
plt.xticks(range(4), tokens)
plt.yticks(range(4), tokens)
plt.xlabel("Key (attending to)")
plt.ylabel("Query (from)")
plt.colorbar(label="Attention weight")
plt.title("Self-attention weights")
for i in range(4):
for j in range(4):
plt.text(j, i, f"{weights[i,j]:.2f}", ha="center", va="center", fontsize=10)
plt.show()

Each row shows how much one token “pays attention” to every other token (including itself).
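Another way to read the rows: each output vector is a convex combination (weighted average) of the value vectors, since the weights are non-negative and sum to 1. A standalone sketch with random Q/K/V (toy shapes, not the example above) checks the bound this implies:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
Q = rng.standard_normal((4, 3))
K = rng.standard_normal((4, 3))
V = rng.standard_normal((4, 2))
w = softmax(Q @ K.T / np.sqrt(3))
out = w @ V
# Convex combination: every output component lies between the min and max
# of the corresponding value column (up to float rounding).
print(((out >= V.min(axis=0) - 1e-9) & (out <= V.max(axis=0) + 1e-9)).all())
```

So attention can only mix the information already present in V — it never produces values outside the range of what the tokens provide.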
Causal Mask (for Autoregressive Models)
In GPT-style models, a token can only attend to previous tokens:
def causal_mask(seq_len):
"""Lower triangular mask — True where attention is allowed."""
return np.tril(np.ones((seq_len, seq_len), dtype=bool))
mask = causal_mask(4)
output_masked, weights_masked = scaled_dot_product_attention(Q, K, V, mask=mask)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, w, title in [(axes[0], weights, "Full attention"),
(axes[1], weights_masked, "Causal attention")]:
ax.imshow(w, cmap="Blues", vmin=0, vmax=1)
ax.set_xticks(range(4)); ax.set_xticklabels(tokens)
ax.set_yticks(range(4)); ax.set_yticklabels(tokens)
ax.set_title(title)
for i in range(4):
for j in range(4):
ax.text(j, i, f"{w[i,j]:.2f}", ha="center", va="center", fontsize=10)
plt.tight_layout()
plt.show()

“sat” can look at [“The”, “cat”, “sat”] but not “down”.
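A standalone sanity check (fresh random scores, same masking trick as above) confirms the two properties the plot shows:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
rng = np.random.default_rng(2)
scores = rng.standard_normal((seq_len, seq_len))
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # causal mask
w = softmax(np.where(mask, scores, -1e9))
print(w.round(3))
# Strict upper triangle is 0 (no attention to future tokens),
# and each row still sums to 1.
```

The -1e9 fill becomes effectively -inf after the max-subtraction inside softmax, so masked positions get exactly zero weight and the remaining weights renormalize among the allowed positions.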
Multi-Head Attention
Instead of one set of Q/K/V projections, use n_heads smaller heads, each with its own projections. Each head can learn different relationships:
class MultiHeadAttention:
def __init__(self, d_model, n_heads):
assert d_model % n_heads == 0
self.n_heads = n_heads
self.d_k = d_model // n_heads
# Projection matrices for each head (combined into one matrix)
self.W_Q = np.random.randn(d_model, d_model) * np.sqrt(2 / d_model)
self.W_K = np.random.randn(d_model, d_model) * np.sqrt(2 / d_model)
self.W_V = np.random.randn(d_model, d_model) * np.sqrt(2 / d_model)
self.W_O = np.random.randn(d_model, d_model) * np.sqrt(2 / d_model)
def split_heads(self, X):
"""(seq_len, d_model) → (n_heads, seq_len, d_k)"""
seq_len = X.shape[0]
X = X.reshape(seq_len, self.n_heads, self.d_k)
return X.transpose(1, 0, 2) # (n_heads, seq_len, d_k)
def forward(self, X, mask=None):
Q = self.split_heads(X @ self.W_Q)
K = self.split_heads(X @ self.W_K)
V = self.split_heads(X @ self.W_V)
# Attention per head
all_outputs = []
all_weights = []
for h in range(self.n_heads):
out, w = scaled_dot_product_attention(Q[h], K[h], V[h], mask)
all_outputs.append(out)
all_weights.append(w)
# Concatenate heads: (n_heads, seq_len, d_k) → (seq_len, d_model)
concat = np.concatenate(all_outputs, axis=-1)
# Final projection
output = concat @ self.W_O
return output, all_weights
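The per-head Python loop above is clear but slow. The same computation can be batched across heads with a single 3-D matmul; here is a self-contained functional sketch (the helper name mha_vectorized is ours, not part of the class):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mha_vectorized(X, W_Q, W_K, W_V, W_O, n_heads):
    """All heads at once: (seq_len, d_model) -> (seq_len, d_model)."""
    seq_len, d_model = X.shape
    d_k = d_model // n_heads
    # Project, then reshape to (n_heads, seq_len, d_k)
    split = lambda Y: Y.reshape(seq_len, n_heads, d_k).transpose(1, 0, 2)
    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (n_heads, q, k)
    heads = softmax(scores) @ V                       # (n_heads, q, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
W_Q, W_K, W_V, W_O = (rng.standard_normal((8, 8)) * 0.5 for _ in range(4))
print(mha_vectorized(X, W_Q, W_K, W_V, W_O, n_heads=2).shape)  # (4, 8)
```

Deep-learning frameworks batch heads this way (with an extra batch dimension in front); the loop version and the batched version compute the same numbers.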
# Test
mha = MultiHeadAttention(d_model=8, n_heads=2)
output, head_weights = mha.forward(X)
print(f"Input shape: {X.shape}")
print(f"Output shape: {output.shape}")
print(f"Number of heads: {len(head_weights)}")

Visualize per-head attention
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for h, (ax, w) in enumerate(zip(axes, head_weights)):
ax.imshow(w, cmap="Blues", vmin=0, vmax=1)
ax.set_xticks(range(4)); ax.set_xticklabels(tokens)
ax.set_yticks(range(4)); ax.set_yticklabels(tokens)
ax.set_title(f"Head {h}")
for i in range(4):
for j in range(4):
ax.text(j, i, f"{w[i,j]:.2f}", ha="center", va="center", fontsize=9)
plt.suptitle("Different heads learn different attention patterns")
plt.tight_layout()
plt.show()

Positional Encoding
Attention has no notion of order — permuting the input tokens simply permutes the output rows, so the tokens {A, B, C} and {C, A, B} carry exactly the same information to the model. Positional encoding injects position information:
def positional_encoding(seq_len, d_model):
pe = np.zeros((seq_len, d_model))
pos = np.arange(seq_len)[:, None]
div = 10000 ** (2 * np.arange(d_model // 2)[None, :] / d_model)
pe[:, 0::2] = np.sin(pos / div)
pe[:, 1::2] = np.cos(pos / div)
return pe
pe = positional_encoding(50, 32)
plt.figure(figsize=(12, 4))
plt.imshow(pe.T, aspect="auto", cmap="RdBu")
plt.xlabel("Position"); plt.ylabel("Dimension")
plt.colorbar()
plt.title("Positional encoding — each position has a unique pattern")
plt.show()

Why sin/cos?
PE(pos + k) can be expressed as a linear function of PE(pos): each (sin, cos) pair is rotated by a fixed angle that depends only on the offset k, so the model can learn to attend to relative positions.
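This is checkable numerically. Shifting the position by k rotates the (sin, cos) pair at frequency ω by the fixed angle kω; a standalone sketch (repeating the positional_encoding definition so it runs on its own):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pe = np.zeros((seq_len, d_model))
    pos = np.arange(seq_len)[:, None]
    div = 10000 ** (2 * np.arange(d_model // 2)[None, :] / d_model)
    pe[:, 0::2] = np.sin(pos / div)
    pe[:, 1::2] = np.cos(pos / div)
    return pe

seq_len, d_model, k = 50, 32, 3
pe = positional_encoding(seq_len, d_model)
omega = 1.0 / (10000 ** (2 * np.arange(d_model // 2) / d_model))
c, s = np.cos(k * omega), np.sin(k * omega)
# Rotate each (sin, cos) pair of PE(pos) by k*omega; angle-addition
# identities say the result must equal PE(pos + k).
pred = np.empty_like(pe[:-k])
pred[:, 0::2] = pe[:-k, 0::2] * c + pe[:-k, 1::2] * s
pred[:, 1::2] = pe[:-k, 1::2] * c - pe[:-k, 0::2] * s
print(np.abs(pred - pe[k:]).max())  # close to 0 (only float rounding error)
```

Since the rotation coefficients depend only on k, a single learned linear map can express "k positions back" for every query position at once.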
Putting It Together: Mini Transformer Block
def layer_norm(x, eps=1e-5):
mean = x.mean(axis=-1, keepdims=True)
std = x.std(axis=-1, keepdims=True)
return (x - mean) / (std + eps)
def feed_forward(x, W1, b1, W2, b2):
return np.maximum(0, x @ W1 + b1) @ W2 + b2 # ReLU activation
class TransformerBlock:
def __init__(self, d_model, n_heads, d_ff):
self.mha = MultiHeadAttention(d_model, n_heads)
self.ff_W1 = np.random.randn(d_model, d_ff) * np.sqrt(2 / d_model)
self.ff_b1 = np.zeros(d_ff)
self.ff_W2 = np.random.randn(d_ff, d_model) * np.sqrt(2 / d_ff)
self.ff_b2 = np.zeros(d_model)
def forward(self, X, mask=None):
# Self-attention + residual + norm
attn_out, weights = self.mha.forward(X, mask)
X = layer_norm(X + attn_out)
# Feed-forward + residual + norm
ff_out = feed_forward(X, self.ff_W1, self.ff_b1, self.ff_W2, self.ff_b2)
X = layer_norm(X + ff_out)
return X, weights
# Test
block = TransformerBlock(d_model=8, n_heads=2, d_ff=32)
X_with_pe = X + positional_encoding(4, 8)
out, weights = block.forward(X_with_pe)
print(f"Transformer block: {X_with_pe.shape} → {out.shape}")

Exercises
- Attention is all you need? Remove the feed-forward network from the transformer block. How does this affect the output? The FFN adds “thinking” capacity — attention alone can only mix information.
- Cross-attention: Implement attention where Q comes from one sequence and K/V from another. This is how encoder-decoder models work (the decoder queries the encoder).
- Relative positions: Instead of adding positional encoding to the input, add a learned relative position bias to the attention scores: scores[i, j] += b[i − j].
- Real self-attention: Take a sentence, embed each word using random vectors, and pass it through self-attention. Visualize which words attend to which — even random weights will show the “attending to self” pattern.
See Also
- Attention Is All You Need — the original transformer paper (Vaswani et al., 2017). Introduced scaled dot-product attention, multi-head attention, and positional encodings that form the basis of modern LLMs.
Next: 09 - Feature Engineering Cookbook — practical data transforms that make models work.