Attention Mechanism from Scratch

Goal: Build scaled dot-product attention, multi-head attention, and a minimal self-attention layer from scratch. Understand why it works.

Prerequisites: Attention Mechanism, Transformers, Dot Product, Matrix Multiplication


Why Attention?

RNNs process sequences one token at a time — information from early tokens gets diluted. Attention lets every token look at every other token directly. It’s also massively parallelizable.


Scaled Dot-Product Attention

  • Q (Query): “What am I looking for?”
  • K (Key): “What do I contain?”
  • V (Value): “What information do I provide?”
import numpy as np
import matplotlib.pyplot as plt
 
def softmax(x, axis=-1):
    """Numerically stable softmax along *axis* (shift by the max first)."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (seq_len_q, d_k)
    K: (seq_len_k, d_k)
    V: (seq_len_k, d_v)
    mask: optional boolean array broadcastable to (seq_len_q, seq_len_k);
          True where attention is allowed.
    Returns: output (seq_len_q, d_v), attention_weights (seq_len_q, seq_len_k)
    """
    scale = np.sqrt(Q.shape[-1])
    scores = (Q @ K.T) / scale          # (seq_len_q, seq_len_k)

    # Disallowed positions get a large negative score, so softmax ≈ 0 there.
    if mask is not None:
        scores = np.where(mask, scores, -1e9)

    weights = softmax(scores)           # each row sums to 1
    return weights @ V, weights

Why scale by $\sqrt{d_k}$?

Without scaling, the dot products grow in magnitude with $\sqrt{d_k}$ (their variance grows with $d_k$), pushing softmax into extreme values (near 0 or 1) where gradients vanish. Scaling keeps gradients healthy.


Worked Example

Imagine 4 words: [“The”, “cat”, “sat”, “down”]. Each has an 8-dimensional embedding.

# Fake embeddings for 4 tokens, d_model=8
np.random.seed(42)
seq_len, d_model = 4, 8
X = np.random.randn(seq_len, d_model)

# In self-attention, Q/K/V come from the same input via linear projections
# (learned in a real model; random here). d_k/d_v need not equal d_model.
d_k = d_v = 4
W_Q = np.random.randn(d_model, d_k) * 0.5
W_K = np.random.randn(d_model, d_k) * 0.5
W_V = np.random.randn(d_model, d_v) * 0.5

Q = X @ W_Q   # (4, 4)
K = X @ W_K   # (4, 4)
V = X @ W_V   # (4, 4)

output, weights = scaled_dot_product_attention(Q, K, V)

print(f"Input shape:   {X.shape}")
print(f"Output shape:  {output.shape}")
print(f"Weights shape: {weights.shape}")
print(f"\nAttention weights (each row sums to 1):")
print(weights.round(3))

Visualize Attention Weights

tokens = ["The", "cat", "sat", "down"]

# Heatmap of the 4×4 attention matrix, with every cell annotated.
plt.figure(figsize=(6, 5))
plt.imshow(weights, cmap="Blues")
plt.xticks(range(4), tokens)
plt.yticks(range(4), tokens)
plt.xlabel("Key (attending to)")
plt.ylabel("Query (from)")
plt.colorbar(label="Attention weight")
plt.title("Self-attention weights")
for i, j in np.ndindex(4, 4):
    plt.text(j, i, f"{weights[i,j]:.2f}", ha="center", va="center", fontsize=10)
plt.show()

Each row shows how much one token “pays attention” to every other token (including itself).


Causal Mask (for Autoregressive Models)

In GPT-style models, a token can only attend to previous tokens:

def causal_mask(seq_len):
    """Boolean (seq_len, seq_len) mask: entry (i, j) is True iff j <= i,
    so each position may attend only to itself and earlier positions."""
    rows = np.arange(seq_len)[:, None]
    cols = np.arange(seq_len)[None, :]
    return cols <= rows
 
mask = causal_mask(4)
output_masked, weights_masked = scaled_dot_product_attention(Q, K, V, mask=mask)

# Side-by-side: unmasked vs. causally-masked attention weights.
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
panels = [(weights, "Full attention"), (weights_masked, "Causal attention")]
for ax, (w, title) in zip(axes, panels):
    ax.imshow(w, cmap="Blues", vmin=0, vmax=1)
    ax.set_xticks(range(4)); ax.set_xticklabels(tokens)
    ax.set_yticks(range(4)); ax.set_yticklabels(tokens)
    ax.set_title(title)
    for i, j in np.ndindex(4, 4):
        ax.text(j, i, f"{w[i,j]:.2f}", ha="center", va="center", fontsize=10)
plt.tight_layout()
plt.show()

“sat” can look at [“The”, “cat”, “sat”] but not “down”.


Multi-Head Attention

Instead of one set of Q/K/V, use $h$ heads, each with its own projections. Each head learns different relationships:

class MultiHeadAttention:
    """Multi-head self-attention: n_heads independent scaled dot-product
    attentions over learned projections, concatenated and re-projected."""

    def __init__(self, d_model, n_heads):
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        # One (d_model, d_model) matrix per projection; the heads are
        # carved out of the output dimension by split_heads.
        def _proj():
            return np.random.randn(d_model, d_model) * np.sqrt(2 / d_model)

        self.W_Q = _proj()
        self.W_K = _proj()
        self.W_V = _proj()
        self.W_O = _proj()

    def split_heads(self, X):
        """(seq_len, d_model) → (n_heads, seq_len, d_k)"""
        return X.reshape(-1, self.n_heads, self.d_k).transpose(1, 0, 2)

    def forward(self, X, mask=None):
        Q = self.split_heads(X @ self.W_Q)
        K = self.split_heads(X @ self.W_K)
        V = self.split_heads(X @ self.W_V)

        # Run attention independently on every head.
        per_head = [scaled_dot_product_attention(q, k, v, mask)
                    for q, k, v in zip(Q, K, V)]
        head_outputs = [out for out, _ in per_head]
        head_weights = [w for _, w in per_head]

        # Concatenate heads: n_heads × (seq_len, d_k) → (seq_len, d_model),
        # then mix them with the final output projection.
        concat = np.concatenate(head_outputs, axis=-1)
        return concat @ self.W_O, head_weights
 
# Smoke test: output keeps the input's (seq_len, d_model) shape, and we
# get one attention-weight matrix per head.
mha = MultiHeadAttention(d_model=8, n_heads=2)
output, head_weights = mha.forward(X)

print(f"Input shape:  {X.shape}")
print(f"Output shape: {output.shape}")
print(f"Number of heads: {len(head_weights)}")

Visualize per-head attention

# One heatmap per head; annotate every cell with its attention weight.
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for h in range(2):
    ax, w = axes[h], head_weights[h]
    ax.imshow(w, cmap="Blues", vmin=0, vmax=1)
    ax.set_xticks(range(4)); ax.set_xticklabels(tokens)
    ax.set_yticks(range(4)); ax.set_yticklabels(tokens)
    ax.set_title(f"Head {h}")
    for i, j in np.ndindex(4, 4):
        ax.text(j, i, f"{w[i,j]:.2f}", ha="center", va="center", fontsize=9)
plt.suptitle("Different heads learn different attention patterns")
plt.tight_layout()
plt.show()

Positional Encoding

Attention has no notion of order — {A, B, C} and {C, A, B} produce the same result. Positional encoding injects position information:

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding (Vaswani et al., 2017).

    pe[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    pe[pos, 2i+1] = cos(pos / 10000^(2i / d_model))

    Returns a (seq_len, d_model) array. Unlike the even-only version,
    this also handles odd d_model (the slice pe[:, 0::2] has one more
    column than d_model // 2, which made the original assignment crash).
    """
    pos = np.arange(seq_len)[:, None]      # (seq_len, 1)
    dim = np.arange(d_model)[None, :]      # (1, d_model)
    # Paired dims (2i, 2i+1) share the same frequency 1 / 10000^(2i/d_model).
    angle = pos / (10000 ** (2 * (dim // 2) / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe
 
pe = positional_encoding(50, 32)

# Rows = embedding dimensions, columns = positions; every column differs.
plt.figure(figsize=(12, 4))
plt.imshow(pe.T, aspect="auto", cmap="RdBu")
plt.xlabel("Position"); plt.ylabel("Dimension")
plt.colorbar()
plt.title("Positional encoding — each position has a unique pattern")
plt.show()

Why sin/cos?

$PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$ — so the model can learn to attend to relative positions.


Putting It Together: Mini Transformer Block

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean / unit std (no learned scale
    or bias here; eps guards against division by zero)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)
 
def feed_forward(x, W1, b1, W2, b2):
    """Position-wise MLP: ReLU(x @ W1 + b1) @ W2 + b2."""
    hidden = np.maximum(x @ W1 + b1, 0)  # ReLU
    return hidden @ W2 + b2
 
class TransformerBlock:
    """One post-norm transformer layer: self-attention and a position-wise
    feed-forward network, each wrapped in residual add + layer norm."""

    def __init__(self, d_model, n_heads, d_ff):
        self.mha = MultiHeadAttention(d_model, n_heads)
        # He-style initialization for the two FFN layers.
        self.ff_W1 = np.random.randn(d_model, d_ff) * np.sqrt(2 / d_model)
        self.ff_b1 = np.zeros(d_ff)
        self.ff_W2 = np.random.randn(d_ff, d_model) * np.sqrt(2 / d_ff)
        self.ff_b2 = np.zeros(d_model)

    def forward(self, X, mask=None):
        # Sub-layer 1: self-attention, then residual + norm.
        attn_out, weights = self.mha.forward(X, mask)
        normed = layer_norm(X + attn_out)

        # Sub-layer 2: feed-forward, then residual + norm.
        ff_out = feed_forward(normed, self.ff_W1, self.ff_b1,
                              self.ff_W2, self.ff_b2)
        return layer_norm(normed + ff_out), weights
 
# Smoke test: add positional encodings, run one block, and confirm the
# shape is preserved — transformer blocks map (seq_len, d_model) to the
# same shape, which is what lets them be stacked.
block = TransformerBlock(d_model=8, n_heads=2, d_ff=32)
X_with_pe = X + positional_encoding(4, 8)
out, weights = block.forward(X_with_pe)
print(f"Transformer block: {X_with_pe.shape}{out.shape}")

Exercises

  1. Attention is all you need? Remove the feed-forward network from the transformer block. How does this affect the output? The FFN adds “thinking” capacity — attention alone can only mix information.

  2. Cross-attention: Implement attention where Q comes from one sequence and K/V from another. This is how encoder-decoder models work (decoder queries the encoder).

  3. Relative positions: Instead of adding positional encoding to the input, add a learned relative position bias to the attention scores: $\text{scores} = \frac{QK^\top}{\sqrt{d_k}} + B$, where $B_{ij}$ depends only on the offset $j - i$.

  4. Real self-attention: Take a sentence, embed each word using random vectors, pass through self-attention. Visualize which words attend to which — even random weights will show the “attending to self” pattern.


See Also

  • Attention Is All You Need — the original transformer paper (Vaswani et al., 2017). Introduced scaled dot-product attention, multi-head attention, and positional encodings that form the basis of modern LLMs.

Next: 09 - Feature Engineering Cookbook — practical data transforms that make models work.