The Minimal Op Set
Goal: Discover that ALL of deep learning — every layer, every loss function, every optimizer — reduces to about two dozen fundamental operations. Derive complex operations from primitives. Insight from tinygrad.
Prerequisites: 05 - Neural Network from Scratch, Neurons and Activation Functions, 15 - Build an Autograd Engine
The Claim
Every neural network operation — matmul, convolution, attention, batch normalization, cross-entropy — can be decomposed into combinations of these primitives:
Arithmetic (8): ADD, MUL, MAX, NEG, RECIP, EXP2, LOG2, SQRT
Comparison (3): CMPLT, CMPNE, CMPEQ
Ternary (1): WHERE (if-then-else)
Reduction (2): REDUCE_ADD, REDUCE_MAX
Movement (6): RESHAPE, EXPAND, PERMUTE, PAD, SHRINK, FLIP
Memory (2): LOAD, STORE
That’s 22 operations. Everything else is syntactic sugar.
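The op set is small enough to write down in one place. A sketch as a Python enum (the names follow the list above; tinygrad's actual `Ops` enum differs in detail):

```python
from enum import Enum, auto

class Op(Enum):
    # Arithmetic (8)
    ADD = auto(); MUL = auto(); MAX = auto(); NEG = auto()
    RECIP = auto(); EXP2 = auto(); LOG2 = auto(); SQRT = auto()
    # Comparison (3)
    CMPLT = auto(); CMPNE = auto(); CMPEQ = auto()
    # Ternary (1)
    WHERE = auto()
    # Reduction (2)
    REDUCE_ADD = auto(); REDUCE_MAX = auto()
    # Movement (6)
    RESHAPE = auto(); EXPAND = auto(); PERMUTE = auto()
    PAD = auto(); SHRINK = auto(); FLIP = auto()
    # Memory (2)
    LOAD = auto(); STORE = auto()

print(len(Op))  # 22
```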
Derived Operations: Level 1
Build common math from the primitives:
import numpy as np
# Primitives (we have these)
def add(a, b): return a + b
def mul(a, b): return a * b
def neg(a): return -a
def recip(a): return 1.0 / a # a^(-1)
def exp2(a): return 2.0 ** a
def log2(a): return np.log2(a)
def sqrt(a): return np.sqrt(a)
def maximum(a, b): return np.maximum(a, b)
def where(cond, a, b): return np.where(cond, a, b)
def cmplt(a, b): return a < b
# Level 1 derived ops
def sub(a, b): return add(a, neg(b)) # a - b = a + (-b)
def div(a, b): return mul(a, recip(b)) # a / b = a * (1/b)
def exp(a): return exp2(mul(a, 1.4426950408889634)) # e^a = 2^(a/ln2)
def log(a): return mul(log2(a), 0.6931471805599453) # ln(a) = log2(a) * ln(2)
def pow(a, b): return exp2(mul(log2(a), b)) # a^b = 2^(b*log2(a)), valid for a > 0
def abs_(a): return where(cmplt(a, 0), neg(a), a)
def sign(a): return where(cmplt(a, 0), -1.0, where(cmplt(0, a), 1.0, 0.0))
# Test
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(f"sub(5, 3) = {sub(5.0, 3.0)}")
print(f"div(10, 3) = {div(10.0, 3.0)}")
print(f"exp(1) = {exp(1.0):.6f} (should be 2.718282)")
print(f"log(e) = {log(np.e):.6f} (should be 1.0)")
print(f"pow(2, 10) = {pow(2.0, 10.0)}")
print(f"abs(-3) = {abs_(-3.0)}")
Notice: exp and log are built from exp2 and log2. Why? Because GPUs have hardware instructions for base-2 operations (they’re binary machines), not base-e.
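A couple more everyday helpers fall out of MAX and NEG alone; a sketch (the names `min_` and `clip` are ours, not primitives):

```python
import numpy as np

# Primitives from above, restated so this snippet runs standalone
def neg(a): return -a
def maximum(a, b): return np.maximum(a, b)

# min and clip from MAX + NEG alone
def min_(a, b): return neg(maximum(neg(a), neg(b)))   # min(a,b) = -max(-a,-b)
def clip(x, lo, hi): return min_(maximum(x, lo), hi)  # clamp x into [lo, hi]

print(min_(3.0, 5.0))                              # 3.0
print(clip(np.array([-2.0, 0.5, 7.0]), 0.0, 1.0))  # [0.  0.5 1. ]
```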
Derived Operations: Level 2 — Activation Functions
Every activation function decomposes:
def relu(x):
    return maximum(x, 0.0)

def sigmoid(x):
    return recip(add(1.0, exp(neg(x))))  # 1 / (1 + e^(-x))

def tanh(x):
    e2x = exp(mul(2.0, x))
    return div(sub(e2x, 1.0), add(e2x, 1.0))  # (e^2x - 1) / (e^2x + 1)

def gelu(x):
    # Sigmoid approximation: x * sigmoid(1.702 * x)
    return mul(x, sigmoid(mul(1.702, x)))

def softmax(x, axis=-1):
    x_max = x.max(axis=axis, keepdims=True)         # REDUCE_MAX
    e = exp(sub(x, x_max))                          # stability trick
    return div(e, e.sum(axis=axis, keepdims=True))  # REDUCE_ADD
# Test
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(f"relu: {relu(x)}")
print(f"sigmoid: {sigmoid(x).round(4)}")
print(f"tanh: {tanh(x).round(4)}")
print(f"gelu: {gelu(x).round(4)}")
print(f"softmax: {softmax(x).round(4)}")
Softmax uses only: REDUCE_MAX, SUB (= ADD + NEG), EXP (= EXP2 + MUL), REDUCE_ADD, DIV (= MUL + RECIP).
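The same max-subtraction trick gives log-softmax directly, which is the numerically safe input to cross-entropy. A standalone sketch built from the same primitives (helpers restated here so it runs on its own):

```python
import numpy as np

LOG2E = 1.4426950408889634   # log2(e)
LN2 = 0.6931471805599453     # ln(2)

def exp(a): return 2.0 ** (a * LOG2E)  # EXP2 + MUL
def log(a): return np.log2(a) * LN2    # LOG2 + MUL

def log_softmax(x, axis=-1):
    x_shift = x - x.max(axis=axis, keepdims=True)  # REDUCE_MAX + SUB
    # SUB of the log of the summed exponentials (EXP, REDUCE_ADD, LOG)
    return x_shift - log(exp(x_shift).sum(axis=axis, keepdims=True))

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
# Agrees with log(softmax(x)) but never evaluates log(0)
print(log_softmax(x).round(4))
```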
Derived Operations: Level 3 — Loss Functions
def mse_loss(pred, target):
    diff = sub(pred, target)
    return mul(diff, diff).mean()  # REDUCE_ADD + MUL by 1/n

def cross_entropy(logits, labels_onehot):
    probs = softmax(logits)
    log_probs = log(add(probs, 1e-15))  # avoid log(0)
    return neg(mul(labels_onehot, log_probs).sum(axis=-1).mean())

def binary_cross_entropy(pred, target):
    return neg(add(
        mul(target, log(add(pred, 1e-15))),
        mul(sub(1.0, target), log(add(sub(1.0, pred), 1e-15)))
    )).mean()
# Test
pred = np.array([2.0, 1.0, 0.1])
target = np.array([1.0, 0.0, 0.0]) # class 0
print(f"MSE loss: {mse_loss(pred, target):.4f}")
print(f"Cross-entropy: {cross_entropy(pred, target):.4f}")
Movement Operations: The Hidden Foundation
The 6 movement ops are what make tensor operations work without copying data:
# RESHAPE: Reinterpret memory layout
a = np.array([1, 2, 3, 4, 5, 6])
b = a.reshape(2, 3) # same memory, different shape
print(f"reshape: {a.shape} → {b.shape}, same memory: {np.shares_memory(a, b)}")
# EXPAND: Broadcast (virtual copies)
c = np.array([[1], [2], [3]]) # (3, 1)
d = np.broadcast_to(c, (3, 4)) # (3, 4), no new memory
print(f"expand: {c.shape} → {d.shape}")
# PERMUTE: Transpose (reorder dimensions)
e = np.arange(24).reshape(2, 3, 4)
f = e.transpose(0, 2, 1) # (2, 4, 3), same memory
print(f"permute: {e.shape} → {f.shape}, same memory: {np.shares_memory(e, f)}")
# SHRINK: Slice (view into subset)
g = np.arange(10)
h = g[3:7] # same memory, offset pointer
print(f"shrink: {g.shape} → {h.shape}, same memory: {np.shares_memory(g, h)}")
# PAD: Add zeros around edges (np.pad copies; lazy frameworks can keep even this virtual)
i = np.array([[1, 2], [3, 4]])
j = np.pad(i, 1, constant_values=0)
print(f"pad: {i.shape} → {j.shape}")
# FLIP: Reverse a dimension (negative stride)
k = np.array([1, 2, 3, 4, 5])
l = k[::-1] # same memory, negative stride
print(f"flip: {k} → {l}, same memory: {np.shares_memory(k, l)}")
Why these 6 are enough
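Even concatenation needs no primitive of its own: tinygrad builds `cat` by padding each input with zeros up to the full output shape and adding. A minimal 1-D sketch:

```python
import numpy as np

def cat(a, b):
    """Concatenate two 1-D arrays using only PAD and ADD."""
    a_padded = np.pad(a, (0, len(b)))  # [a..., 0, 0]   (PAD on the right)
    b_padded = np.pad(b, (len(a), 0))  # [0, 0, b...]   (PAD on the left)
    return a_padded + b_padded         # ADD

print(cat(np.array([1, 2, 3]), np.array([4, 5])))  # [1 2 3 4 5]
```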
Matmul decomposes into movement + element-wise + reduction:
def matmul_from_primitives(A, B):
    """A(m,k) @ B(k,n) using only primitives."""
    m, k = A.shape
    _, n = B.shape
    # RESHAPE: A to (m, k, 1) and B to (1, k, n)
    A_expanded = A.reshape(m, k, 1)
    B_expanded = B.reshape(1, k, n)
    # EXPAND: broadcast both to (m, k, n)
    A_broad = np.broadcast_to(A_expanded, (m, k, n))
    B_broad = np.broadcast_to(B_expanded, (m, k, n))
    # MUL: element-wise multiply
    product = A_broad * B_broad
    # REDUCE_ADD: sum over the shared k dimension
    result = product.sum(axis=1)
    return result
A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
print(f"matmul matches: {np.allclose(matmul_from_primitives(A, B), A @ B)}")
Convolution decomposes similarly — but with SHRINK (sliding windows) + EXPAND + MUL + REDUCE_ADD.
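The 1-D case shows the pattern in miniature. A sketch, with NumPy's `sliding_window_view` standing in for the SHRINK-based windowing:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv1d_from_primitives(x, w):
    """Cross-correlation of 1-D x with kernel w: SHRINK (windows) + MUL + REDUCE_ADD."""
    windows = sliding_window_view(x, len(w))  # (len(x)-len(w)+1, len(w)), no copy
    return (windows * w).sum(axis=1)          # MUL + REDUCE_ADD

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 0.0, -1.0])
print(conv1d_from_primitives(x, w))  # matches np.correlate(x, w, mode='valid')
```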
The Full Decomposition Table
| Operation | Primitives Used |
|---|---|
| subtract | ADD, NEG |
| divide | MUL, RECIP |
| exp | EXP2, MUL |
| log | LOG2, MUL |
| pow | EXP2, LOG2, MUL |
| sigmoid | NEG, EXP, ADD, RECIP |
| tanh | MUL, EXP, ADD, SUB, DIV |
| relu | MAX(x, 0) |
| gelu | MUL, sigmoid |
| softmax | REDUCE_MAX, SUB, EXP, REDUCE_ADD, DIV |
| matmul | RESHAPE, EXPAND, MUL, REDUCE_ADD |
| conv2d | SHRINK (windows), EXPAND, MUL, REDUCE_ADD |
| cross-entropy | softmax, LOG, MUL, NEG, REDUCE_ADD |
| batch norm | REDUCE_ADD (mean), SUB, MUL, SQRT, RECIP, ADD |
| layer norm | REDUCE_ADD (mean, var), SUB, MUL, SQRT, RECIP |
| attention | matmul, MUL (scale), softmax, matmul |
| embedding | SHRINK (index select) |
| dropout | MUL, WHERE, random mask |
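To sanity-check the layer-norm row of the table, here is a sketch using only REDUCE_ADD (mean and variance), SUB, MUL, SQRT, and RECIP. The eps term is our addition for numerical safety, and the learnable scale/shift parameters are omitted:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean, unit variance, from primitives only."""
    n = x.shape[-1]
    mean = x.sum(axis=-1, keepdims=True) * (1.0 / n)                     # REDUCE_ADD + MUL
    centered = x - mean                                                  # SUB
    var = (centered * centered).sum(axis=-1, keepdims=True) * (1.0 / n)  # MUL + REDUCE_ADD + MUL
    return centered * (1.0 / np.sqrt(var + eps))                         # ADD + SQRT + RECIP + MUL

x = np.random.randn(2, 8)
out = layer_norm(x)
print(out.mean(axis=-1).round(6))  # ≈ 0 per row
print(out.var(axis=-1).round(3))   # ≈ 1 per row
```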
Convolution from Primitives
def conv2d_from_primitives(x, w, padding=0):
    """
    x: (batch, c_in, h, w)
    w: (c_out, c_in, kh, kw)
    Using only: PAD, SHRINK (windows), RESHAPE, EXPAND, MUL, REDUCE_ADD
    """
    batch, c_in, h, w_dim = x.shape
    c_out, _, kh, kw = w.shape
    # PAD
    if padding > 0:
        x = np.pad(x, ((0, 0), (0, 0), (padding, padding), (padding, padding)))
        _, _, h, w_dim = x.shape
    oh, ow = h - kh + 1, w_dim - kw + 1
    # SHRINK: extract sliding windows → (batch, c_in, oh, ow, kh, kw)
    windows = np.lib.stride_tricks.sliding_window_view(x, (kh, kw), axis=(2, 3))
    # RESHAPE (broadcasting supplies the EXPAND): make shapes compatible
    # windows: (batch, c_in, oh, ow, kh, kw) → (batch, 1, c_in, oh, ow, kh, kw)
    # w:       (c_out, c_in, kh, kw)         → (1, c_out, c_in, 1, 1, kh, kw)
    w_expanded = w.reshape(1, c_out, c_in, 1, 1, kh, kw)
    windows_expanded = windows.reshape(batch, 1, c_in, oh, ow, kh, kw)
    # MUL + REDUCE_ADD: element-wise multiply, then sum over c_in, kh, kw
    product = windows_expanded * w_expanded
    result = product.sum(axis=(2, 5, 6))
    return result  # (batch, c_out, oh, ow)
# Test
x = np.random.randn(1, 3, 8, 8)   # 1 image, 3 channels, 8×8
w = np.random.randn(16, 3, 3, 3)  # 16 filters of size 3×3
out = conv2d_from_primitives(x, w, padding=1)
print(f"Conv2d output shape: {out.shape}")  # (1, 16, 8, 8)
Why This Matters
For understanding
When you see nn.Conv2d(3, 64, 3), you now know it is just windowed multiply-and-sum over primitives; the “magic” of convolution is smart indexing plus a matmul.
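That "smart indexing + matmul" claim can be made literal with the classic im2col trick: flatten each window into a row, flatten each filter into a column, and do one matrix multiply. A sketch (it computes the same result as `conv2d_from_primitives` above, just routed through a single matmul):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_im2col(x, w):
    """conv2d as one matmul: each window becomes a row of a big matrix."""
    batch, c_in, h, wd = x.shape
    c_out, _, kh, kw = w.shape
    oh, ow = h - kh + 1, wd - kw + 1
    windows = sliding_window_view(x, (kh, kw), axis=(2, 3))  # (b, c_in, oh, ow, kh, kw)
    # im2col: (b*oh*ow, c_in*kh*kw) — one row per output position
    cols = windows.transpose(0, 2, 3, 1, 4, 5).reshape(batch * oh * ow, c_in * kh * kw)
    out = cols @ w.reshape(c_out, -1).T                      # the single matmul
    return out.reshape(batch, oh, ow, c_out).transpose(0, 3, 1, 2)

x = np.random.randn(1, 3, 8, 8)
w = np.random.randn(16, 3, 3, 3)
print(conv2d_im2col(x, w).shape)  # (1, 16, 6, 6)
```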
For hardware
A GPU’s ALU exposes a small set of hardware instructions that these primitives map onto almost one-to-one. When tinygrad generates CUDA code for a transformer, it produces kernels using these exact primitives.
For optimization
Fewer primitive operations = smaller compiler. tinygrad’s entire compiler is ~10K lines because it only needs to optimize 22 operations, not hundreds.
For autograd
With only 22 ops, you need only ~22 gradient rules (like in tinygrad’s pm_gradient pattern matcher, vs PyTorch’s 500+ torch.autograd.Function subclasses).
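A taste of what those rules look like: each primitive’s local derivative is itself expressible in the primitives. A sketch checked against finite differences (the rule table below is our illustration, not tinygrad’s actual pm_gradient):

```python
import numpy as np

LN2 = 0.6931471805599453  # ln(2)

# d(f(x))/dx for a few unary primitives, written with primitives only
grad_rules = {
    "NEG":   lambda x: -np.ones_like(x),  # d(-x)/dx   = -1
    "RECIP": lambda x: -1.0 / (x * x),    # d(1/x)/dx  = -x^-2        (NEG, MUL, RECIP)
    "EXP2":  lambda x: (2.0 ** x) * LN2,  # d(2^x)/dx  = 2^x * ln 2   (EXP2, MUL)
    "SQRT":  lambda x: 0.5 / np.sqrt(x),  # d(√x)/dx   = 1/(2√x)      (SQRT, RECIP, MUL)
}

fns = {"NEG": lambda x: -x, "RECIP": lambda x: 1.0 / x,
       "EXP2": lambda x: 2.0 ** x, "SQRT": np.sqrt}

x = np.array([0.5, 1.0, 2.0])
eps = 1e-6
for name, f in fns.items():
    numeric = (f(x + eps) - f(x - eps)) / (2 * eps)  # central difference
    print(name, np.allclose(grad_rules[name](x), numeric, atol=1e-4))
```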
Exercises
1. Implement batch normalization from primitives: use only REDUCE_ADD (mean), SUB, MUL, SQRT, RECIP. Verify against np.mean and np.var.
2. Implement attention from primitives: self-attention is just matmul(Q, K^T) → scale → softmax → matmul(weights, V). Write it using only the 22 primitives.
3. Count primitives: take a real model (e.g., ResNet-18) and count how many of each primitive op it uses. Which op dominates? (Spoiler: MUL and ADD, because of matmul.)
4. Missing ops: try implementing argmax from the 22 primitives. (Hint: use CMPEQ with the max value to create a mask, then multiply by indices.)
5. Explore tinygrad’s ops: run grep -r "class.*Ops" tinygrad/uop/ops.py to see the actual enum. Compare with our list.
Next: 27 - Tensor Internals Strides and Views — how tensors live in memory.