The Minimal Op Set

Goal: Discover that ALL of deep learning — every layer, every loss function, every optimizer — reduces to roughly two dozen fundamental operations. Derive complex operations from primitives. Insight from tinygrad.

Prerequisites: 05 - Neural Network from Scratch, Neurons and Activation Functions, 15 - Build an Autograd Engine


The Claim

Every neural network operation — matmul, convolution, attention, batch normalization, cross-entropy — can be decomposed into combinations of these primitives:

Arithmetic (8): ADD, MUL, MAX, NEG, RECIP, EXP2, LOG2, SQRT
Comparison (3): CMPLT, CMPNE, CMPEQ
Ternary (1): WHERE (if-then-else)
Reduction (2): REDUCE_ADD, REDUCE_MAX
Movement (6): RESHAPE, EXPAND, PERMUTE, PAD, SHRINK, FLIP
Memory (2): LOAD, STORE

That’s 22 operations. Everything else is syntactic sugar.


Derived Operations: Level 1

Build common math from the primitives:

import numpy as np
 
# Primitives (we have these)
def add(a, b): return a + b
def mul(a, b): return a * b
def neg(a): return -a
def recip(a): return 1.0 / a  # a^(-1)
def exp2(a): return 2.0 ** a
def log2(a): return np.log2(a)
def sqrt(a): return np.sqrt(a)
def maximum(a, b): return np.maximum(a, b)
def where(cond, a, b): return np.where(cond, a, b)
def cmplt(a, b): return a < b
 
# Level 1 derived ops
def sub(a, b): return add(a, neg(b))               # a - b = a + (-b)
def div(a, b): return mul(a, recip(b))              # a / b = a * (1/b)
def exp(a): return exp2(mul(a, 1.4426950408889634)) # e^a = 2^(a/ln2)
def log(a): return mul(log2(a), 0.6931471805599453) # ln(a) = log2(a) * ln(2)
def pow(a, b): return exp2(mul(log2(a), b))         # a^b = 2^(b*log2(a)), valid for a > 0
def abs_(a): return where(cmplt(a, 0), neg(a), a)
def sign(a): return where(cmplt(a, 0), -1.0, where(cmplt(0, a), 1.0, 0.0))
 
# Test
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(f"sub(5, 3)   = {sub(5.0, 3.0)}")
print(f"div(10, 3)  = {div(10.0, 3.0)}")
print(f"exp(1)      = {exp(1.0):.6f} (should be 2.718282)")
print(f"log(e)      = {log(np.e):.6f} (should be 1.0)")
print(f"pow(2, 10)  = {pow(2.0, 10.0)}")
print(f"abs(-3)     = {abs_(-3.0)}")

Notice: exp and log are built from exp2 and log2. Why? Because GPUs have hardware instructions for base-2 operations (they’re binary machines), not base-e.
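One caveat worth making explicit: the exp2/log2 decomposition of pow only works for positive bases, because log2 is undefined at zero and below. A quick demonstration (the function name here is illustrative, not from tinygrad):

```python
import numpy as np

def pow_via_log2(a, b):
    # a^b = 2^(b * log2(a)) -- breaks when a <= 0, since log2 is undefined there
    return 2.0 ** (b * np.log2(a))

with np.errstate(invalid="ignore"):
    print(pow_via_log2(2.0, 10.0))   # 1024.0
    print(pow_via_log2(-2.0, 3.0))   # nan, not -8.0
```

Frameworks handle negative bases with WHERE and a sign fix-up for integer exponents.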


Derived Operations: Level 2 — Activation Functions

Every activation function decomposes:

def relu(x):
    return maximum(x, 0.0)
 
def sigmoid(x):
    return recip(add(1.0, exp(neg(x))))  # 1 / (1 + e^(-x))
 
def tanh(x):
    e2x = exp(mul(2.0, x))
    return div(sub(e2x, 1.0), add(e2x, 1.0))
 
def gelu(x):
    # Approximate: x * sigmoid(1.702 * x)
    return mul(x, sigmoid(mul(1.702, x)))
 
def softmax(x, axis=-1):
    x_max = x.max(axis=axis, keepdims=True)  # REDUCE_MAX
    e = exp(sub(x, x_max))                    # stability trick
    return div(e, e.sum(axis=axis, keepdims=True))  # REDUCE_ADD
 
# Test
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(f"relu:    {relu(x)}")
print(f"sigmoid: {sigmoid(x).round(4)}")
print(f"tanh:    {tanh(x).round(4)}")
print(f"gelu:    {gelu(x).round(4)}")
print(f"softmax: {softmax(x).round(4)}")

Softmax uses only: REDUCE_MAX, SUB (=ADD+NEG), EXP (=EXP2+MUL), REDUCE_ADD, DIV (=MUL+RECIP).
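To make that decomposition fully explicit, here is softmax written with nothing but primitive-style helpers — a sketch, where `reduce_max` and `reduce_add` stand in for REDUCE_MAX and REDUCE_ADD (the helper names are illustrative, not tinygrad's API):

```python
import numpy as np

# Primitive-style helpers
def neg(a): return -a
def add(a, b): return a + b
def mul(a, b): return a * b
def recip(a): return 1.0 / a
def exp2(a): return 2.0 ** a
def reduce_max(a, axis): return a.max(axis=axis, keepdims=True)  # REDUCE_MAX
def reduce_add(a, axis): return a.sum(axis=axis, keepdims=True)  # REDUCE_ADD

LOG2_E = 1.4426950408889634  # log2(e)

def softmax_primitive(x, axis=-1):
    shifted = add(x, neg(reduce_max(x, axis)))   # SUB = ADD + NEG
    e = exp2(mul(shifted, LOG2_E))               # EXP = EXP2 + MUL
    return mul(e, recip(reduce_add(e, axis)))    # DIV = MUL + RECIP

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(softmax_primitive(x).round(4))
```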


Derived Operations: Level 3 — Loss Functions

def mse_loss(pred, target):
    diff = sub(pred, target)
    return mul(diff, diff).mean()  # REDUCE_ADD + MUL(1/n)
 
def cross_entropy(logits, labels_onehot):
    probs = softmax(logits)
    log_probs = log(add(probs, 1e-15))  # avoid log(0)
    return neg(mul(labels_onehot, log_probs).sum(axis=-1).mean())
 
def binary_cross_entropy(pred, target):
    return neg(add(
        mul(target, log(add(pred, 1e-15))),
        mul(sub(1.0, target), log(sub(1.0 + 1e-15, pred)))
    )).mean()
 
# Test
pred = np.array([2.0, 1.0, 0.1])
target = np.array([1.0, 0.0, 0.0])  # class 0
print(f"MSE loss:    {mse_loss(pred, target):.4f}")
print(f"Cross-entropy: {cross_entropy(pred, target):.4f}")
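As a sanity check (a sketch in plain NumPy, separate from the primitives above): for a one-hot target, cross-entropy reduces to the negative log-probability of the true class, which by the log-sum-exp identity equals log(sum(exp(logits))) minus the true logit:

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])
probs = np.exp(logits - logits.max())  # subtract max for stability
probs = probs / probs.sum()

# One-hot target on class 0 means cross-entropy is just -log p[0]
print(f"-log p[0] = {-np.log(probs[0]):.4f}")
```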

Movement Operations: The Hidden Foundation

The 6 movement ops are what make tensor operations work without copying data:

# RESHAPE: Reinterpret memory layout
a = np.array([1, 2, 3, 4, 5, 6])
b = a.reshape(2, 3)  # same memory, different shape
print(f"reshape: {a.shape} → {b.shape}, same memory: {np.shares_memory(a, b)}")
 
# EXPAND: Broadcast (virtual copies)
c = np.array([[1], [2], [3]])  # (3, 1)
d = np.broadcast_to(c, (3, 4))  # (3, 4), no new memory
print(f"expand: {c.shape} → {d.shape}")
 
# PERMUTE: Transpose (reorder dimensions)
e = np.arange(24).reshape(2, 3, 4)
f = e.transpose(0, 2, 1)  # (2, 4, 3), same memory
print(f"permute: {e.shape} → {f.shape}, same memory: {np.shares_memory(e, f)}")
 
# SHRINK: Slice (view into subset)
g = np.arange(10)
h = g[3:7]  # same memory, offset pointer
print(f"shrink: {g.shape} → {h.shape}, same memory: {np.shares_memory(g, h)}")
 
# PAD: Add zeros around edges (NumPy copies here; tinygrad keeps even PAD virtual via masks)
i = np.array([[1, 2], [3, 4]])
j = np.pad(i, 1, constant_values=0)
print(f"pad: {i.shape} → {j.shape}")
 
# FLIP: Reverse a dimension (negative stride)
k = np.array([1, 2, 3, 4, 5])
flipped = k[::-1]  # same memory, negative stride
print(f"flip: {k} → {flipped}, same memory: {np.shares_memory(k, flipped)}")

Why these 6 are enough

Matmul decomposes into movement + element-wise + reduction:

def matmul_from_primitives(A, B):
    """A(m,k) @ B(k,n) using only primitives."""
    m, k = A.shape
    _, n = B.shape
    # Reshape A to (m, k, 1) and B to (1, k, n)
    A_expanded = A.reshape(m, k, 1)               # RESHAPE
    B_expanded = B.reshape(1, k, n)                # RESHAPE
    # Broadcast both to (m, k, n)
    A_broad = np.broadcast_to(A_expanded, (m, k, n))  # EXPAND
    B_broad = np.broadcast_to(B_expanded, (m, k, n))  # EXPAND
    # Element-wise multiply
    product = A_broad * B_broad                         # MUL
    # Sum over k dimension
    result = product.sum(axis=1)                        # REDUCE_ADD
    return result
 
A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
print(f"matmul matches: {np.allclose(matmul_from_primitives(A, B), A @ B)}")

Convolution decomposes similarly — but with SHRINK (sliding window) + EXPAND + MUL + REDUCE_ADD.


The Full Decomposition Table

| Operation | Primitives Used |
|---|---|
| subtract | ADD, NEG |
| divide | MUL, RECIP |
| exp | EXP2, MUL |
| log | LOG2, MUL |
| pow | EXP2, LOG2, MUL |
| sigmoid | NEG, EXP, ADD, RECIP |
| tanh | MUL, EXP, ADD, SUB, DIV |
| relu | MAX(x, 0) |
| gelu | MUL, sigmoid |
| softmax | REDUCE_MAX, SUB, EXP, REDUCE_ADD, DIV |
| matmul | RESHAPE, EXPAND, MUL, REDUCE_ADD |
| conv2d | SHRINK (windows), EXPAND, MUL, REDUCE_ADD |
| cross-entropy | softmax, LOG, MUL, NEG, REDUCE_ADD |
| batch norm | REDUCE_ADD (mean), SUB, MUL, SQRT, RECIP, ADD |
| layer norm | REDUCE_ADD (mean, var), SUB, MUL, SQRT, RECIP |
| attention | matmul, MUL (scale), softmax, matmul |
| embedding | SHRINK (index select) |
| dropout | MUL, WHERE, random mask |
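The dropout row deserves a quick sketch, since it is the one entry that needs randomness supplied from outside the op set (the function signature here is illustrative):

```python
import numpy as np

def dropout(x, p, rng):
    keep = rng.random(x.shape) >= p            # random mask, built with a CMPLT-style compare
    # WHERE selects kept values; MUL + RECIP rescale so the expectation is unchanged
    return np.where(keep, x * (1.0 / (1.0 - p)), 0.0)

rng = np.random.default_rng(0)
out = dropout(np.ones(8), 0.5, rng)
print(out)  # each entry is either 0.0 or 2.0
```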

Convolution from Primitives

def conv2d_from_primitives(x, w, padding=0):
    """
    x: (batch, c_in, h, w)
    w: (c_out, c_in, kh, kw)
    Using only: pad, shrink (window), expand, mul, reduce_add
    """
    batch, c_in, h, w_dim = x.shape
    c_out, _, kh, kw = w.shape
 
    # PAD
    if padding > 0:
        x = np.pad(x, ((0,0), (0,0), (padding,padding), (padding,padding)))
        _, _, h, w_dim = x.shape
 
    oh, ow = h - kh + 1, w_dim - kw + 1
 
    # SHRINK: extract sliding windows → (batch, c_in, oh, ow, kh, kw)
    windows = np.lib.stride_tricks.sliding_window_view(x, (kh, kw), axis=(2, 3))
 
    # RESHAPE + EXPAND: make shapes compatible for broadcasting
    # windows: (batch, c_in, oh, ow, kh, kw)
    # w:       (c_out, c_in, kh, kw) → (1, c_out, c_in, 1, 1, kh, kw)
    w_expanded = w.reshape(1, c_out, c_in, 1, 1, kh, kw)
    windows_expanded = windows.reshape(batch, 1, c_in, oh, ow, kh, kw)
 
    # MUL + REDUCE_ADD: element-wise multiply then sum over c_in, kh, kw
    product = windows_expanded * w_expanded
    result = product.sum(axis=(2, 5, 6))  # sum over c_in, kh, kw
 
    return result  # (batch, c_out, oh, ow)
 
# Test
x = np.random.randn(1, 3, 8, 8)   # 1 image, 3 channels, 8×8
w = np.random.randn(16, 3, 3, 3)  # 16 filters, 3×3
 
out = conv2d_from_primitives(x, w, padding=1)
print(f"Conv2d output shape: {out.shape}")  # (1, 16, 8, 8)
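To check that the window decomposition really computes a convolution, here is a sketch comparing a compact version of it against a direct four-loop reference (both functions are illustrative names, and einsum is shorthand for the MUL + REDUCE_ADD step):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_windows(x, w):
    # same decomposition as above: windows (SHRINK) + MUL + REDUCE_ADD
    kh, kw = w.shape[-2:]
    win = sliding_window_view(x, (kh, kw), axis=(2, 3))  # (b, cin, oh, ow, kh, kw)
    return np.einsum("bcijhw,ochw->boij", win, w)        # MUL + REDUCE_ADD

def conv2d_loops(x, w):
    # naive reference: explicit loops over filters and output positions
    b, cin, h, wd = x.shape
    co, _, kh, kw = w.shape
    out = np.zeros((b, co, h - kh + 1, wd - kw + 1))
    for o in range(co):
        for i in range(out.shape[2]):
            for j in range(out.shape[3]):
                out[:, o, i, j] = (x[:, :, i:i+kh, j:j+kw] * w[o]).sum(axis=(1, 2, 3))
    return out

x = np.random.randn(1, 3, 6, 6)
w = np.random.randn(4, 3, 3, 3)
print(np.allclose(conv2d_windows(x, w), conv2d_loops(x, w)))  # True
```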

Why This Matters

For understanding

When you see nn.Conv2d(3, 64, 3), you now know it’s just windowed-multiply-sum over primitives. The “magic” of convolution is just smart indexing + matmul.

For hardware

These primitives map almost directly onto instructions GPUs execute natively. When tinygrad generates CUDA code for a transformer, it produces kernels built from these exact primitives.

For optimization

Fewer primitive operations = smaller compiler. tinygrad’s entire compiler is ~10K lines because it only needs to optimize 22 operations, not hundreds.

For autograd

With only 22 ops, you need only ~22 gradient rules (like in tinygrad’s pm_gradient pattern matcher, vs PyTorch’s 500+ torch.autograd.Function subclasses).
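To make that concrete, here is what a handful of those gradient rules look like as vector-Jacobian products — a sketch of the idea, not tinygrad's actual pm_gradient rules. Each rule maps an upstream gradient g (plus the inputs and output) to gradients for the inputs:

```python
import numpy as np

LN2 = 0.6931471805599453  # ln(2)

# Gradient rules for a few primitives: (inputs, output, g) -> input gradients
def grad_add(a, b, out, g): return g, g
def grad_mul(a, b, out, g): return g * b, g * a
def grad_recip(a, out, g):  return -g * out * out             # d(1/a)/da = -1/a^2
def grad_exp2(a, out, g):   return g * out * LN2              # d(2^a)/da = 2^a * ln 2
def grad_sqrt(a, out, g):   return g / (2.0 * out)            # d(sqrt a)/da = 1/(2 sqrt a)
def grad_max(a, b, out, g): return g * (a >= b), g * (b > a)  # subgradient; ties go to a

# Quick finite-difference check on the exp2 rule
a, g, eps = 1.5, 1.0, 1e-6
out = 2.0 ** a
numeric = (2.0 ** (a + eps) - 2.0 ** (a - eps)) / (2 * eps)
print(np.isclose(grad_exp2(a, out, g), numeric))  # True
```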


Exercises

  1. Implement batch normalization from primitives: Use only REDUCE_ADD (mean), SUB, MUL, SQRT, RECIP. Verify against np.mean, np.var.

  2. Implement attention from primitives: Self-attention is just: matmul(Q, K^T) → scale → softmax → matmul(weights, V). Write it using only the 22 primitives.

  3. Count primitives: Take a real model (e.g., ResNet-18) and count how many of each primitive op it uses. Which op dominates? (Spoiler: MUL and ADD, because of matmul.)

  4. Missing ops: Try implementing argmax from the 22 primitives. (Hint: use CMPEQ with the max value to create a mask, then multiply by indices.)

  5. Explore tinygrad’s ops: Run grep -r "class.*Ops" tinygrad/uop/ops.py to see the actual enum. Compare with our list.


Next: 27 - Tensor Internals Strides and Views — how tensors live in memory.