How DL Frameworks Work
Goal: Understand the full pipeline from `(a @ b).relu().backward()` to GPU execution. What happens inside PyTorch/tinygrad when you write tensor code? Lazy evaluation, scheduling, kernel fusion, code generation.
Prerequisites: 15 - Build an Autograd Engine, Backpropagation, 05 - Neural Network from Scratch
The Big Picture
When you write:

```python
c = (a @ b + bias).relu()
loss = c.sum()
loss.backward()
```

The framework does NOT compute each operation immediately. It:
- Builds a computation graph (lazy)
- Schedules which operations to run and in what order
- Fuses multiple operations into single GPU kernels
- Generates device-specific code (CUDA, Metal, etc.)
- Executes on hardware
This tutorial traces each step.
Layer 1: The Computation Graph
Every operation creates a node in a directed acyclic graph (DAG):
```python
import numpy as np

class LazyTensor:
    """A tensor that records operations instead of executing them."""
    _counter = 0

    def __init__(self, shape, op="input", sources=(), data=None):
        LazyTensor._counter += 1
        self.id = LazyTensor._counter
        self.shape = shape
        self.op = op
        self.sources = sources
        self.data = data  # only set after .realize()

    def __repr__(self):
        return f"LazyTensor(id={self.id}, shape={self.shape}, op={self.op})"

    def __add__(self, other):
        assert self.shape == other.shape, "shapes must match (ignoring broadcasting for now)"
        return LazyTensor(self.shape, op="add", sources=(self, other))

    def __matmul__(self, other):
        assert self.shape[-1] == other.shape[0]
        out_shape = (*self.shape[:-1], other.shape[-1])
        return LazyTensor(out_shape, op="matmul", sources=(self, other))

    def relu(self):
        return LazyTensor(self.shape, op="relu", sources=(self,))

    def sum(self):
        return LazyTensor((), op="sum", sources=(self,))

# Build graph — NO computation happens
a = LazyTensor((4, 3), op="input")
b = LazyTensor((3, 2), op="input")
bias = LazyTensor((4, 2), op="input")
c = (a @ b + bias).relu()
loss = c.sum()

print(f"Created graph with {LazyTensor._counter} nodes")
print(f"loss: {loss}")
print(f"loss depends on: {[s.op for s in loss.sources]}")
```

Why lazy?
Eager execution (compute immediately) can’t optimize across operations. Lazy execution sees the full graph and can:
- Fuse `add + relu` into one kernel (one memory pass instead of two)
- Eliminate unused computations
- Reorder operations for better memory access
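The difference is easy to see side by side. Below is a minimal sketch (the class names `EagerTensor` and `DeferredTensor` are illustrative, not from any framework): the eager version allocates and computes on every operation, while the lazy version only records what to do.

```python
import numpy as np

class EagerTensor:
    """Eager: every operation allocates memory and computes immediately."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float32)
    def __add__(self, other):
        return EagerTensor(self.data + other.data)    # allocates and computes now
    def relu(self):
        return EagerTensor(np.maximum(self.data, 0))  # a second full memory pass

class DeferredTensor:
    """Lazy: operations only record themselves; nothing is allocated."""
    def __init__(self, op, sources=()):
        self.op, self.sources, self.data = op, sources, None
    def __add__(self, other):
        return DeferredTensor("add", (self, other))
    def relu(self):
        return DeferredTensor("relu", (self,))

e = EagerTensor([1.0, -2.0]) + EagerTensor([3.0, 4.0])
print(e.relu().data)  # [4. 2.] — already computed, in two separate passes

d = (DeferredTensor("input") + DeferredTensor("input")).relu()
print(d.op, d.data)   # relu None — nothing has been computed yet
```

Because `d` carries the whole expression, an optimizer can still decide to fuse the add and the relu into one pass; `e` never had that chance.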
Layer 2: Topological Sort & Scheduling
Before execution, we need to determine the order:
```python
def toposort(tensor, visited=None, order=None):
    """Topological sort — dependencies before dependents."""
    if visited is None:
        visited, order = set(), []
    if tensor.id in visited:
        return order
    visited.add(tensor.id)
    for src in tensor.sources:
        toposort(src, visited, order)
    order.append(tensor)
    return order

schedule = toposort(loss)
print("Execution schedule:")
for i, node in enumerate(schedule):
    src_ids = [s.id for s in node.sources]
    print(f"  {i}: [{node.op}] id={node.id} shape={node.shape} ← {src_ids}")
```

The schedule guarantees: every tensor is computed before anything that depends on it.
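That guarantee means the schedule alone is enough to compute the answer. Here is a minimal NumPy interpreter for it — a sketch with a compact `Node`/`toposort` pair restated so the block runs on its own (mirroring the `LazyTensor` and `toposort` above):

```python
import numpy as np

class Node:
    """Compact stand-in for LazyTensor: op, sources, and a data slot."""
    def __init__(self, op, sources=(), data=None):
        self.op, self.sources, self.data = op, sources, data

def toposort(t, visited=None, order=None):
    if visited is None:
        visited, order = set(), []
    if id(t) in visited:
        return order
    visited.add(id(t))
    for s in t.sources:
        toposort(s, visited, order)
    order.append(t)
    return order

def realize(output):
    """Walk the schedule in order, computing each node with NumPy."""
    impls = {
        "add":    lambda a, b: a + b,
        "matmul": lambda a, b: a @ b,
        "relu":   lambda a: np.maximum(a, 0),
        "sum":    lambda a: a.sum(),
    }
    for node in toposort(output):
        if node.op != "input":
            node.data = impls[node.op](*(s.data for s in node.sources))
    return output.data

a = Node("input", data=np.array([[1., 2.], [3., 4.]]))
b = Node("input", data=np.array([[1., 0.], [0., 1.]]))
bias = Node("input", data=np.array([[0., -10.], [0., 0.]]))
out = Node("sum", (Node("relu", (Node("add", (Node("matmul", (a, b)), bias)),)),))
print(realize(out))  # 8.0
```

This is exactly what an eager framework does for every op; a lazy one runs the same walk, but only after fusion and codegen have rewritten the schedule.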
Layer 3: Kernel Fusion
The key optimization. Without fusion:
```
Kernel 1: c_temp = a @ b          # read a,b from memory, write c_temp
Kernel 2: c_bias = c_temp + bias  # read c_temp,bias from memory, write c_bias
Kernel 3: c = relu(c_bias)        # read c_bias from memory, write c
```
That’s 6 memory reads and 3 memory writes. GPU memory bandwidth is the bottleneck.
With fusion:
```
Kernel 1: c = relu(a @ b + bias)  # read a,b,bias from memory, write c
```
3 reads, 1 write. The intermediate values stay in fast GPU registers.
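The arithmetic above can be turned into a tiny traffic calculator — a sketch (the `memory_traffic_bytes` helper is made up for illustration) that treats each kernel as reading every input array once and writing its output once, matching the read/write counts in the text:

```python
def memory_traffic_bytes(n, kernels, dtype_bytes=4):
    """Estimate memory traffic for n-element float tensors.
    `kernels` is a list of (num_input_arrays, num_output_arrays) pairs;
    each array is streamed through memory exactly once per kernel."""
    passes = sum(ins + outs for ins, outs in kernels)
    return passes * n * dtype_bytes

n = 1 << 20  # 1M elements
unfused = memory_traffic_bytes(n, [(2, 1), (2, 1), (2, 1)])  # matmul, add, relu
fused   = memory_traffic_bytes(n, [(3, 1)])                  # one fused kernel
print(unfused / fused)  # 2.25 — 9 memory passes vs 4
```

On a bandwidth-bound GPU, that ratio translates almost directly into wall-clock speedup for element-wise chains.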
```python
def fuse_elementwise(schedule):
    """Simple fusion: merge consecutive element-wise ops into one kernel."""
    fused = []
    current_kernel = []
    elementwise_ops = {"add", "relu", "mul", "sub", "neg"}
    for node in schedule:
        if node.op in elementwise_ops and current_kernel:
            current_kernel.append(node)
        else:
            if current_kernel:
                fused.append(("fused_kernel", current_kernel))
                current_kernel = []
            if node.op in elementwise_ops:
                current_kernel = [node]
            else:
                fused.append(("single", [node]))
    if current_kernel:
        fused.append(("fused_kernel", current_kernel))
    return fused

fused_schedule = fuse_elementwise(schedule)
print("\nFused schedule:")
for kernel_type, nodes in fused_schedule:
    ops = [n.op for n in nodes]
    print(f"  {kernel_type}: {ops}")
```

What tinygrad does
tinygrad takes this further — it fuses operations across reductions and reshapes when possible, generating one GPU kernel for complex expressions. The compiler analyzes the full UOp DAG (their intermediate representation) and decides which operations can share a kernel.
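To see why fusing across a reduction pays off, here is a sketch of what such a fused kernel can look like — hand-written C emitted from Python for illustration, not tinygrad's actual output — for `sum(relu(a + b))` in a single loop:

```python
def generate_reduce_kernel():
    """Sketch: fuse element-wise ops INTO a reduction.
    sum(relu(a + b)) becomes one loop with an accumulator; the
    intermediate relu(a + b) array is never written to memory."""
    return """\
float kernel(float* a, float* b, int n) {
  float acc = 0.0f;
  for (int i = 0; i < n; i++) {
    float t = a[i] + b[i];    // element-wise result stays in a register
    acc += t > 0 ? t : 0.0f;  // relu and the sum happen in the same pass
  }
  return acc;
}"""

print(generate_reduce_kernel())
```

Without this, the reduction would force a kernel boundary: one kernel to materialize `relu(a + b)`, another to sum it — an extra full write and read of the whole tensor.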
Layer 4: Code Generation
After fusion, the fused kernels must become actual GPU code:
```python
def generate_kernel_code(kernel_ops, backend="c"):
    """Generate C-like pseudocode for a fused kernel."""
    code_lines = []
    code_lines.append(f"// Fused kernel: {' → '.join(o.op for o in kernel_ops)}")
    if backend == "c":
        # For a simple element-wise kernel
        code_lines.append("void kernel(float* out, float* in0, float* in1, int n) {")
        code_lines.append("  for (int i = 0; i < n; i++) {")
        var = "in0[i]"
        for op in kernel_ops:
            if op.op == "add":
                var = f"({var} + in1[i])"
            elif op.op == "relu":
                var = f"max(0.0f, {var})"
            elif op.op == "mul":
                var = f"({var} * in1[i])"
        code_lines.append(f"    out[i] = {var};")
        code_lines.append("  }")
        code_lines.append("}")
    elif backend == "cuda":
        code_lines.append("__global__ void kernel(float* out, float* in0, float* in1, int n) {")
        code_lines.append("  int i = blockIdx.x * blockDim.x + threadIdx.x;")
        code_lines.append("  if (i < n) {")
        var = "in0[i]"
        for op in kernel_ops:
            if op.op == "add":
                var = f"({var} + in1[i])"
            elif op.op == "relu":
                var = f"fmaxf(0.0f, {var})"
        code_lines.append(f"    out[i] = {var};")
        code_lines.append("  }")
        code_lines.append("}")
    return "\n".join(code_lines)

# Example
a_node = LazyTensor((1024,), "input")
b_node = LazyTensor((1024,), "input")
add_node = LazyTensor((1024,), "add", (a_node, b_node))
relu_node = LazyTensor((1024,), "relu", (add_node,))

print("Generated C kernel:")
print(generate_kernel_code([add_node, relu_node], "c"))
print("\nGenerated CUDA kernel:")
print(generate_kernel_code([add_node, relu_node], "cuda"))
```

Real framework code generation
tinygrad’s pipeline has 9 stages: preprocess → optimize → expand → add local buffers → devectorize → decompose → final rewrite → linearize → render. Each stage transforms the UOp graph using pattern matching rules.
PyTorch uses TorchInductor which generates Triton GPU code (which itself compiles to CUDA PTX).
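The "pattern matching rules" idea can be sketched in a few lines. This toy rewriter (not tinygrad's actual pattern-matching API) applies two algebraic rules bottom-up over a small op graph:

```python
class Op:
    """A tiny IR node: an op name, source nodes, and an optional constant value."""
    def __init__(self, name, srcs=(), value=None):
        self.name, self.srcs, self.value = name, srcs, value

def rewrite(node):
    """Rewrite sources first, then try each rule on the current node."""
    node = Op(node.name, tuple(rewrite(s) for s in node.srcs), node.value)
    # Rule 1: x + 0 -> x
    if node.name == "add" and node.srcs[1].name == "const" and node.srcs[1].value == 0:
        return node.srcs[0]
    # Rule 2: neg(neg(x)) -> x
    if node.name == "neg" and node.srcs[0].name == "neg":
        return node.srcs[0].srcs[0]
    return node

x = Op("load")
expr = Op("add", (Op("neg", (Op("neg", (x,)),)), Op("const", value=0)))
print(rewrite(expr).name)  # load — the whole expression collapsed to its input
```

Real compilers run many such passes to a fixed point; each rule is small, but together they eliminate whole subgraphs before any code is generated.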
Layer 5: The Runtime
The generated code gets compiled and executed:
```python
import ctypes, tempfile, os, subprocess

def compile_and_run_c(code, inputs, output_size):
    """Actually compile C code and run it."""
    # Write code to file
    with tempfile.NamedTemporaryFile(suffix='.c', mode='w', delete=False) as f:
        f.write(code)
        c_path = f.name
    so_path = c_path.replace('.c', '.so')
    # Compile
    subprocess.run(['gcc', '-O2', '-shared', '-fPIC', '-o', so_path, c_path],
                   check=True, capture_output=True)
    # Load and run
    lib = ctypes.CDLL(so_path)
    lib.kernel.argtypes = [ctypes.POINTER(ctypes.c_float)] * (len(inputs) + 1) + [ctypes.c_int]
    lib.kernel.restype = None
    out = (ctypes.c_float * output_size)()
    c_inputs = [(ctypes.c_float * len(inp))(*inp) for inp in inputs]
    lib.kernel(out, *c_inputs, output_size)
    os.unlink(c_path); os.unlink(so_path)
    return list(out)

# Test: fused add + relu
code = """
#include <math.h>
void kernel(float* out, float* a, float* b, int n) {
    for (int i = 0; i < n; i++) {
        float tmp = a[i] + b[i];
        out[i] = tmp > 0 ? tmp : 0;
    }
}
"""
a_data = [1.0, -2.0, 3.0, -4.0]
b_data = [0.5, 3.0, -1.0, 5.0]
result = compile_and_run_c(code, [a_data, b_data], 4)
print(f"a: {a_data}")
print(f"b: {b_data}")
print(f"relu(a+b): {result}")

# Verify
expected = [max(0, a + b) for a, b in zip(a_data, b_data)]
print(f"expected: {expected}")
```

The Complete Pipeline
```
User writes:  c = (a + b).relu()
        ↓
Graph build:  UOp(RELU, src=[UOp(ADD, src=[a.uop, b.uop])])
        ↓
Schedule:     [load a, load b, fused_add_relu, store c]
        ↓
Codegen:      "out[i] = fmax(0, a[i] + b[i])"
        ↓
Compile:      gcc/nvcc/metal compiler → binary
        ↓
Execute:      binary runs on CPU/GPU
        ↓
Result:       c.data = [1.5, 0.0, 2.0, 1.0]
```
tinygrad’s insight: One IR for everything
Most frameworks have separate representations for:
- Tensor operations (Python level)
- Kernel operations (compiler level)
- Device code (CUDA/Metal level)
tinygrad uses one IR (UOp) for all three. A UOp can represent a high-level reshape OR a low-level GPU thread index. The compiler just transforms UOps through progressive lowering.
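A toy illustration of the one-IR idea: the same `UOp` class represents a tensor-level `ADD` and the loop-level ops it lowers to. The op names here are made up for the sketch — tinygrad's actual UOp opcodes differ.

```python
class UOp:
    """One node type for every level of the IR."""
    def __init__(self, op, srcs=(), arg=None):
        self.op, self.srcs, self.arg = op, srcs, arg
    def __repr__(self):
        return f"UOp({self.op}, arg={self.arg})"

def lower_elementwise(uop):
    """Lower a tensor-level ADD into loop-level UOps — same class throughout."""
    assert uop.op == "ADD"
    idx = UOp("RANGE", arg=("i", 0, "n"))          # the loop index variable
    loads = [UOp("LOAD", (s, idx)) for s in uop.srcs]  # one load per input buffer
    alu = UOp("ALU_ADD", tuple(loads))             # the actual arithmetic
    return UOp("STORE", (alu, idx))                # write the result at index i

hi = UOp("ADD", (UOp("BUFFER", arg="a"), UOp("BUFFER", arg="b")))
lo = lower_elementwise(hi)
print(lo.op, [s.op for s in lo.srcs])  # STORE ['ALU_ADD', 'RANGE']
```

Because both levels share one node type, the same pattern-matching machinery can rewrite high-level graphs and low-level loop bodies alike.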
JIT Compilation: Why Training Loops Are Fast
The first forward pass compiles kernels. Subsequent passes reuse them:
```python
class SimpleJIT:
    """Cache compiled kernels for repeated execution."""
    def __init__(self):
        self.cache = {}

    def run(self, graph_key, compile_fn, *args):
        if graph_key not in self.cache:
            print(f"JIT miss: compiling '{graph_key}'")
            self.cache[graph_key] = compile_fn(*args)
        else:
            print(f"JIT hit: reusing '{graph_key}'")
        return self.cache[graph_key](*args)

jit = SimpleJIT()
# First call: compiles
jit.run("add_relu", lambda *a: lambda *a: "result", 1, 2)
# Second call: reuses
jit.run("add_relu", lambda *a: lambda *a: "result", 3, 4)
```

tinygrad’s `@TinyJit` decorator captures the entire training step and replays the compiled schedule, only swapping input data pointers.
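In `SimpleJIT` the cache key is a hand-written string; a real JIT derives it from the graph itself. Here is a minimal sketch (the `graph_key` helper is hypothetical) that hashes the toposorted (op, shape) sequence, so two structurally identical graphs with different data share one compiled kernel:

```python
import hashlib

class GraphNode:
    """Minimal graph node: op, shape, sources."""
    def __init__(self, op, shape, sources=()):
        self.op, self.shape, self.sources = op, shape, sources

def graph_key(output):
    """Hash the topologically-sorted (op, shape) sequence of the graph."""
    order, seen = [], set()
    def walk(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for s in n.sources:
            walk(s)
        order.append(f"{n.op}{n.shape}")
    walk(output)
    return hashlib.sha256("|".join(order).encode()).hexdigest()[:16]

a, b = GraphNode("input", (4,)), GraphNode("input", (4,))
k1 = graph_key(GraphNode("relu", (4,), (GraphNode("add", (4,), (a, b)),)))
c, d = GraphNode("input", (4,)), GraphNode("input", (4,))
k2 = graph_key(GraphNode("relu", (4,), (GraphNode("add", (4,), (c, d)),)))
print(k1 == k2)  # True — same structure, same key, JIT cache hit
```

Keying on structure rather than identity is what makes a training loop fast: every step builds a fresh graph, but they all hash to the same compiled kernels.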
Exercises
- Trace PyTorch: Set `TORCH_SHOW_DISPATCH_TRACE=1` and run a simple operation. Read the dispatch trace — it shows the same pipeline (dispatch → kernel selection → execution).
- Fusion impact: Time `relu(a + b)` vs doing `c = a + b; d = relu(c)` in PyTorch with large tensors. Measure the memory bandwidth difference.
- Dead code elimination: Add a node to the graph that nothing depends on. Verify that it never appears in the schedule — `toposort` only walks nodes reachable from the output, so the dead node is skipped for free. This is one of the simplest compiler optimizations.
- Explore tinygrad: Clone `github.com/tinygrad/tinygrad` and run `DEBUG=4 python -c "from tinygrad import Tensor; (Tensor([1,2])+Tensor([3,4])).realize()"`. Read the debug output — it shows every stage of the pipeline.
Next: 26 - The Minimal Op Set — all of deep learning reduces to ~25 operations.