How DL Frameworks Work
Goal: Understand the full pipeline from `(a @ b).relu().backward()` to GPU execution. What happens inside PyTorch/tinygrad when you write tensor code? Lazy evaluation, scheduling, kernel fusion, code generation.
Prerequisites: 15 - Build an Autograd Engine, Backpropagation, 05 - Neural Network from Scratch
The Big Picture
When you write:

```python
c = (a @ b + bias).relu()
loss = c.sum()
loss.backward()
```

The framework does NOT compute each operation immediately. It:
- Builds a computation graph (lazy)
- Schedules which operations to run and in what order
- Fuses multiple operations into single GPU kernels
- Generates device-specific code (CUDA, Metal, etc.)
- Executes on hardware
This tutorial traces each step.
Layer 1: The Computation Graph
Every operation creates a node in a directed acyclic graph (DAG):
```python
import numpy as np

class LazyTensor:
    """A tensor that records operations instead of executing them."""
    _counter = 0

    def __init__(self, shape, op="input", sources=(), data=None):
        LazyTensor._counter += 1
        self.id = LazyTensor._counter
        self.shape = shape
        self.op = op
        self.sources = sources
        self.data = data  # only set after .realize()

    def __repr__(self):
        return f"LazyTensor(id={self.id}, shape={self.shape}, op={self.op})"

    def __add__(self, other):
        assert self.shape == other.shape, "shapes must match (ignoring broadcasting for now)"
        return LazyTensor(self.shape, op="add", sources=(self, other))

    def __matmul__(self, other):
        assert self.shape[-1] == other.shape[0]
        out_shape = (*self.shape[:-1], other.shape[-1])
        return LazyTensor(out_shape, op="matmul", sources=(self, other))

    def relu(self):
        return LazyTensor(self.shape, op="relu", sources=(self,))

    def sum(self):
        return LazyTensor((), op="sum", sources=(self,))

# Build graph — NO computation happens
a = LazyTensor((4, 3), op="input")
b = LazyTensor((3, 2), op="input")
bias = LazyTensor((4, 2), op="input")
c = (a @ b + bias).relu()
loss = c.sum()

print(f"Created graph with {LazyTensor._counter} nodes")
print(f"loss: {loss}")
print(f"loss depends on: {[s.op for s in loss.sources]}")
```

Why lazy?
Eager execution (compute immediately) can’t optimize across operations. Lazy execution sees the full graph and can:
- Fuse `add + relu` into one kernel (one memory pass instead of two)
- Eliminate unused computations
- Reorder operations for better memory access
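The difference is easy to see side by side. Below is a minimal sketch (the class names `EagerTensor` and `DeferredTensor` are illustrative, not from any framework): the eager version allocates and computes on every operation, while the lazy version only records what to do.

```python
import numpy as np

class EagerTensor:
    """Eager: every operation allocates memory and computes immediately."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float32)
    def __add__(self, other):
        return EagerTensor(self.data + other.data)    # allocates and computes now
    def relu(self):
        return EagerTensor(np.maximum(self.data, 0))  # a second full memory pass

class DeferredTensor:
    """Lazy: operations only record themselves; nothing is allocated."""
    def __init__(self, op, sources=()):
        self.op, self.sources, self.data = op, sources, None
    def __add__(self, other):
        return DeferredTensor("add", (self, other))
    def relu(self):
        return DeferredTensor("relu", (self,))

e = EagerTensor([1.0, -2.0]) + EagerTensor([3.0, 4.0])
print(e.relu().data)  # [4. 2.] — already computed, in two separate passes

d = (DeferredTensor("input") + DeferredTensor("input")).relu()
print(d.op, d.data)   # relu None — nothing has been computed yet
```

Because `d` carries the whole expression, an optimizer can still decide to fuse the add and the relu into one pass; `e` never had that chance.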
Layer 2: Topological Sort & Scheduling
Before execution, we need to determine the order:
```python
def toposort(tensor, visited=None, order=None):
    """Topological sort — dependencies before dependents."""
    if visited is None:
        visited, order = set(), []
    if tensor.id in visited:
        return order
    visited.add(tensor.id)
    for src in tensor.sources:
        toposort(src, visited, order)
    order.append(tensor)
    return order

schedule = toposort(loss)
print("Execution schedule:")
for i, node in enumerate(schedule):
    src_ids = [s.id for s in node.sources]
    print(f"  {i}: [{node.op}] id={node.id} shape={node.shape} ← {src_ids}")
```

The schedule guarantees: every tensor is computed before anything that depends on it.
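That guarantee means the schedule alone is enough to compute the answer. Here is a minimal NumPy interpreter for it — a sketch with a compact `Node`/`toposort` pair restated so the block runs on its own (mirroring the `LazyTensor` and `toposort` above):

```python
import numpy as np

class Node:
    """Compact stand-in for LazyTensor: op, sources, and a data slot."""
    def __init__(self, op, sources=(), data=None):
        self.op, self.sources, self.data = op, sources, data

def toposort(t, visited=None, order=None):
    if visited is None:
        visited, order = set(), []
    if id(t) in visited:
        return order
    visited.add(id(t))
    for s in t.sources:
        toposort(s, visited, order)
    order.append(t)
    return order

def realize(output):
    """Walk the schedule in order, computing each node with NumPy."""
    impls = {
        "add":    lambda a, b: a + b,
        "matmul": lambda a, b: a @ b,
        "relu":   lambda a: np.maximum(a, 0),
        "sum":    lambda a: a.sum(),
    }
    for node in toposort(output):
        if node.op != "input":
            node.data = impls[node.op](*(s.data for s in node.sources))
    return output.data

a = Node("input", data=np.array([[1., 2.], [3., 4.]]))
b = Node("input", data=np.array([[1., 0.], [0., 1.]]))
bias = Node("input", data=np.array([[0., -10.], [0., 0.]]))
out = Node("sum", (Node("relu", (Node("add", (Node("matmul", (a, b)), bias)),)),))
print(realize(out))  # 8.0
```

This is exactly what an eager framework does for every op; a lazy one runs the same walk, but only after fusion and codegen have rewritten the schedule.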
Layer 3: Kernel Fusion
The key optimization. Without fusion:
```
Kernel 1: c_temp = a @ b          # read a,b from memory, write c_temp
Kernel 2: c_bias = c_temp + bias  # read c_temp,bias from memory, write c_bias
Kernel 3: c = relu(c_bias)        # read c_bias from memory, write c
```
That’s 6 memory reads and 3 memory writes. GPU memory bandwidth is the bottleneck.
With fusion:
```
Kernel 1: c = relu(a @ b + bias)  # read a,b,bias from memory, write c
```
3 reads, 1 write. The intermediate values stay in fast GPU registers.
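The arithmetic above can be turned into a tiny traffic calculator — a sketch (the `memory_traffic_bytes` helper is made up for illustration) that treats each kernel as reading every input array once and writing its output once, matching the read/write counts in the text:

```python
def memory_traffic_bytes(n, kernels, dtype_bytes=4):
    """Estimate memory traffic for n-element float tensors.
    `kernels` is a list of (num_input_arrays, num_output_arrays) pairs;
    each array is streamed through memory exactly once per kernel."""
    passes = sum(ins + outs for ins, outs in kernels)
    return passes * n * dtype_bytes

n = 1 << 20  # 1M elements
unfused = memory_traffic_bytes(n, [(2, 1), (2, 1), (2, 1)])  # matmul, add, relu
fused   = memory_traffic_bytes(n, [(3, 1)])                  # one fused kernel
print(unfused / fused)  # 2.25 — 9 memory passes vs 4
```

On a bandwidth-bound GPU, that ratio translates almost directly into wall-clock speedup for element-wise chains.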
```python
def fuse_elementwise(schedule):
    """Simple fusion: merge consecutive element-wise ops into one kernel."""
    fused = []
    current_kernel = []
    elementwise_ops = {"add", "relu", "mul", "sub", "neg"}
    for node in schedule:
        if node.op in elementwise_ops and current_kernel:
            current_kernel.append(node)
        else:
            if current_kernel:
                fused.append(("fused_kernel", current_kernel))
                current_kernel = []
            if node.op in elementwise_ops:
                current_kernel = [node]
            else:
                fused.append(("single", [node]))
    if current_kernel:
        fused.append(("fused_kernel", current_kernel))
    return fused

fused_schedule = fuse_elementwise(schedule)
print("\nFused schedule:")
for kernel_type, nodes in fused_schedule:
    ops = [n.op for n in nodes]
    print(f"  {kernel_type}: {ops}")
```

What tinygrad does
tinygrad takes this further — it fuses operations across reductions and reshapes when possible, generating one GPU kernel for complex expressions. The compiler analyzes the full UOp DAG (their intermediate representation) and decides which operations can share a kernel.
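To see why fusing across a reduction pays off, here is a sketch of what such a fused kernel can look like — hand-written C emitted from Python for illustration, not tinygrad's actual output — for `sum(relu(a + b))` in a single loop:

```python
def generate_reduce_kernel():
    """Sketch: fuse element-wise ops INTO a reduction.
    sum(relu(a + b)) becomes one loop with an accumulator; the
    intermediate relu(a + b) array is never written to memory."""
    return """\
float kernel(float* a, float* b, int n) {
  float acc = 0.0f;
  for (int i = 0; i < n; i++) {
    float t = a[i] + b[i];    // element-wise result stays in a register
    acc += t > 0 ? t : 0.0f;  // relu and the sum happen in the same pass
  }
  return acc;
}"""

print(generate_reduce_kernel())
```

Without this, the reduction would force a kernel boundary: one kernel to materialize `relu(a + b)`, another to sum it — an extra full write and read of the whole tensor.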
Layer 4: Code Generation
After fusion, the fused kernels must become actual GPU code:
```python
def generate_kernel_code(kernel_ops, backend="c"):
    """Generate C-like pseudocode for a fused kernel."""
    code_lines = []
    code_lines.append(f"// Fused kernel: {' → '.join(o.op for o in kernel_ops)}")
    if backend == "c":
        # For a simple element-wise kernel
        code_lines.append("void kernel(float* out, float* in0, float* in1, int n) {")
        code_lines.append("  for (int i = 0; i < n; i++) {")
        var = "in0[i]"
        for op in kernel_ops:
            if op.op == "add":
                var = f"({var} + in1[i])"
            elif op.op == "relu":
                var = f"max(0.0f, {var})"
            elif op.op == "mul":
                var = f"({var} * in1[i])"
        code_lines.append(f"    out[i] = {var};")
        code_lines.append("  }")
        code_lines.append("}")
    elif backend == "cuda":
        code_lines.append("__global__ void kernel(float* out, float* in0, float* in1, int n) {")
        code_lines.append("  int i = blockIdx.x * blockDim.x + threadIdx.x;")
        code_lines.append("  if (i < n) {")
        var = "in0[i]"
        for op in kernel_ops:
            if op.op == "add":
                var = f"({var} + in1[i])"
            elif op.op == "relu":
                var = f"fmaxf(0.0f, {var})"
        code_lines.append(f"    out[i] = {var};")
        code_lines.append("  }")
        code_lines.append("}")
    return "\n".join(code_lines)

# Example
a_node = LazyTensor((1024,), "input")
b_node = LazyTensor((1024,), "input")
add_node = LazyTensor((1024,), "add", (a_node, b_node))
relu_node = LazyTensor((1024,), "relu", (add_node,))

print("Generated C kernel:")
print(generate_kernel_code([add_node, relu_node], "c"))
print("\nGenerated CUDA kernel:")
print(generate_kernel_code([add_node, relu_node], "cuda"))
```

Real framework code generation
tinygrad’s pipeline has 9 stages: preprocess → optimize → expand → add local buffers → devectorize → decompose → final rewrite → linearize → render. Each stage transforms the UOp graph using pattern matching rules.
PyTorch uses TorchInductor which generates Triton GPU code (which itself compiles to CUDA PTX).
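The "pattern matching rules" idea can be sketched in a few lines. This toy rewriter (not tinygrad's actual pattern-matching API) applies two algebraic rules bottom-up over a small op graph:

```python
class Op:
    """A tiny IR node: an op name, source nodes, and an optional constant value."""
    def __init__(self, name, srcs=(), value=None):
        self.name, self.srcs, self.value = name, srcs, value

def rewrite(node):
    """Rewrite sources first, then try each rule on the current node."""
    node = Op(node.name, tuple(rewrite(s) for s in node.srcs), node.value)
    # Rule 1: x + 0 -> x
    if node.name == "add" and node.srcs[1].name == "const" and node.srcs[1].value == 0:
        return node.srcs[0]
    # Rule 2: neg(neg(x)) -> x
    if node.name == "neg" and node.srcs[0].name == "neg":
        return node.srcs[0].srcs[0]
    return node

x = Op("load")
expr = Op("add", (Op("neg", (Op("neg", (x,)),)), Op("const", value=0)))
print(rewrite(expr).name)  # load — the whole expression collapsed to its input
```

Real compilers run many such passes to a fixed point; each rule is small, but together they eliminate whole subgraphs before any code is generated.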
Layer 5: The Runtime
The generated code gets compiled and executed:
```python
import ctypes, tempfile, os, subprocess

def compile_and_run_c(code, inputs, output_size):
    """Actually compile C code and run it."""
    # Write code to file
    with tempfile.NamedTemporaryFile(suffix='.c', mode='w', delete=False) as f:
        f.write(code)
        c_path = f.name
    so_path = c_path.replace('.c', '.so')
    # Compile
    subprocess.run(['gcc', '-O2', '-shared', '-fPIC', '-o', so_path, c_path],
                   check=True, capture_output=True)
    # Load and run
    lib = ctypes.CDLL(so_path)
    lib.kernel.argtypes = [ctypes.POINTER(ctypes.c_float)] * (len(inputs) + 1) + [ctypes.c_int]
    lib.kernel.restype = None
    out = (ctypes.c_float * output_size)()
    c_inputs = [(ctypes.c_float * len(inp))(*inp) for inp in inputs]
    lib.kernel(out, *c_inputs, output_size)
    os.unlink(c_path); os.unlink(so_path)
    return list(out)

# Test: fused add + relu
code = """
#include <math.h>
void kernel(float* out, float* a, float* b, int n) {
    for (int i = 0; i < n; i++) {
        float tmp = a[i] + b[i];
        out[i] = tmp > 0 ? tmp : 0;
    }
}
"""
a_data = [1.0, -2.0, 3.0, -4.0]
b_data = [0.5, 3.0, -1.0, 5.0]
result = compile_and_run_c(code, [a_data, b_data], 4)
print(f"a: {a_data}")
print(f"b: {b_data}")
print(f"relu(a+b): {result}")

# Verify
expected = [max(0, a + b) for a, b in zip(a_data, b_data)]
print(f"expected: {expected}")
```

The Complete Pipeline
```
User writes:  c = (a + b).relu()
        ↓
Graph build:  UOp(RELU, src=[UOp(ADD, src=[a.uop, b.uop])])
        ↓
Schedule:     [load a, load b, fused_add_relu, store c]
        ↓
Codegen:      "out[i] = fmax(0, a[i] + b[i])"
        ↓
Compile:      gcc/nvcc/metal compiler → binary
        ↓
Execute:      binary runs on CPU/GPU
        ↓
Result:       c.data = [1.5, 0.0, 2.0, 1.0]
```
tinygrad’s insight: One IR for everything
Most frameworks have separate representations for:
- Tensor operations (Python level)
- Kernel operations (compiler level)
- Device code (CUDA/Metal level)
tinygrad uses one IR (UOp) for all three. A UOp can represent a high-level reshape OR a low-level GPU thread index. The compiler just transforms UOps through progressive lowering.
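A toy illustration of the one-IR idea: the same `UOp` class represents a tensor-level `ADD` and the loop-level ops it lowers to. The op names here are made up for the sketch — tinygrad's actual UOp opcodes differ.

```python
class UOp:
    """One node type for every level of the IR."""
    def __init__(self, op, srcs=(), arg=None):
        self.op, self.srcs, self.arg = op, srcs, arg
    def __repr__(self):
        return f"UOp({self.op}, arg={self.arg})"

def lower_elementwise(uop):
    """Lower a tensor-level ADD into loop-level UOps — same class throughout."""
    assert uop.op == "ADD"
    idx = UOp("RANGE", arg=("i", 0, "n"))          # the loop index variable
    loads = [UOp("LOAD", (s, idx)) for s in uop.srcs]  # one load per input buffer
    alu = UOp("ALU_ADD", tuple(loads))             # the actual arithmetic
    return UOp("STORE", (alu, idx))                # write the result at index i

hi = UOp("ADD", (UOp("BUFFER", arg="a"), UOp("BUFFER", arg="b")))
lo = lower_elementwise(hi)
print(lo.op, [s.op for s in lo.srcs])  # STORE ['ALU_ADD', 'RANGE']
```

Because both levels share one node type, the same pattern-matching machinery can rewrite high-level graphs and low-level loop bodies alike.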
JIT Compilation: Why Training Loops Are Fast
The first forward pass compiles kernels. Subsequent passes reuse them:
```python
class SimpleJIT:
    """Cache compiled kernels for repeated execution."""
    def __init__(self):
        self.cache = {}

    def run(self, graph_key, compile_fn, *args):
        if graph_key not in self.cache:
            print(f"JIT miss: compiling '{graph_key}'")
            self.cache[graph_key] = compile_fn(*args)
        else:
            print(f"JIT hit: reusing '{graph_key}'")
        return self.cache[graph_key](*args)

jit = SimpleJIT()
# First call: compiles
jit.run("add_relu", lambda *a: lambda *a: "result", 1, 2)
# Second call: reuses
jit.run("add_relu", lambda *a: lambda *a: "result", 3, 4)
```

tinygrad’s `@TinyJit` decorator captures the entire training step and replays the compiled schedule, only swapping input data pointers.
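In `SimpleJIT` the cache key is a hand-written string; a real JIT derives it from the graph itself. Here is a minimal sketch (the `graph_key` helper is hypothetical) that hashes the toposorted (op, shape) sequence, so two structurally identical graphs with different data share one compiled kernel:

```python
import hashlib

class GraphNode:
    """Minimal graph node: op, shape, sources."""
    def __init__(self, op, shape, sources=()):
        self.op, self.shape, self.sources = op, shape, sources

def graph_key(output):
    """Hash the topologically-sorted (op, shape) sequence of the graph."""
    order, seen = [], set()
    def walk(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for s in n.sources:
            walk(s)
        order.append(f"{n.op}{n.shape}")
    walk(output)
    return hashlib.sha256("|".join(order).encode()).hexdigest()[:16]

a, b = GraphNode("input", (4,)), GraphNode("input", (4,))
k1 = graph_key(GraphNode("relu", (4,), (GraphNode("add", (4,), (a, b)),)))
c, d = GraphNode("input", (4,)), GraphNode("input", (4,))
k2 = graph_key(GraphNode("relu", (4,), (GraphNode("add", (4,), (c, d)),)))
print(k1 == k2)  # True — same structure, same key, JIT cache hit
```

Keying on structure rather than identity is what makes a training loop fast: every step builds a fresh graph, but they all hash to the same compiled kernels.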
Exercises
- Trace PyTorch: Set `TORCH_SHOW_DISPATCH_TRACE=1` and run a simple operation. Read the dispatch trace — it shows the same pipeline (dispatch → kernel selection → execution).
- Fusion impact: Time `relu(a + b)` vs doing `c = a + b; d = relu(c)` in PyTorch with large tensors. Measure the memory bandwidth difference.
- Dead code elimination: Add a node to the graph that nothing depends on. Verify that it never appears in the schedule — `toposort` only walks nodes reachable from the output, so the dead node is skipped for free. This is one of the simplest compiler optimizations.
- Explore tinygrad: Clone `github.com/tinygrad/tinygrad` and run `DEBUG=4 python -c "from tinygrad import Tensor; (Tensor([1,2])+Tensor([3,4])).realize()"`. Read the debug output — it shows every stage of the pipeline.
Next: 26 - The Minimal Op Set — all of deep learning reduces to ~25 operations.