Build a Tensor Library

Goal: Build a tiny tensor library with broadcasting, reductions, and autograd on arrays (not scalars). This bridges micrograd (tutorial 15) to real frameworks like PyTorch and tinygrad, and is the natural next step after tutorial 27 - Tensor Internals: Strides and Views.

Prerequisites: 15 - Build an Autograd Engine, 27 - Tensor Internals Strides and Views, NumPy Essentials


From Scalars to Tensors

Tutorial 15 built autograd on individual numbers. Real networks operate on arrays. The challenge: gradients of array operations involve shapes, broadcasting, and reductions.

import numpy as np
 
class Tensor:
    def __init__(self, data, _children=(), _op='', requires_grad=False):
        self.data = np.array(data, dtype=np.float32)
        self.grad = np.zeros_like(self.data)
        self.requires_grad = requires_grad
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op
 
    @property
    def shape(self): return self.data.shape
 
    def __repr__(self):
        return f"Tensor(shape={self.shape}, op={self._op})"

Broadcasting: The Hard Part

When shapes don’t match, NumPy broadcasts. The backward pass must undo the broadcast by summing over the broadcast dimensions:

def _unbroadcast(grad, target_shape):
    """Reverse broadcasting: sum over dimensions that were broadcast."""
    # Add leading dimensions if needed
    while len(grad.shape) > len(target_shape):
        grad = grad.sum(axis=0)
    # Sum over broadcast dimensions (where target has size 1)
    for i, (g, t) in enumerate(zip(grad.shape, target_shape)):
        if t == 1 and g != 1:
            grad = grad.sum(axis=i, keepdims=True)
    return grad

Why? If a has shape (3, 1) and gets broadcast to (3, 4), the gradient from all 4 columns must flow back to the single column.


Element-wise Operations

def add(self, other):
    other = other if isinstance(other, Tensor) else Tensor(other)
    out = Tensor(self.data + other.data, (self, other), '+')
 
    def _backward():
        if self.requires_grad:
            self.grad += _unbroadcast(out.grad, self.shape)
        if other.requires_grad:
            other.grad += _unbroadcast(out.grad, other.shape)
    out._backward = _backward
    return out
Tensor.__add__ = add
Tensor.__radd__ = lambda self, other: Tensor(other).__add__(self)
 
def mul(self, other):
    other = other if isinstance(other, Tensor) else Tensor(other)
    out = Tensor(self.data * other.data, (self, other), '*')
 
    def _backward():
        if self.requires_grad:
            self.grad += _unbroadcast(other.data * out.grad, self.shape)
        if other.requires_grad:
            other.grad += _unbroadcast(self.data * out.grad, other.shape)
    out._backward = _backward
    return out
Tensor.__mul__ = mul
Tensor.__rmul__ = lambda self, other: Tensor(other).__mul__(self)
 
def neg(self):
    return self * (-1)
Tensor.__neg__ = neg
Tensor.__sub__ = lambda self, other: self + (-(other if isinstance(other, Tensor) else Tensor(other)))
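The mul backward rule can be sanity-checked numerically without the Tensor class at all. A standalone NumPy sketch: for a scalar loss L = (a * b).sum() with a broadcast from (3, 1) to (3, 4), the analytic gradient (b times the upstream gradient, summed back over the broadcast axis, exactly what _unbroadcast computes) should match central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(3, 1))
b = rng.normal(size=(3, 4))

def L(a):
    return (a * b).sum()  # scalar loss, so dL/dout is all ones

# Analytic: b * out.grad, summed over the axis that was broadcast
analytic = (b * np.ones((3, 4))).sum(axis=1, keepdims=True)

# Central finite differences, one entry of a at a time
eps = 1e-6
numeric = np.zeros_like(a)
for i in range(a.shape[0]):
    ap, am = a.copy(), a.copy()
    ap[i, 0] += eps
    am[i, 0] -= eps
    numeric[i, 0] = (L(ap) - L(am)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))  # True
```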

Reduction Operations

Reductions change shape — the backward must expand the gradient back:

def sum(self, axis=None, keepdims=False):
    out = Tensor(self.data.sum(axis=axis, keepdims=keepdims), (self,), 'sum')
 
    def _backward():
        if self.requires_grad:
            grad = out.grad
            if axis is not None and not keepdims:
                grad = np.expand_dims(grad, axis)
            self.grad += np.broadcast_to(grad, self.shape)
    out._backward = _backward
    return out
Tensor.sum = sum
 
def mean(self, axis=None, keepdims=False):
    s = self.sum(axis=axis, keepdims=keepdims)
    # Number of elements averaged over
    if axis is None:
        n = self.data.size
    elif isinstance(axis, int):
        n = self.data.shape[axis]
    else:
        n = np.prod([self.data.shape[a] for a in axis])
    return s * (1.0 / n)
Tensor.mean = mean

Why sum's backward is a broadcast

If s = sum([a, b, c]) = a + b + c, then ds/da = ds/db = ds/dc = 1.

The scalar gradient fans out to all elements.
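The fan-out can be seen in plain NumPy (a standalone sketch, independent of the Tensor class): the upstream scalar gradient is broadcast over the input's shape, and mean just scales it by 1/n via the chain rule.

```python
import numpy as np

x = np.array([[1., 2., 3.],
              [4., 5., 6.]])

# s = x.sum(): ds/dx[i,j] = 1 for every element, so the upstream
# scalar gradient (1.0) fans out via broadcast_to.
grad_sum = np.broadcast_to(np.ones(()), x.shape)

# m = x.mean() = x.sum() * (1/n): the chain rule scales the fan-out by 1/n.
grad_mean = grad_sum / x.size

print(grad_sum)   # all ones, shape (2, 3)
print(grad_mean)  # all 1/6
```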


Matrix Multiply

def matmul(self, other):
    out = Tensor(self.data @ other.data, (self, other), '@')
 
    def _backward():
        if self.requires_grad:
            self.grad += out.grad @ other.data.T
        if other.requires_grad:
            other.grad += self.data.T @ out.grad
    out._backward = _backward
    return out
Tensor.__matmul__ = matmul

The gradient of matmul: if C = A @ B, then dA = dC @ B^T and dB = A^T @ dC.
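These formulas can be checked both by shape and by finite differences, in a standalone NumPy sketch (not part of the library):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))

def L(A, B):
    return (A @ B).sum()  # scalar loss, so dL/dC is all ones

dC = np.ones((2, 4))
dA = dC @ B.T   # (2, 4) @ (4, 3) -> (2, 3), matches A's shape
dB = A.T @ dC   # (3, 2) @ (2, 4) -> (3, 4), matches B's shape

# Finite-difference check on one entry of A
eps = 1e-6
Ap, Am = A.copy(), A.copy()
Ap[0, 1] += eps
Am[0, 1] -= eps
numeric = (L(Ap, B) - L(Am, B)) / (2 * eps)
print(np.isclose(dA[0, 1], numeric, atol=1e-4))  # True
```

Note how the shapes alone almost force the rule: dC @ B.T is the only way to combine dC and B into something A-shaped.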


Activations

def relu(self):
    out = Tensor(np.maximum(0, self.data), (self,), 'relu')
 
    def _backward():
        if self.requires_grad:
            self.grad += (self.data > 0) * out.grad
    out._backward = _backward
    return out
Tensor.relu = relu
 
def exp(self):
    out = Tensor(np.exp(self.data), (self,), 'exp')
 
    def _backward():
        if self.requires_grad:
            self.grad += out.data * out.grad  # d/dx e^x = e^x
    out._backward = _backward
    return out
Tensor.exp = exp
 
def log(self):
    out = Tensor(np.log(self.data + 1e-15), (self,), 'log')  # small epsilon avoids log(0)
 
    def _backward():
        if self.requires_grad:
            self.grad += out.grad / (self.data + 1e-15)
    out._backward = _backward
    return out
Tensor.log = log
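The derivative rules hard-coded in exp and log backward (d/dx e^x = e^x, d/dx log x = 1/x) can be verified with central finite differences in plain NumPy:

```python
import numpy as np

x = np.array([0.5, 1.0, 2.0])
eps = 1e-6

# d/dx e^x = e^x
num_exp = (np.exp(x + eps) - np.exp(x - eps)) / (2 * eps)
print(np.allclose(num_exp, np.exp(x), atol=1e-4))  # True

# d/dx log(x) = 1/x
num_log = (np.log(x + eps) - np.log(x - eps)) / (2 * eps)
print(np.allclose(num_log, 1.0 / x, atol=1e-4))  # True
```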

Reshape and Transpose (Views)

def reshape(self, *shape):
    out = Tensor(self.data.reshape(*shape), (self,), 'reshape')
 
    def _backward():
        if self.requires_grad:
            self.grad += out.grad.reshape(self.shape)
    out._backward = _backward
    return out
Tensor.reshape = reshape
 
def transpose(self, *axes):
    axes = axes if axes else tuple(reversed(range(len(self.shape))))
    out = Tensor(self.data.transpose(axes), (self,), 'T')
 
    def _backward():
        if self.requires_grad:
            self.grad += out.grad.transpose(np.argsort(axes))
    out._backward = _backward
    return out
Tensor.transpose = transpose
Tensor.T = property(lambda self: self.transpose())
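The np.argsort(axes) in the transpose backward computes the inverse permutation: transposing by it undoes the forward transpose, which is exactly what routing the gradient back requires. A small standalone check:

```python
import numpy as np

x = np.arange(24).reshape(2, 3, 4)
axes = (2, 0, 1)
y = x.transpose(axes)    # shape (4, 2, 3)

# argsort of a permutation is its inverse permutation
inv = np.argsort(axes)
print(inv.tolist())      # [1, 2, 0]

# transposing by the inverse recovers the original layout
print(np.array_equal(y.transpose(inv), x))  # True
```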

The Backward Pass

Same topological sort as micrograd, but now on tensors:

def backward(self):
    topo = []
    visited = set()
    def build_topo(v):
        if id(v) not in visited:
            visited.add(id(v))
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)
 
    self.grad = np.ones_like(self.data)
    for v in reversed(topo):
        v._backward()
Tensor.backward = backward
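One practical caveat: the recursive build_topo can hit Python's recursion limit (around 1000 frames by default) on deep graphs, such as a long chain of ops in an unrolled loop. A possible iterative drop-in, sketched here against a minimal stub node that only has the ._prev interface Tensor exposes:

```python
class Node:
    """Minimal stub with the same ._prev interface as Tensor."""
    def __init__(self, prev=()):
        self._prev = set(prev)

def build_topo_iterative(root):
    # Explicit-stack post-order DFS: a node is appended to topo only
    # after all of its children have been emitted.
    topo, visited = [], set()
    stack = [(root, False)]
    while stack:
        node, children_done = stack.pop()
        if children_done:
            topo.append(node)
        elif id(node) not in visited:
            visited.add(id(node))
            stack.append((node, True))
            for child in node._prev:
                stack.append((child, False))
    return topo

# A chain far deeper than the default recursion limit
n = Node()
for _ in range(5000):
    n = Node((n,))
order = build_topo_iterative(n)
print(len(order))      # 5001
print(order[-1] is n)  # True: the root is emitted last
```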

Test: Train a Neural Network

np.random.seed(42)
 
# Data: XOR
X = Tensor(np.array([[0,0],[0,1],[1,0],[1,1]], dtype=np.float32), requires_grad=False)
y = Tensor(np.array([[0],[1],[1],[0]], dtype=np.float32), requires_grad=False)
 
# Weights
W1 = Tensor(np.random.randn(2, 8).astype(np.float32) * 0.5, requires_grad=True)
b1 = Tensor(np.zeros((1, 8), dtype=np.float32), requires_grad=True)
W2 = Tensor(np.random.randn(8, 1).astype(np.float32) * 0.5, requires_grad=True)
b2 = Tensor(np.zeros((1, 1), dtype=np.float32), requires_grad=True)
 
losses = []
for step in range(1000):
    # Forward
    h = (X @ W1 + b1).relu()
    pred = h @ W2 + b2
    diff = pred - y
    loss = (diff * diff).mean()
    losses.append(loss.data.item())
 
    # Backward
    for p in [W1, b1, W2, b2]:
        p.grad = np.zeros_like(p.data)
    loss.backward()
 
    # Update
    lr = 0.1
    for p in [W1, b1, W2, b2]:
        p.data -= lr * p.grad
 
    if step % 200 == 0:
        preds = ((X @ W1 + b1).relu() @ W2 + b2).data.round(2)
        print(f"Step {step:4d} | Loss: {loss.data.item():.4f} | Preds: {preds.ravel()}")
 
import matplotlib.pyplot as plt
plt.plot(losses)
plt.xlabel("Step"); plt.ylabel("MSE")
plt.title("Tensor library training XOR")
plt.show()

Verify Against PyTorch

import torch

# The training loop updates the weights *after* its last backward pass, so the
# stored grads belong to the previous weights. Recompute our gradients at the
# final weights before comparing:
for p in [W1, b1, W2, b2]:
    p.grad = np.zeros_like(p.data)
our_diff = (X @ W1 + b1).relu() @ W2 + b2 - y
our_loss = (our_diff * our_diff).mean()
our_loss.backward()

X_t = torch.tensor([[0,0],[0,1],[1,0],[1,1]], dtype=torch.float32)
y_t = torch.tensor([[0],[1],[1],[0]], dtype=torch.float32)

W1_t = torch.tensor(W1.data, requires_grad=True)
b1_t = torch.tensor(b1.data, requires_grad=True)
W2_t = torch.tensor(W2.data, requires_grad=True)
b2_t = torch.tensor(b2.data, requires_grad=True)
 
h = torch.relu(X_t @ W1_t + b1_t)
pred = h @ W2_t + b2_t
loss = ((pred - y_t) ** 2).mean()
loss.backward()
 
print("Gradient comparison (ours vs PyTorch):")
for name, ours, theirs in [("W1", W1, W1_t), ("b1", b1, b1_t),
                             ("W2", W2, W2_t), ("b2", b2, b2_t)]:
    match = np.allclose(ours.grad, theirs.grad.numpy(), atol=1e-5)
    print(f"  {name}: match={match}")

What’s Missing (vs Real Frameworks)

Feature           Our Tensor           PyTorch/tinygrad
Lazy execution    No (eager)           Yes (graph + schedule)
GPU support       No (NumPy only)      Yes (CUDA, Metal, etc.)
Kernel fusion     No                   Yes (compiler)
Memory views      Partial (reshape)    Full (strides, ShapeTracker)
Dynamic shapes    No                   tinygrad: symbolic shapes
Operator count    ~10                  ~25 primitives
Autograd          Tape-based           tinygrad: pattern matching

Our library is ~100 lines for the same semantics. The remaining 10K+ lines in real frameworks handle performance (GPU, fusion, memory) and edge cases.


Exercises

  1. Add softmax + cross-entropy: Implement softmax and cross_entropy_loss and train a 10-class classifier on sklearn’s digits dataset.

  2. Convolution: Implement conv2d as im2col → matmul → reshape. The backward is matmul backward → col2im.

  3. Adam optimizer: Implement Adam with momentum and RMSProp state tracking. Our Tensor class stores .data which can be updated in-place.

  4. Batch dimension: Our library already handles batches via broadcasting. Verify: set X to shape (32, 784) (a batch of MNIST images) and run a forward pass.

  5. Compare with tinygrad: Install tinygrad (pip install tinygrad) and implement the same XOR network. Compare the API. tinygrad’s Tensor has the same interface but compiles to GPU kernels.


This is the endpoint of the “build it from scratch” track. You’ve now built: scalars with autograd (micrograd, tutorial 15), understood memory layout (tutorial 27), and built tensor operations with broadcasting and backprop. Real frameworks are this — just faster, on GPUs, with more ops.