Build a Tensor Library
Goal: Build a tiny tensor library with broadcasting, reductions, and autograd on arrays (not scalars). This bridges micrograd (tutorial 15) to real frameworks like PyTorch/tinygrad, and is the natural next step after tutorial 27 - Tensor Internals: Strides and Views.
Prerequisites: 15 - Build an Autograd Engine, 27 - Tensor Internals Strides and Views, NumPy Essentials
From Scalars to Tensors
Tutorial 15 built autograd on individual numbers. Real networks operate on arrays. The challenge: gradients of array operations involve shapes, broadcasting, and reductions.
```python
import numpy as np

class Tensor:
    def __init__(self, data, _children=(), _op='', requires_grad=False):
        self.data = np.array(data, dtype=np.float32)
        self.grad = np.zeros_like(self.data)
        self.requires_grad = requires_grad
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    @property
    def shape(self): return self.data.shape

    def __repr__(self):
        return f"Tensor(shape={self.shape}, op={self._op})"
```

Broadcasting: The Hard Part
When shapes don’t match, NumPy broadcasts. The backward pass must undo the broadcast by summing over the broadcast dimensions:
```python
def _unbroadcast(grad, target_shape):
    """Reverse broadcasting: sum over dimensions that were broadcast."""
    # Drop leading dimensions added by broadcasting
    while len(grad.shape) > len(target_shape):
        grad = grad.sum(axis=0)
    # Sum over broadcast dimensions (where the target has size 1)
    for i, (g, t) in enumerate(zip(grad.shape, target_shape)):
        if t == 1 and g != 1:
            grad = grad.sum(axis=i, keepdims=True)
    return grad
```

Why? If `a` has shape (3, 1) and gets broadcast to (3, 4), the gradient from all 4 columns must flow back to the single column.
Element-wise Operations
```python
def add(self, other):
    other = other if isinstance(other, Tensor) else Tensor(other)
    out = Tensor(self.data + other.data, (self, other), '+')
    def _backward():
        if self.requires_grad:
            self.grad += _unbroadcast(out.grad, self.shape)
        if other.requires_grad:
            other.grad += _unbroadcast(out.grad, other.shape)
    out._backward = _backward
    return out
Tensor.__add__ = add
Tensor.__radd__ = lambda self, other: Tensor(other).__add__(self)

def mul(self, other):
    other = other if isinstance(other, Tensor) else Tensor(other)
    out = Tensor(self.data * other.data, (self, other), '*')
    def _backward():
        if self.requires_grad:
            self.grad += _unbroadcast(other.data * out.grad, self.shape)
        if other.requires_grad:
            other.grad += _unbroadcast(self.data * out.grad, other.shape)
    out._backward = _backward
    return out
Tensor.__mul__ = mul
Tensor.__rmul__ = lambda self, other: Tensor(other).__mul__(self)

def neg(self):
    return self * (-1)
Tensor.__neg__ = neg
Tensor.__sub__ = lambda self, other: self + (-other if isinstance(other, Tensor) else Tensor(-np.array(other, dtype=np.float32)))
```

Reduction Operations
Reductions change shape — the backward must expand the gradient back:
```python
def sum(self, axis=None, keepdims=False):
    out = Tensor(self.data.sum(axis=axis, keepdims=keepdims), (self,), 'sum')
    def _backward():
        if self.requires_grad:
            grad = out.grad
            if axis is not None and not keepdims:
                grad = np.expand_dims(grad, axis)
            self.grad += np.broadcast_to(grad, self.shape)
    out._backward = _backward
    return out
Tensor.sum = sum

def mean(self, axis=None, keepdims=False):
    s = self.sum(axis=axis, keepdims=keepdims)
    if axis is None:
        n = self.data.size
    elif isinstance(axis, int):
        n = self.data.shape[axis]
    else:
        n = np.prod([self.data.shape[a] for a in axis])
    return s * (1.0 / n)
Tensor.mean = mean
```

Why sum backward is broadcast
sum([a, b, c]) = s → ds/da = 1, ds/db = 1, ds/dc = 1
The scalar gradient fans out to all elements.
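A quick NumPy sanity check (independent of the Tensor class) confirms the fan-out: the gradient of `mean(x)` with respect to each element is 1/n, and a central finite difference reproduces exactly that.

```python
import numpy as np

# Analytic claim: d(mean(x))/dx_i = 1/n for every element.
x = np.array([1.0, 2.0, 3.0, 4.0])
n = x.size
analytic = np.full_like(x, 1.0 / n)

# Central finite differences on mean()
eps = 1e-5
numeric = np.zeros_like(x)
for i in range(n):
    xp, xm = x.copy(), x.copy()
    xp[i] += eps
    xm[i] -= eps
    numeric[i] = (xp.mean() - xm.mean()) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```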
Matrix Multiply
```python
def matmul(self, other):
    out = Tensor(self.data @ other.data, (self, other), '@')
    def _backward():
        if self.requires_grad:
            self.grad += out.grad @ other.data.T
        if other.requires_grad:
            other.grad += self.data.T @ out.grad
    out._backward = _backward
    return out
Tensor.__matmul__ = matmul
```

The gradient of matmul: if C = A @ B, then dA = dC @ B^T and dB = A^T @ dC.
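These formulas are easy to verify numerically. The following self-contained sketch uses the scalar loss L = sum(A @ B), so the upstream gradient dC is all ones, and checks one entry of dA = dC @ B^T against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))

# Scalar loss L = sum(A @ B), so the upstream gradient dC is all ones.
dC = np.ones((3, 2))
dA_analytic = dC @ B.T   # claimed gradient of L w.r.t. A
dB_analytic = A.T @ dC   # claimed gradient of L w.r.t. B

# Finite-difference check on a single entry of A
eps = 1e-6
Ap, Am = A.copy(), A.copy()
Ap[1, 2] += eps
Am[1, 2] -= eps
numeric = ((Ap @ B).sum() - (Am @ B).sum()) / (2 * eps)
print(np.isclose(numeric, dA_analytic[1, 2], atol=1e-6))  # True
```

Note the shapes work out automatically: dC @ B^T has the shape of A, and A^T @ dC has the shape of B.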
Activations
```python
def relu(self):
    out = Tensor(np.maximum(0, self.data), (self,), 'relu')
    def _backward():
        if self.requires_grad:
            self.grad += (self.data > 0) * out.grad
    out._backward = _backward
    return out
Tensor.relu = relu

def exp(self):
    out = Tensor(np.exp(self.data), (self,), 'exp')
    def _backward():
        if self.requires_grad:
            self.grad += out.data * out.grad  # d/dx e^x = e^x
    out._backward = _backward
    return out
Tensor.exp = exp

def log(self):
    out = Tensor(np.log(self.data + 1e-15), (self,), 'log')
    def _backward():
        if self.requires_grad:
            self.grad += out.grad / (self.data + 1e-15)
    out._backward = _backward
    return out
Tensor.log = log
```

Reshape (Zero-Copy)
```python
def reshape(self, *shape):
    out = Tensor(self.data.reshape(*shape), (self,), 'reshape')
    def _backward():
        if self.requires_grad:
            self.grad += out.grad.reshape(self.shape)
    out._backward = _backward
    return out
Tensor.reshape = reshape

def transpose(self, *axes):
    axes = axes if axes else tuple(reversed(range(len(self.shape))))
    out = Tensor(self.data.transpose(axes), (self,), 'T')
    def _backward():
        if self.requires_grad:
            # argsort(axes) gives the inverse permutation
            self.grad += out.grad.transpose(np.argsort(axes))
    out._backward = _backward
    return out
Tensor.transpose = transpose
Tensor.T = property(lambda self: self.transpose())
```

The Backward Pass
Same topological sort as micrograd, but now on tensors:
```python
def backward(self):
    topo = []
    visited = set()
    def build_topo(v):
        if id(v) not in visited:
            visited.add(id(v))
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)
    self.grad = np.ones_like(self.data)
    for v in reversed(topo):
        v._backward()
Tensor.backward = backward
```

Test: Train a Neural Network
```python
np.random.seed(42)

# Data: XOR
X = Tensor(np.array([[0,0],[0,1],[1,0],[1,1]], dtype=np.float32), requires_grad=False)
y = Tensor(np.array([[0],[1],[1],[0]], dtype=np.float32), requires_grad=False)

# Weights
W1 = Tensor(np.random.randn(2, 8).astype(np.float32) * 0.5, requires_grad=True)
b1 = Tensor(np.zeros((1, 8), dtype=np.float32), requires_grad=True)
W2 = Tensor(np.random.randn(8, 1).astype(np.float32) * 0.5, requires_grad=True)
b2 = Tensor(np.zeros((1, 1), dtype=np.float32), requires_grad=True)

losses = []
for step in range(1000):
    # Forward
    h = (X @ W1 + b1).relu()
    pred = h @ W2 + b2
    diff = pred - y
    loss = (diff * diff).mean()
    losses.append(loss.data.item())
    # Backward
    for p in [W1, b1, W2, b2]:
        p.grad = np.zeros_like(p.data)
    loss.backward()
    # Update
    lr = 0.1
    for p in [W1, b1, W2, b2]:
        p.data -= lr * p.grad
    if step % 200 == 0:
        preds = ((X @ W1 + b1).relu() @ W2 + b2).data.round(2)
        print(f"Step {step:4d} | Loss: {loss.data.item():.4f} | Preds: {preds.ravel()}")

import matplotlib.pyplot as plt
plt.plot(losses)
plt.xlabel("Step"); plt.ylabel("MSE")
plt.title("Tensor library training XOR")
plt.show()
```

Verify Against PyTorch
```python
import torch

# Recompute our gradients at the final weights: after the training loop,
# .grad holds gradients from before the last update, so a fresh forward/backward
# is needed for an apples-to-apples comparison.
for p in [W1, b1, W2, b2]:
    p.grad = np.zeros_like(p.data)
loss = (((X @ W1 + b1).relu() @ W2 + b2 - y) * ((X @ W1 + b1).relu() @ W2 + b2 - y)).mean()
loss.backward()

X_t = torch.tensor([[0,0],[0,1],[1,0],[1,1]], dtype=torch.float32)
y_t = torch.tensor([[0],[1],[1],[0]], dtype=torch.float32)
W1_t = torch.tensor(W1.data, requires_grad=True)
b1_t = torch.tensor(b1.data, requires_grad=True)
W2_t = torch.tensor(W2.data, requires_grad=True)
b2_t = torch.tensor(b2.data, requires_grad=True)

h = torch.relu(X_t @ W1_t + b1_t)
pred = h @ W2_t + b2_t
loss_t = ((pred - y_t) ** 2).mean()
loss_t.backward()

print("Gradient comparison (ours vs PyTorch):")
for name, ours, theirs in [("W1", W1, W1_t), ("b1", b1, b1_t),
                           ("W2", W2, W2_t), ("b2", b2, b2_t)]:
    match = np.allclose(ours.grad, theirs.grad.numpy(), atol=1e-5)
    print(f"  {name}: match={match}")
```

What's Missing (vs Real Frameworks)
| Feature | Our Tensor | PyTorch/tinygrad |
|---|---|---|
| Lazy execution | No (eager) | Yes (graph + schedule) |
| GPU support | No (NumPy only) | Yes (CUDA, Metal, etc.) |
| Kernel fusion | No | Yes (compiler) |
| Memory views | Partial (reshape) | Full (strides, ShapeTracker) |
| Dynamic shapes | No | tinygrad: symbolic shapes |
| Operator count | ~10 | ~25 primitives |
| Autograd | Tape-based | tinygrad: pattern matching |
Our library is ~100 lines for the same semantics. The remaining 10K+ lines in real frameworks handle performance (GPU, fusion, memory) and edge cases.
Exercises
- Add softmax + cross-entropy: Implement `softmax` and `cross_entropy_loss` and train a 10-class classifier on sklearn's digits dataset.
- Convolution: Implement `conv2d` as im2col → matmul → reshape. The backward is matmul backward → col2im.
- Adam optimizer: Implement Adam with momentum and RMSProp state tracking. Our Tensor class stores `.data`, which can be updated in-place.
- Batch dimension: Our library already handles batches via broadcasting. Verify: set `X` to shape (32, 784) (a batch of MNIST images) and run a forward pass.
- Compare with tinygrad: Install tinygrad (`pip install tinygrad`) and implement the same XOR network. Compare the API. tinygrad's `Tensor` has the same interface but compiles to GPU kernels.
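As a starting point for the first exercise, here is a plain-NumPy sketch of a numerically stable softmax and cross-entropy (forward pass only; wiring the backward into the Tensor class is the exercise itself):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # avoid exp overflow
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy_loss(logits, targets):
    """Mean negative log-likelihood of integer class targets."""
    probs = softmax(logits)
    n = logits.shape[0]
    return -np.log(probs[np.arange(n), targets] + 1e-15).mean()

logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
targets = np.array([0, 1])
print(softmax(logits).sum(axis=-1))          # rows sum to 1
print(cross_entropy_loss(logits, targets))   # lower when the target class has the largest logit
```

Subtracting the row-wise max before `exp` leaves the result unchanged (softmax is shift-invariant) but prevents overflow for large logits.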
This is the endpoint of the "build it from scratch" track. You have now built scalar autograd (micrograd, tutorial 15), studied memory layout (tutorial 27), and implemented tensor operations with broadcasting and backprop. Real frameworks are this same design, just faster, on GPUs, with more ops.