Tensor Internals: Strides and Views
Goal: Understand how tensors are stored in memory, what strides are, why reshape/transpose don’t copy data, and what “contiguous” means. This is how tinygrad’s movement ops (RESHAPE, PERMUTE, EXPAND) work under the hood.
Prerequisites: Vectors and Matrices, NumPy Essentials, 26 - The Minimal Op Set
A Tensor Is Just a Flat Array + Metadata
A 2D matrix doesn’t exist in memory as a grid. It’s a flat array of numbers plus metadata that says how to interpret it:
import numpy as np
# This matrix:
# [[1, 2, 3],
# [4, 5, 6]]
#
# Is stored in memory as: [1, 2, 3, 4, 5, 6]
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64)
print(f"Shape: {a.shape}") # (2, 3)
print(f"Strides: {a.strides}") # (24, 8) bytes — 3*8=24 bytes per row, 8 bytes per float64
print(f"Data: {a.ravel()}") # [1, 2, 3, 4, 5, 6]
Strides tell you: “how many bytes to jump to move one step along each dimension.”
- Stride for axis 0 (rows): 24 bytes = 3 elements × 8 bytes/element
- Stride for axis 1 (cols): 8 bytes = 1 element × 8 bytes/element
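Since NumPy reports strides in bytes, dividing by the itemsize recovers strides in elements, which is the unit the rest of this chapter works in. A quick sketch:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64)
# NumPy reports strides in bytes; dividing by itemsize gives strides in elements
elem_strides = tuple(s // a.itemsize for s in a.strides)
print(elem_strides)  # (3, 1)
```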
How Indexing Works
To find element a[i, j] in the flat array:
def manual_index(data_flat, strides_elements, indices):
    """Index into a flat array using strides."""
    offset = sum(i * s for i, s in zip(indices, strides_elements))
    return data_flat[offset]
flat = [1, 2, 3, 4, 5, 6]
strides = [3, 1] # in elements, not bytes
# a[1, 2] = flat[1*3 + 2*1] = flat[5] = 6
print(f"a[1, 2] = {manual_index(flat, strides, (1, 2))}")
# a[0, 1] = flat[0*3 + 1*1] = flat[1] = 2
print(f"a[0, 1] = {manual_index(flat, strides, (0, 1))}")
Reshape: Change Metadata, Not Data
a = np.arange(12) # [0, 1, 2, ..., 11]
print(f"1D: shape={a.shape}, strides={a.strides}")
b = a.reshape(3, 4)
print(f"2D: shape={b.shape}, strides={b.strides}")
c = a.reshape(2, 2, 3)
print(f"3D: shape={c.shape}, strides={c.strides}")
# All share the SAME memory
print(f"Same memory: {np.shares_memory(a, b)} and {np.shares_memory(b, c)}")
a[0] = 999
print(f"After modifying a[0]: b[0, 0] = {b[0, 0]}") # also 999
Reshape just changes shape and recalculates strides. Zero cost.
def compute_strides(shape, itemsize=1):
    """Compute C-contiguous strides for a given shape."""
    strides = []
    stride = itemsize
    for s in reversed(shape):
        strides.append(stride)
        stride *= s
    return tuple(reversed(strides))
print(f"strides for (3, 4): {compute_strides((3, 4))}") # (4, 1)
print(f"strides for (2, 2, 3): {compute_strides((2, 2, 3))}") # (6, 3, 1)
Transpose: Swap Strides, Not Data
a = np.arange(6).reshape(2, 3)
print(f"Original: shape={a.shape}, strides={a.strides}")
# [[0, 1, 2],
# [3, 4, 5]]
b = a.T
print(f"Transposed: shape={b.shape}, strides={b.strides}")
# [[0, 3],
# [1, 4],
# [2, 5]]
print(f"Same memory: {np.shares_memory(a, b)}")
The strides are swapped: (24, 8) becomes (8, 24). Now “move along rows” jumps 1 element (8 bytes) and “move along columns” jumps 3 elements (24 bytes). The data is unchanged.
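The swap can be asserted in one place (a small sketch assuming float64's 8-byte itemsize):

```python
import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)
assert a.strides == (24, 8)      # 8-byte float64: rows jump 24 bytes, cols 8
assert a.T.strides == (8, 24)    # transpose just reverses the stride tuple
assert np.shares_memory(a, a.T)  # still the same buffer, no copy happened
```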
# Prove it: both index the same memory location
print(f"a[1, 2] = {a[1, 2]}") # row 1, col 2 = 5
print(f"b[2, 1] = {b[2, 1]}") # row 2, col 1 = 5 (same element)
# Same offset calculation:
# a[1, 2] → offset = 1*3 + 2*1 = 5
# b[2, 1] → offset = 2*1 + 1*3 = 5 ✓
Contiguous vs Non-Contiguous
A tensor is contiguous if its strides match what a freshly allocated array would have (row-major, C-order):
a = np.arange(6).reshape(2, 3)
print(f"a is contiguous: {a.flags['C_CONTIGUOUS']}") # True
b = a.T
print(f"a.T is contiguous: {b.flags['C_CONTIGUOUS']}") # False!
# To make it contiguous, we must copy the data
c = np.ascontiguousarray(b)
print(f"copy is contiguous: {c.flags['C_CONTIGUOUS']}") # True
print(f"Same memory: {np.shares_memory(b, c)}") # False — had to copy
Why contiguous matters
GPU kernels assume contiguous memory. When you pass a non-contiguous tensor to a kernel, the framework must either:
- Copy to contiguous memory first (tinygrad: CONTIGUOUS op)
- Generate a kernel that handles non-contiguous access (slower)
This is why tinygrad’s ShapeTracker checks if a chain of movement ops results in a contiguous layout. If yes → no copy needed.
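A concrete NumPy example of a movement chain that cannot stay a view: transpose, then flatten. No single (shape, strides) pair can describe column-major traversal of a row-major buffer, so NumPy falls back to a copy:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
t = a.T                  # view: strides swapped, same buffer
flat = t.reshape(6)      # wants [0, 3, 1, 4, 2, 5]: impossible with strides alone
print(np.shares_memory(a, t))     # True  (transpose is free)
print(np.shares_memory(a, flat))  # False (the reshape had to copy)
```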
Expand (Broadcast): Stride of Zero
Broadcasting uses a stride of 0 — “moving” along the broadcast dimension always returns to the same data:
a = np.array([[1], [2], [3]]) # shape (3, 1)
print(f"Before expand: shape={a.shape}, strides={a.strides}")
b = np.broadcast_to(a, (3, 4))
print(f"After expand: shape={b.shape}, strides={b.strides}")
# strides = (8, 0) — stride of 0 in the expanded dimension!
print(b)
# [[1, 1, 1, 1],
# [2, 2, 2, 2],
# [3, 3, 3, 3]]
# NO new memory allocated — the "copies" are virtual
print(f"Same memory: {np.shares_memory(a, b)}")
This is how a bias (1, 128) gets “added” to a batch (32, 128) without creating 32 copies.
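The bias example can be made concrete in NumPy (the shapes here are just the ones from the sentence above):

```python
import numpy as np

bias = np.ones((1, 128))
batch = np.zeros((32, 128))
# broadcast_to makes the 32 "copies" virtual: stride 0 on the batch axis
bview = np.broadcast_to(bias, (32, 128))
print(bview.strides)                  # (0, 8): moving along axis 0 stays in place
print(np.shares_memory(bias, bview))  # True, no extra memory was allocated
out = batch + bias                    # broadcasting does the same thing implicitly
print(out.shape)                      # (32, 128)
```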
Slice (Shrink): Offset the Pointer
a = np.arange(10)
print(f"Original: data={a}, strides={a.strides}")
b = a[3:7] # shrink: start at offset 3, length 4
print(f"Slice: data={b}, strides={b.strides}")
print(f"Same memory: {np.shares_memory(a, b)}")
# b points to the same buffer but starts at element 3
# offset = 3 * stride = 3 * 8 = 24 bytes into the buffer
No copy — just change the starting pointer and shape.
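Slicing with a step works the same way: the step multiplies the stride, and because the result is still a view, writes flow back to the original buffer. A quick sketch:

```python
import numpy as np

a = np.arange(10, dtype=np.float64)
b = a[2:9:2]                   # start at offset 2, step 2
print(b.strides)               # (16,): step 2 doubles the 8-byte stride
print(np.shares_memory(a, b))  # True
b[0] = -1.0                    # writing through the view...
print(a[2])                    # -1.0 ...modifies the original buffer
```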
Flip: Negative Stride
a = np.arange(5)
print(f"Original: data={a}, strides={a.strides}")
b = a[::-1]
print(f"Flipped: data={b}, strides={b.strides}")
# strides = (-8,) — negative! Walking forward in the view walks backward in memory
print(f"Same memory: {np.shares_memory(a, b)}")
Build a Mini ShapeTracker
tinygrad tracks all shape changes symbolically, only copying data when necessary:
class ShapeTracker:
    """Track shape transformations without touching data."""
    def __init__(self, shape):
        self.shape = shape
        self.strides = compute_strides(shape)
        self.offset = 0
        self.ops = []
    def reshape(self, new_shape):
        # NOTE: assumes the current view is contiguous; a real ShapeTracker
        # must also handle reshaping strided views (or force a copy)
        assert np.prod(new_shape) == np.prod(self.shape), "element count must match"
        self.shape = new_shape
        self.strides = compute_strides(new_shape)
        self.ops.append(f"reshape{new_shape}")
        return self
    def permute(self, axes):
        self.shape = tuple(self.shape[a] for a in axes)
        self.strides = tuple(self.strides[a] for a in axes)
        self.ops.append(f"permute{axes}")
        return self
    def expand(self, new_shape):
        new_strides = []
        for old_s, old_st, new_s in zip(self.shape, self.strides, new_shape):
            if old_s == new_s:
                new_strides.append(old_st)
            elif old_s == 1:
                new_strides.append(0)  # broadcast: stride = 0
            else:
                raise ValueError(f"Can't expand {old_s} to {new_s}")
        self.shape = new_shape
        self.strides = tuple(new_strides)
        self.ops.append(f"expand{new_shape}")
        return self
    def is_contiguous(self):
        expected = compute_strides(self.shape)
        return self.strides == expected
    def index(self, *indices):
        """Compute flat offset for given indices."""
        return self.offset + sum(i * s for i, s in zip(indices, self.strides))
    def __repr__(self):
        return (f"ShapeTracker(shape={self.shape}, strides={self.strides}, "
                f"contiguous={self.is_contiguous()}, ops={self.ops})")
# Demo: chain of transformations
st = ShapeTracker((2, 3, 4))
print(st)
st.reshape((6, 4))
print(st)
st.permute((1, 0))
print(st)
# Not contiguous! A copy would be needed here for GPU execution.
print(f"\nElement at [2, 3]: offset = {st.index(2, 3)}")
When Does a Copy Happen?
def needs_copy(tracker):
    """Check if the current view requires a memory copy."""
    if not tracker.is_contiguous():
        return True
    if any(s == 0 for s in tracker.strides):
        return True  # expanded tensors need materialization for writes
    return False
# Case 1: reshape only → no copy
st1 = ShapeTracker((12,))
st1.reshape((3, 4))
print(f"Reshape only: needs_copy={needs_copy(st1)}") # False
# Case 2: reshape + permute → copy needed
st2 = ShapeTracker((3, 4))
st2.permute((1, 0))
print(f"After permute: needs_copy={needs_copy(st2)}") # True
# Case 3: expand → depends on usage
st3 = ShapeTracker((3, 1))
st3.expand((3, 4))
print(f"After expand: needs_copy={needs_copy(st3)}") # True (stride=0)
tinygrad’s allocator (contiguous_mops_to_view) checks if a chain of movement ops collapses to a single contiguous view. If yes, it creates a BUFFER_VIEW (free). If no, it inserts a CONTIGUOUS operation (memory copy).
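The "collapses to a contiguous view" case can be checked with the mini tracker; this sketch restates just enough of it to run standalone (tinygrad's real logic is more general than this):

```python
# Minimal restatement of compute_strides/ShapeTracker from above
def compute_strides(shape, itemsize=1):
    strides, stride = [], itemsize
    for s in reversed(shape):
        strides.append(stride)
        stride *= s
    return tuple(reversed(strides))

class ShapeTracker:
    def __init__(self, shape):
        self.shape, self.strides = shape, compute_strides(shape)
    def permute(self, axes):
        self.shape = tuple(self.shape[a] for a in axes)
        self.strides = tuple(self.strides[a] for a in axes)
        return self
    def is_contiguous(self):
        return self.strides == compute_strides(self.shape)

# Two permutes that undo each other collapse to the identity view:
st = ShapeTracker((3, 4)).permute((1, 0)).permute((1, 0))
print(st.is_contiguous())  # True: the whole op chain needs no copy
```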
Why This Matters for Performance
import time
n = 10_000_000
# Contiguous access — fast (cache-friendly)
a = np.arange(n, dtype=np.float64)
start = time.perf_counter()
_ = a.sum()
t_contig = time.perf_counter() - start
# Non-contiguous access — slower (cache misses)
b = np.arange(n * 2, dtype=np.float64)[::2] # stride = 2
start = time.perf_counter()
_ = b.sum()
t_strided = time.perf_counter() - start
print(f"Contiguous sum: {t_contig*1000:.2f} ms")
print(f"Strided sum: {t_strided*1000:.2f} ms")
print(f"Ratio: {t_strided/t_contig:.1f}x slower")
GPUs make this even more dramatic — non-coalesced memory access can be 10-100x slower.
Exercises
- Row-major vs column-major: Create a Fortran-order array (np.array(..., order='F')). Print its strides. How do they differ from C-order? When does this matter?
- As-strided tricks: Use np.lib.stride_tricks.as_strided to create a sliding window view of a 1D array — shape (n-k+1, k) with overlapping windows. This is how convolution extracts patches.
- Track all movement ops: Extend ShapeTracker with shrink(ranges), pad(padding), and flip(axes). Implement flip as negative strides.
- Benchmark: Create a large matrix, transpose it, and sum along axis 0. Compare time with vs without .copy() before the sum. The copy + contiguous sum can be faster than the strided sum.
- PyTorch strides: In PyTorch, print tensor.stride() after .reshape(), .T, .expand(), .contiguous(). Verify the patterns match what we learned.
Next: 28 - Build a Tensor Library — combine autograd, strides, and broadcasting into a working tensor class.