Tensor Internals: Strides and Views

Goal: Understand how tensors are stored in memory, what strides are, why reshape/transpose don’t copy data, and what “contiguous” means. This is how tinygrad’s movement ops (RESHAPE, PERMUTE, EXPAND) work under the hood.

Prerequisites: Vectors and Matrices, NumPy Essentials, 26 - The Minimal Op Set


A Tensor Is Just a Flat Array + Metadata

A 2D matrix doesn’t exist in memory as a grid. It’s a flat array of numbers plus metadata that says how to interpret it:

import numpy as np
 
# This matrix:
# [[1, 2, 3],
#  [4, 5, 6]]
#
# Is stored in memory as: [1, 2, 3, 4, 5, 6]
 
a = np.array([[1, 2, 3], [4, 5, 6]])
print(f"Shape:   {a.shape}")        # (2, 3)
print(f"Strides: {a.strides}")      # (24, 8) bytes — 3*8=24 bytes per row, 8 bytes per int64
print(f"Data:    {a.ravel()}")       # [1, 2, 3, 4, 5, 6]

Strides tell you: “how many bytes to jump to move one step along each dimension.”

  • Stride for axis 0 (rows): 24 bytes = 3 elements × 8 bytes/element
  • Stride for axis 1 (cols): 8 bytes = 1 element × 8 bytes/element
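Because strides are measured in bytes, they depend on the dtype. A quick sketch (assuming default C-order arrays) shows that dividing byte strides by the itemsize recovers dtype-independent element strides:

```python
import numpy as np

# Same logical shape, different dtypes: byte strides scale with itemsize
a64 = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64)
a32 = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)

print(a64.strides)  # (24, 8): 8-byte elements
print(a32.strides)  # (12, 4): 4-byte elements

# Dividing by itemsize gives element strides, which are dtype-independent
elem64 = tuple(s // a64.itemsize for s in a64.strides)
elem32 = tuple(s // a32.itemsize for s in a32.strides)
print(elem64, elem32)  # (3, 1) (3, 1)
```

This is why the rest of this chapter works in element strides: the layout logic is the same for every dtype.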

How Indexing Works

To find element a[i, j] in the flat array:

def manual_index(data_flat, strides_elements, indices):
    """Index into a flat array using strides."""
    offset = sum(i * s for i, s in zip(indices, strides_elements))
    return data_flat[offset]
 
flat = [1, 2, 3, 4, 5, 6]
strides = [3, 1]  # in elements, not bytes
 
# a[1, 2] = flat[1*3 + 2*1] = flat[5] = 6
print(f"a[1, 2] = {manual_index(flat, strides, (1, 2))}")
 
# a[0, 1] = flat[0*3 + 1*1] = flat[1] = 2
print(f"a[0, 1] = {manual_index(flat, strides, (0, 1))}")

Reshape: Change Metadata, Not Data

a = np.arange(12)  # [0, 1, 2, ..., 11]
print(f"1D: shape={a.shape}, strides={a.strides}")
 
b = a.reshape(3, 4)
print(f"2D: shape={b.shape}, strides={b.strides}")
 
c = a.reshape(2, 2, 3)
print(f"3D: shape={c.shape}, strides={c.strides}")
 
# All share the SAME memory
print(f"Same memory: {np.shares_memory(a, b)} and {np.shares_memory(b, c)}")
a[0] = 999
print(f"After modifying a[0]: b[0, 0] = {b[0, 0]}")  # also 999

Reshape just changes shape and recalculates strides. Zero cost.

def compute_strides(shape, itemsize=1):
    """Compute C-contiguous strides for a given shape."""
    strides = []
    stride = itemsize
    for s in reversed(shape):
        strides.append(stride)
        stride *= s
    return tuple(reversed(strides))
 
print(f"strides for (3, 4):    {compute_strides((3, 4))}")   # (4, 1)
print(f"strides for (2, 2, 3): {compute_strides((2, 2, 3))}")  # (6, 3, 1)

Transpose: Swap Strides, Not Data

a = np.arange(6).reshape(2, 3)
print(f"Original:   shape={a.shape}, strides={a.strides}")
# [[0, 1, 2],
#  [3, 4, 5]]
 
b = a.T
print(f"Transposed: shape={b.shape}, strides={b.strides}")
# [[0, 3],
#  [1, 4],
#  [2, 5]]
 
print(f"Same memory: {np.shares_memory(a, b)}")

The strides are swapped: (24, 8) becomes (8, 24). In the transposed view, stepping to the next row jumps 1 element (8 bytes), while stepping to the next column jumps 3 elements (24 bytes). The data itself is unchanged.

# Prove it: both index the same memory location
print(f"a[1, 2] = {a[1, 2]}")  # row 1, col 2 = 5
print(f"b[2, 1] = {b[2, 1]}")  # row 2, col 1 = 5 (same element)
 
# Same offset calculation:
# a[1, 2] → offset = 1*3 + 2*1 = 5
# b[2, 1] → offset = 2*1 + 1*3 = 5 ✓

Contiguous vs Non-Contiguous

A tensor is contiguous if its strides match what a freshly allocated array would have (row-major, C-order):

a = np.arange(6).reshape(2, 3)
print(f"a is contiguous: {a.flags['C_CONTIGUOUS']}")  # True
 
b = a.T
print(f"a.T is contiguous: {b.flags['C_CONTIGUOUS']}")  # False!
 
# To make it contiguous, we must copy the data
c = np.ascontiguousarray(b)
print(f"copy is contiguous: {c.flags['C_CONTIGUOUS']}")  # True
print(f"Same memory: {np.shares_memory(b, c)}")  # False — had to copy

Why contiguous matters

GPU kernels assume contiguous memory. When you pass a non-contiguous tensor to a kernel, the framework must either:

  1. Copy to contiguous memory first (tinygrad: CONTIGUOUS op)
  2. Generate a kernel that handles non-contiguous access (slower)

This is why tinygrad’s ShapeTracker checks if a chain of movement ops results in a contiguous layout. If yes → no copy needed.
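You can watch the same collapse logic in NumPy: a chain of views that lands back on a row-major layout stays copy-free. A small sketch:

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)

# reshape -> reshape: still a single contiguous view, no copy
b = a.reshape(6, 4).reshape(4, 6)
print(b.flags['C_CONTIGUOUS'], np.shares_memory(a, b))  # True True

# A transpose breaks contiguity...
c = a.reshape(6, 4).T
print(c.flags['C_CONTIGUOUS'])  # False

# ...but transposing back cancels it out: contiguous again, still no copy
d = c.T
print(d.flags['C_CONTIGUOUS'], np.shares_memory(a, d))  # True True
```

The whole chain reshape → transpose → transpose collapses to the identity, so no copy is ever needed, which is exactly the kind of simplification a ShapeTracker performs symbolically.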


Expand (Broadcast): Stride of Zero

Broadcasting uses a stride of 0 — “moving” along the broadcast dimension always returns to the same data:

a = np.array([[1], [2], [3]])  # shape (3, 1)
print(f"Before expand: shape={a.shape}, strides={a.strides}")
 
b = np.broadcast_to(a, (3, 4))
print(f"After expand:  shape={b.shape}, strides={b.strides}")
# strides = (8, 0) — stride of 0 in the expanded dimension!
print(b)
# [[1, 1, 1, 1],
#  [2, 2, 2, 2],
#  [3, 3, 3, 3]]
 
# NO new memory allocated — the "copies" are virtual
print(f"Same memory: {np.shares_memory(a, b)}")

This is how a bias (1, 128) gets “added” to a batch (32, 128) without creating 32 copies.
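A sketch of that bias add with NumPy (the (32, 128) shapes are just the ones from the sentence above; `+` would broadcast automatically, the explicit `broadcast_to` only makes the mechanism visible):

```python
import numpy as np

batch = np.zeros((32, 128))
bias = np.arange(128, dtype=np.float64).reshape(1, 128)

# Virtual expansion: stride 0 along axis 0, no new memory allocated
bias_view = np.broadcast_to(bias, (32, 128))
print(bias_view.strides)                  # (0, 8): axis 0 is free
print(np.shares_memory(bias, bias_view))  # True

out = batch + bias_view  # each of the 32 rows reads the same 128 values
print(np.array_equal(out, np.tile(bias, (32, 1))))  # True
```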


Slice (Shrink): Offset the Pointer

a = np.arange(10)
print(f"Original: data={a}, strides={a.strides}")
 
b = a[3:7]  # shrink: start at offset 3, length 4
print(f"Slice:    data={b}, strides={b.strides}")
print(f"Same memory: {np.shares_memory(a, b)}")
 
# b points to the same buffer but starts at element 3
# offset = 3 * stride = 3 * 8 = 24 bytes into the buffer

No copy — just change the starting pointer and shape.
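We can verify the offset directly: `__array_interface__` exposes the underlying data pointer, and the slice's pointer sits exactly 3 elements past the original's.

```python
import numpy as np

a = np.arange(10)
b = a[3:7]

addr_a = a.__array_interface__['data'][0]
addr_b = b.__array_interface__['data'][0]

# The view's base pointer is offset by 3 elements' worth of bytes
print(addr_b - addr_a == 3 * a.itemsize)  # True
```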


Flip: Negative Stride

a = np.arange(5)
print(f"Original: data={a}, strides={a.strides}")
 
b = a[::-1]
print(f"Flipped:  data={b}, strides={b.strides}")
# strides = (-8,) — negative! Walking forward in the view walks backward in memory
print(f"Same memory: {np.shares_memory(a, b)}")
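The flipped view's data pointer confirms the picture: it points at the last element, and the stride is negative, so walking forward in the view walks backward in memory.

```python
import numpy as np

a = np.arange(5)
b = a[::-1]

addr_a = a.__array_interface__['data'][0]
addr_b = b.__array_interface__['data'][0]

print(addr_b - addr_a == 4 * a.itemsize)  # True: view starts at a[4]
print(b.strides[0] == -a.itemsize)        # True: negative stride
```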

Build a Mini ShapeTracker

tinygrad tracks all shape changes symbolically, only copying data when necessary:

class ShapeTracker:
    """Track shape transformations without touching data."""
    def __init__(self, shape):
        self.shape = shape
        self.strides = compute_strides(shape)
        self.offset = 0
        self.ops = []
 
    def reshape(self, new_shape):
        assert np.prod(new_shape) == np.prod(self.shape), "element count must match"
        self.shape = new_shape
        self.strides = compute_strides(new_shape)
        self.ops.append(f"reshape{new_shape}")
        return self
 
    def permute(self, axes):
        self.shape = tuple(self.shape[a] for a in axes)
        self.strides = tuple(self.strides[a] for a in axes)
        self.ops.append(f"permute{axes}")
        return self
 
    def expand(self, new_shape):
        new_strides = []
        for old_s, old_st, new_s in zip(self.shape, self.strides, new_shape):
            if old_s == new_s:
                new_strides.append(old_st)
            elif old_s == 1:
                new_strides.append(0)  # broadcast: stride = 0
            else:
                raise ValueError(f"Can't expand {old_s} to {new_s}")
        self.shape = new_shape
        self.strides = tuple(new_strides)
        self.ops.append(f"expand{new_shape}")
        return self
 
    def is_contiguous(self):
        expected = compute_strides(self.shape)
        return self.strides == expected
 
    def index(self, *indices):
        """Compute flat offset for given indices."""
        return self.offset + sum(i * s for i, s in zip(indices, self.strides))
 
    def __repr__(self):
        return (f"ShapeTracker(shape={self.shape}, strides={self.strides}, "
                f"contiguous={self.is_contiguous()}, ops={self.ops})")
 
# Demo: chain of transformations
st = ShapeTracker((2, 3, 4))
print(st)
 
st.reshape((6, 4))
print(st)
 
st.permute((1, 0))
print(st)
# Not contiguous! A copy would be needed here for GPU execution.
 
print(f"\nElement at [2, 3]: offset = {st.index(2, 3)}")
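The demo above doesn't exercise expand, so here is the same index arithmetic spelled out standalone, using the (1, 0) element strides the tracker would produce for a (3, 1) → (3, 4) expand:

```python
# Element strides after expanding (3, 1) to (3, 4):
# the broadcast axis gets stride 0
shape, strides = (3, 4), (1, 0)

def flat_offset(i, j):
    """Flat offset for index (i, j) under the given strides."""
    return i * strides[0] + j * strides[1]

# Every column of row 2 maps to the same flat element
print([flat_offset(2, j) for j in range(4)])  # [2, 2, 2, 2]
```

Stride 0 is what makes the four "copies" per row virtual: the offset calculation simply ignores the broadcast index.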

When Does a Copy Happen?

def needs_copy(tracker):
    """Check if the current view requires a memory copy."""
    if not tracker.is_contiguous():
        return True
    if any(s == 0 for s in tracker.strides):
        return True  # expanded tensors need materialization for writes
    return False
 
# Case 1: reshape only → no copy
st1 = ShapeTracker((12,))
st1.reshape((3, 4))
print(f"Reshape only: needs_copy={needs_copy(st1)}")  # False
 
# Case 2: reshape + permute → copy needed
st2 = ShapeTracker((3, 4))
st2.permute((1, 0))
print(f"After permute: needs_copy={needs_copy(st2)}")  # True
 
# Case 3: expand → depends on usage
st3 = ShapeTracker((3, 1))
st3.expand((3, 4))
print(f"After expand: needs_copy={needs_copy(st3)}")  # True — stride-0 view, not contiguous

tinygrad’s allocator (contiguous_mops_to_view) checks if a chain of movement ops collapses to a single contiguous view. If yes, it creates a BUFFER_VIEW (free). If no, it inserts a CONTIGUOUS operation (memory copy).


Why This Matters for Performance

import time
 
n = 10_000_000
 
# Contiguous access — fast (cache-friendly)
a = np.arange(n, dtype=np.float64)
start = time.perf_counter()
_ = a.sum()
t_contig = time.perf_counter() - start
 
# Non-contiguous access — slower (cache misses)
b = np.arange(n * 2, dtype=np.float64)[::2]  # stride = 2
start = time.perf_counter()
_ = b.sum()
t_strided = time.perf_counter() - start
 
print(f"Contiguous sum: {t_contig*1000:.2f} ms")
print(f"Strided sum:    {t_strided*1000:.2f} ms")
print(f"Ratio:          {t_strided/t_contig:.1f}x slower")

GPUs make this even more dramatic — non-coalesced memory access can be 10-100x slower.


Exercises

  1. Row-major vs column-major: Create a Fortran-order array (np.array(..., order='F')). Print its strides. How do they differ from C-order? When does this matter?

  2. As-strided tricks: Use np.lib.stride_tricks.as_strided to create a sliding window view of a 1D array — shape (n-k+1, k) with overlapping windows. This is how convolution extracts patches.

  3. Track all movement ops: Extend ShapeTracker with shrink(ranges), pad(padding), and flip(axes). Implement flip using negative strides.

  4. Benchmark: Create a large matrix, transpose it, and sum along axis 0. Compare time with vs without .copy() before the sum. The copy + contiguous sum can be faster than the strided sum.

  5. PyTorch strides: In PyTorch, print tensor.stride() after .reshape(), .T, .expand(), .contiguous(). Verify the patterns match what we learned.


Next: 28 - Build a Tensor Library — combine autograd, strides, and broadcasting into a working tensor class.