State Space Models
What
SSMs are an alternative to transformers for sequence modeling, inspired by continuous-time state space models from control theory. They process sequences in linear O(n) time rather than the quadratic O(n²) cost of full attention, making them attractive for very long sequences (100k+ tokens).
The core idea: represent a sequence as a dynamical system where a hidden state h_t evolves over time, and an output y_t is produced from the state.
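As a concrete sketch of that idea, here is the discrete-time version of such a system in plain NumPy. The shapes and values (4 hidden dimensions, scalar input/output, a fixed decay of 0.9) are illustrative choices, not from any particular paper:

```python
import numpy as np

# Discrete linear state space system: h_t = A h_{t-1} + B x_t, y_t = C h_t.
rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)              # state transition (decays the state each step)
B = rng.standard_normal((4, 1))  # input projection
C = rng.standard_normal((1, 4))  # output projection

def run_ssm(xs):
    h = np.zeros((4, 1))         # hidden state, fixed size regardless of seq length
    ys = []
    for x in xs:                 # sequential recurrence over the sequence
        h = A @ h + B * x        # state update
        ys.append((C @ h).item())  # readout
    return ys

ys = run_ssm([1.0, 0.0, 0.0])    # impulse response decays geometrically (factor 0.9)
```

Because A here is 0.9·I, feeding an impulse produces outputs that shrink by 0.9 each step, which is the "decay rate" intuition used throughout this note.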
The Continuous Foundation: HiPPO
Before SSMs were adapted for deep learning, the HiPPO framework (High-order Polynomial Projection Operators, Gu et al., 2020) showed how to compress a signal's history online by projecting it onto orthogonal polynomial bases: the projection coefficients evolve according to a fixed linear ODE. The key insight: discretizing that ODE yields recurrence relations that can be computed efficiently.
S4 (Structured State Space Sequence model, Gu et al., 2021) was the first to apply this theory to deep learning, using:
- Structured matrices (diagonal + low-rank) for efficient computation
- Initialization based on HiPPO-LegS (Legendre polynomials scaled by time)
- Convolutional training mode: the linear time-invariant recurrence unrolls into a global convolution, computed in O(n log n) via FFT (the parallel-scan path came later, with S5 and Mamba)
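The bridge from the continuous ODE h'(t) = A·h(t) + B·x(t) to a usable recurrence is a discretization rule. A sketch of the zero-order-hold rule (the one Mamba uses; S4 itself used a bilinear transform), restricted to a diagonal A for simplicity — the function name and shapes here are illustrative:

```python
import numpy as np

def discretize_zoh(A_diag, B, dt):
    """Zero-order-hold discretization of a diagonal continuous-time SSM.

    Continuous: h'(t) = A h(t) + B x(t)  ->  discrete: h_t = dA h_{t-1} + dB x_t
    A_diag: (d_state,) diagonal of A; B: (d_state, d_in); dt: step size.
    """
    dA = np.exp(dt * A_diag)                 # exp(dt*A), elementwise for diagonal A
    # (dt*A)^-1 (exp(dt*A) - I) dt*B simplifies to (dA - 1)/A_diag * B
    dB = ((dA - 1.0) / A_diag)[:, None] * B
    return dA, dB
```

For small dt, dB approaches dt·B and dA approaches I + dt·A, recovering a simple Euler step; larger dt bakes more decay into each step.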
Mamba (2023): Selective State Spaces
The breakthrough architecture. Key innovations:
Input-dependent parameters
Unlike S4’s fixed A, B, C matrices, Mamba makes the parameters input-dependent: the step size Δ and the projections B and C are computed from each token (A itself stays fixed, but the effective transition Ā(x_t) = exp(Δ(x_t)·A) inherits the input dependence), so the model selectively scans the input:
h_t = Ā(x_t) × h_{t-1} + B̄(x_t) × x_t
y_t = C(x_t) × h_t
This is analogous to how attention selectively attends to different positions — Mamba selectively chooses what to remember from the past state.
Hardware-aware algorithm
Standard SSM recurrence is slow on GPU memory hierarchies. Mamba uses:
- Parallel scan for forward pass (exploits GPU parallelism)
- Chunk-wise computation that keeps intermediate states in fast SRAM
- Kernel fusion to avoid materializing large matrices
The SSM recurrence in code
```python
import numpy as np

# Simplified Mamba step: sequential reference version of the selective SSM
# recurrence. Real Mamba fuses this loop into a hardware-aware parallel scan
# and computes dt, B, C from the input.
def mamba_step(x, params):
    A, B, C, dt = params['A'], params['B'], params['C'], params['dt']
    # Discretize: exp(dt * A) acts as a per-dimension decay rate
    dA = np.exp(dt * A)               # shape: (d_state,)
    dB = dt * B                       # shape: (d_state, d_model)
    h = np.zeros(A.shape[0])          # hidden state, shape: (d_state,)
    ys = []
    for t in range(x.shape[0]):       # recurrence: h_t = dA * h_{t-1} + dB @ x_t
        h = dA * h + dB @ x[t]
        ys.append(C @ h)              # output projection, shape: (d_model,)
    return np.stack(ys)               # shape: (seq_len, d_model)
```
Mamba-2 (2024): State Space Duality
The key theoretical result: selective SSMs and structured attention are mathematically equivalent (state space duality, SSD).
This means:
- The attention mechanism can be rewritten as a special case of SSM computation
- Mamba-2 exploits this to use tensor cores (matrix multiplication) instead of the parallel scan
- Result: 2-8x faster training while maintaining equivalent quality
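The duality can be seen concretely by unrolling the recurrence: y_i = Σ_{j≤i} C_i (A_i ⋯ A_{j+1}) B_j x_j, i.e. the whole SSM is multiplication by a lower-triangular "semiseparable" matrix — the same shape as a causally masked attention matrix. A toy scalar-state sketch (illustrative only, not the Mamba-2 kernel; B = C = 1 and a fixed decay a = 0.5):

```python
import numpy as np

def ssm_as_matrix(a, seq_len):
    """Lower-triangular matrix M with M[i, j] = a**(i - j) for j <= i.

    Multiplying by M equals running the scalar recurrence
    h_t = a * h_{t-1} + x_t, y_t = h_t.
    """
    i, j = np.indices((seq_len, seq_len))
    return np.where(j <= i, a ** (i - j), 0.0)

x = np.array([1.0, 2.0, 3.0])
M = ssm_as_matrix(0.5, 3)
y_matmul = M @ x                # attention-style matmul form (tensor-core friendly)

h, y_scan = 0.0, []             # same result via the recurrence (scan form)
for xt in x:
    h = 0.5 * h + xt
    y_scan.append(h)
```

Both paths compute the same outputs; Mamba-2's contribution is a blocked algorithm that uses the matmul form within chunks and the recurrence across chunks, so tensor cores do most of the work.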
Architecture changes from Mamba-1 to Mamba-2
- Larger state dimension (d=16 in Mamba-1 → d=64-256 in Mamba-2)
- Different normalization strategy (RMSNorm instead of channel norm)
- Simplified selection mechanism with a better theoretical grounding
Paper: arXiv:2405.21060
Jamba: Hybrid SSM + Attention
AI21’s production model combining:
- Mamba layers (SSM for efficient long-range dependency)
- Attention layers (for local pattern recognition)
- MoE layers (for parameter efficiency)
Configuration: 52B total parameters, 12B active (per token), 256K context window.
The hybrid approach is currently the most production-viable: pure SSM models still lag transformers on complex reasoning tasks, but the combination outperforms either alone at the same compute budget.
RWKV: Linear RNN Alternative
RWKV (Receptance Weighted Key Value, Peng et al., 2023/2024) takes a different approach:
- Parallelizable training like transformers
- O(1) inference like traditional RNNs (state size doesn’t grow with sequence length)
- Linear attention reformulation with time-mixing and channel-mixing
- Scaled to 14B parameters
Paper: arXiv:2305.13048
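The O(1)-inference property comes from carrying a fixed-size recurrent state instead of a growing KV cache. A deliberately simplified linear-attention-style sketch of that idea (this shows the general mechanism, not RWKV's exact time-mixing formula, which uses learned per-channel decays and numerical stabilization; the scalar decay here is a placeholder):

```python
import numpy as np

d = 8
rng = np.random.default_rng(1)
decay = 0.95                        # per-step forgetting (RWKV learns per-channel decays)
S = np.zeros((d, d))                # fixed-size recurrent state: decayed sum of k v^T

def step(k, v, r, S):
    """One token of linear-attention-style recurrence: O(d^2) time and memory."""
    S = decay * S + np.outer(k, v)  # fold the new key/value into the state
    y = r @ S                       # readout gated by the receptance vector r
    return y, S

for _ in range(5):                  # the state never grows with sequence length
    k, v, r = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)
    y, S = step(k, v, r, S)
```

Each step touches only the d×d state, so generating the 10,000th token costs the same as the 10th — the property that makes this family attractive for edge inference.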
Why SSMs Matter
| Architecture | Forward pass (length n) | Inference memory | Training |
|---|---|---|---|
| Transformer (full attention) | O(n²) | O(n) KV cache | Parallel |
| Mamba (selective SSM) | O(n) | O(1) state | Parallel (via scan) |
| RWKV (linear RNN) | O(n), O(1) per step | O(1) state | Parallel |
| Traditional RNN | O(n) sequential | O(1) state | Sequential |
The practical advantage: SSMs can process 1M+ token contexts at a fraction of transformer cost, with comparable quality on most tasks.
Practical Guidance (2025)
- For new projects: Consider hybrid architectures (Mamba + attention) or pure Mamba for long-context tasks
- For best reasoning quality: Transformers still lead on complex multi-step reasoning
- For very long sequences: Mamba-2 with 128K+ context is the practical choice
- For mobile/edge: RWKV’s O(1) inference is advantageous
- State dimension: Higher is better for complex tasks but more memory; d=64-128 is common for 7B-scale models
Key Papers
- Efficiently Modeling Long Sequences with Structured State Spaces (Gu et al., 2021, ICLR) — S4, the foundational SSM · arXiv:2111.00396
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Gu & Dao, 2023) · arXiv:2312.00752
- Transformers are SSMs (Dao & Gu, 2024, ICML) — Mamba-2 SSD theory · arXiv:2405.21060
- RWKV: Reinventing RNNs for the Transformer Era (Peng et al., 2023, EMNLP) · arXiv:2305.13048
- From S4 to Mamba: A Comprehensive Survey on Structured State Space Models (Somvanshi et al., 2025) — 30-page survey covering S4, Mamba, S5, Jamba · arXiv:2503.18970
Links
- Transformers — what SSMs compete with
- Attention Mechanism — the quadratic bottleneck SSMs avoid
- Key Papers — foundational papers
- Mixture of Experts — another efficiency technique for scaling