State Space Models
What
SSMs are an alternative to transformers for sequence modeling, inspired by continuous-time state space models from control theory. They process sequences in linear O(n) time rather than the quadratic O(n²) cost of full attention, making them attractive for very long sequences (100k+ tokens).
The core idea: represent a sequence as a dynamical system where a hidden state h_t evolves over time, and an output y_t is produced from the state.
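As a concrete sketch of that idea, here is the discrete-time version of such a system in plain NumPy. The shapes and values (4 hidden dimensions, scalar input/output, a fixed decay of 0.9) are illustrative choices, not from any particular paper:

```python
import numpy as np

# Discrete linear state space system: h_t = A h_{t-1} + B x_t, y_t = C h_t.
rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)              # state transition (decays the state each step)
B = rng.standard_normal((4, 1))  # input projection
C = rng.standard_normal((1, 4))  # output projection

def run_ssm(xs):
    h = np.zeros((4, 1))         # hidden state, fixed size regardless of seq length
    ys = []
    for x in xs:                 # sequential recurrence over the sequence
        h = A @ h + B * x        # state update
        ys.append((C @ h).item())  # readout
    return ys

ys = run_ssm([1.0, 0.0, 0.0])    # impulse response decays geometrically (factor 0.9)
```

Because A here is 0.9·I, feeding an impulse produces outputs that shrink by 0.9 each step, which is the "decay rate" intuition used throughout this note.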
The Continuous Foundation: HiPPO
Before SSMs were adapted for deep learning, the HiPPO framework (High-order Polynomial Projection Operators, Gu et al., 2020) showed how to compress a signal's history online by projecting it onto orthogonal polynomial bases: the projection coefficients evolve according to a fixed linear ODE. The key insight: discretizing that ODE yields recurrence relations that can be computed efficiently.
S4 (Structured State Space Sequence model, Gu et al., 2021) was the first to apply this theory to deep learning, using:
- Structured matrices (diagonal + low-rank) for efficient computation
- Initialization based on HiPPO-LegS (Legendre polynomials scaled by time)
- Convolutional training mode: the linear time-invariant recurrence unrolls into a global convolution, computed in O(n log n) via FFT (the parallel-scan path came later, with S5 and Mamba)
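The bridge from the continuous ODE h'(t) = A·h(t) + B·x(t) to a usable recurrence is a discretization rule. A sketch of the zero-order-hold rule (the one Mamba uses; S4 itself used a bilinear transform), restricted to a diagonal A for simplicity — the function name and shapes here are illustrative:

```python
import numpy as np

def discretize_zoh(A_diag, B, dt):
    """Zero-order-hold discretization of a diagonal continuous-time SSM.

    Continuous: h'(t) = A h(t) + B x(t)  ->  discrete: h_t = dA h_{t-1} + dB x_t
    A_diag: (d_state,) diagonal of A; B: (d_state, d_in); dt: step size.
    """
    dA = np.exp(dt * A_diag)                 # exp(dt*A), elementwise for diagonal A
    # (dt*A)^-1 (exp(dt*A) - I) dt*B simplifies to (dA - 1)/A_diag * B
    dB = ((dA - 1.0) / A_diag)[:, None] * B
    return dA, dB
```

For small dt, dB approaches dt·B and dA approaches I + dt·A, recovering a simple Euler step; larger dt bakes more decay into each step.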
Mamba (2023): Selective State Spaces
The breakthrough architecture. Key innovations:
Input-dependent parameters
Unlike S4’s fixed A, B, C matrices, Mamba makes the parameters input-dependent: the step size Δ and the projections B and C are computed from each token (A itself stays fixed, but the effective transition Ā(x_t) = exp(Δ(x_t)·A) inherits the input dependence), so the model selectively scans the input:
h_t = Ā(x_t) × h_{t-1} + B̄(x_t) × x_t
y_t = C(x_t) × h_t
This is analogous to how attention selectively attends to different positions — Mamba selectively chooses what to remember from the past state.
Hardware-aware algorithm
Standard SSM recurrence is slow on GPU memory hierarchies. Mamba uses:
- Parallel scan for forward pass (exploits GPU parallelism)
- Chunk-wise computation that keeps intermediate states in fast SRAM
- Kernel fusion to avoid materializing large matrices
The SSM recurrence in code
```python
import numpy as np

# Simplified Mamba step: sequential reference version of the selective SSM
# recurrence. Real Mamba fuses this loop into a hardware-aware parallel scan
# and computes dt, B, C from the input.
def mamba_step(x, params):
    A, B, C, dt = params['A'], params['B'], params['C'], params['dt']
    # Discretize: exp(dt * A) acts as a per-dimension decay rate
    dA = np.exp(dt * A)               # shape: (d_state,)
    dB = dt * B                       # shape: (d_state, d_model)
    h = np.zeros(A.shape[0])          # hidden state, shape: (d_state,)
    ys = []
    for t in range(x.shape[0]):       # recurrence: h_t = dA * h_{t-1} + dB @ x_t
        h = dA * h + dB @ x[t]
        ys.append(C @ h)              # output projection, shape: (d_model,)
    return np.stack(ys)               # shape: (seq_len, d_model)
```
Mamba-2 (2024): State Space Duality
The key theoretical result: selective SSMs and structured attention are mathematically equivalent (state space duality, SSD).
This means:
- The attention mechanism can be rewritten as a special case of SSM computation
- Mamba-2 exploits this to use tensor cores (matrix multiplication) instead of the parallel scan
- Result: 2-8x faster training while maintaining equivalent quality
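The duality can be seen concretely by unrolling the recurrence: y_i = Σ_{j≤i} C_i (A_i ⋯ A_{j+1}) B_j x_j, i.e. the whole SSM is multiplication by a lower-triangular "semiseparable" matrix — the same shape as a causally masked attention matrix. A toy scalar-state sketch (illustrative only, not the Mamba-2 kernel; B = C = 1 and a fixed decay a = 0.5):

```python
import numpy as np

def ssm_as_matrix(a, seq_len):
    """Lower-triangular matrix M with M[i, j] = a**(i - j) for j <= i.

    Multiplying by M equals running the scalar recurrence
    h_t = a * h_{t-1} + x_t, y_t = h_t.
    """
    i, j = np.indices((seq_len, seq_len))
    return np.where(j <= i, a ** (i - j), 0.0)

x = np.array([1.0, 2.0, 3.0])
M = ssm_as_matrix(0.5, 3)
y_matmul = M @ x                # attention-style matmul form (tensor-core friendly)

h, y_scan = 0.0, []             # same result via the recurrence (scan form)
for xt in x:
    h = 0.5 * h + xt
    y_scan.append(h)
```

Both paths compute the same outputs; Mamba-2's contribution is a blocked algorithm that uses the matmul form within chunks and the recurrence across chunks, so tensor cores do most of the work.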
Architecture changes from Mamba-1 to Mamba-2
- Larger state dimension (d=16 in Mamba-1 → d=64-256 in Mamba-2)
- Different normalization strategy (RMSNorm instead of channel norm)
- Simplified selection mechanism with a better theoretical grounding
Paper: arXiv:2405.21060
Jamba: Hybrid SSM + Attention
AI21’s production model combining:
- Mamba layers (SSM for efficient long-range dependency)
- Attention layers (for local pattern recognition)
- MoE layers (for parameter efficiency)
Configuration: 52B total parameters, 12B active (per token), 256K context window.
The hybrid approach is currently the most production-viable: pure SSM models still lag transformers on complex reasoning tasks, but the combination outperforms either alone at the same compute budget.
RWKV: Linear RNN Alternative
RWKV (Receptance Weighted Key Value, Peng et al., 2023/2024) takes a different approach:
- Parallelizable training like transformers
- O(1) inference like traditional RNNs (state size doesn’t grow with sequence length)
- Linear attention reformulation with time-mixing and channel-mixing
- Scaled to 14B parameters
Paper: arXiv:2305.13048
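The O(1)-inference property comes from carrying a fixed-size recurrent state instead of a growing KV cache. A deliberately simplified linear-attention-style sketch of that idea (this shows the general mechanism, not RWKV's exact time-mixing formula, which uses learned per-channel decays and numerical stabilization; the scalar decay here is a placeholder):

```python
import numpy as np

d = 8
rng = np.random.default_rng(1)
decay = 0.95                        # per-step forgetting (RWKV learns per-channel decays)
S = np.zeros((d, d))                # fixed-size recurrent state: decayed sum of k v^T

def step(k, v, r, S):
    """One token of linear-attention-style recurrence: O(d^2) time and memory."""
    S = decay * S + np.outer(k, v)  # fold the new key/value into the state
    y = r @ S                       # readout gated by the receptance vector r
    return y, S

for _ in range(5):                  # the state never grows with sequence length
    k, v, r = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)
    y, S = step(k, v, r, S)
```

Each step touches only the d×d state, so generating the 10,000th token costs the same as the 10th — the property that makes this family attractive for edge inference.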
Why SSMs Matter
| Architecture | Forward pass (length n) | Inference memory | Training |
|---|---|---|---|
| Transformer (full attention) | O(n²) | O(n) KV cache | Parallel |
| Mamba (selective SSM) | O(n) | O(1) state | Parallel (via scan) |
| RWKV (linear RNN) | O(n), O(1) per step | O(1) state | Parallel |
| Traditional RNN | O(n) sequential | O(1) state | Sequential |
The practical advantage: SSMs can process 1M+ token contexts at a fraction of transformer cost, with comparable quality on most tasks.
Practical Guidance (2025)
- For new projects: Consider hybrid architectures (Mamba + attention) or pure Mamba for long-context tasks
- For best reasoning quality: Transformers still lead on complex multi-step reasoning
- For very long sequences: Mamba-2 with 128K+ context is the practical choice
- For mobile/edge: RWKV’s O(1) inference is advantageous
- State dimension: Higher is better for complex tasks but more memory; d=64-128 is common for 7B-scale models
Key Papers
- Efficiently Modeling Long Sequences with Structured State Spaces (Gu et al., 2021, ICLR) — S4, the foundational SSM · arXiv:2111.00396
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Gu & Dao, 2023) · arXiv:2312.00752
- Transformers are SSMs (Dao & Gu, 2024, ICML) — Mamba-2 SSD theory · arXiv:2405.21060
- RWKV: Reinventing RNNs for the Transformer Era (Peng et al., 2023, EMNLP) · arXiv:2305.13048
- From S4 to Mamba: A Comprehensive Survey on Structured State Space Models (Somvanshi et al., 2025) — 30-page survey covering S4, Mamba, S5, Jamba · arXiv:2503.18970
Links
- Transformers — what SSMs compete with
- Attention Mechanism — the quadratic bottleneck SSMs avoid
- Key Papers — foundational papers
- Mixture of Experts — another efficiency technique for scaling