Speculative Decoding

What

Speed up LLM inference by having a small, fast “draft” model propose several tokens, which the large target model then verifies in a single parallel pass. No quality loss: the output is mathematically equivalent to sampling from the target model alone.

How it works

  1. Draft model (small, fast) generates k candidate tokens autoregressively
  2. Target model (large, accurate) evaluates all k tokens in one forward pass (parallel)
  3. Accept the tokens the target model agrees with; at the first disagreement, reject and resample a corrected token
  4. If all k drafts are accepted, the target’s forward pass yields one bonus token for free
  5. Guaranteed to produce the same distribution as the target model alone
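The loop above can be sketched in a few lines. This is a minimal sketch, not a real implementation: `target_probs` and `draft_probs` are hypothetical callables (sequence of token ids → next-token distribution) standing in for the actual models, and the toy “models” at the bottom are fixed distributions chosen just to exercise the loop.

```python
import random

def speculative_decode(target_probs, draft_probs, prompt, k, n_new, rng):
    """Minimal speculative decoding loop (sketch, hypothetical model API)."""
    seq = list(prompt)
    while len(seq) < len(prompt) + n_new:
        # 1. draft model proposes k tokens autoregressively
        drafted, qs = [], []
        for _ in range(k):
            q = draft_probs(seq + drafted)
            qs.append(q)
            drafted.append(rng.choices(range(len(q)), weights=q)[0])
        # 2. target model scores every drafted position; a real model
        #    does this in one batched forward pass
        ps = [target_probs(seq + drafted[:i]) for i in range(k)]
        # 3. accept left to right; at the first rejection, resample from
        #    the adjusted distribution max(0, p - q) and discard the rest
        all_accepted = True
        for x, p, q in zip(drafted, ps, qs):
            if rng.random() < min(1.0, p[x] / q[x]):
                seq.append(x)
            else:
                residual = [max(0.0, a - b) for a, b in zip(p, q)]
                seq.append(rng.choices(range(len(p)), weights=residual)[0])
                all_accepted = False
                break
        # 4. all k accepted: the target pass gives one bonus token for free
        if all_accepted:
            p_next = target_probs(seq)
            seq.append(rng.choices(range(len(p_next)), weights=p_next)[0])
    return seq[len(prompt):len(prompt) + n_new]

# Toy "models" over a 2-token vocabulary (fixed distributions, for illustration).
target = lambda seq: [0.7, 0.3]
draft = lambda seq: [0.5, 0.5]
out = speculative_decode(target, draft, prompt=[0], k=4, n_new=10_000,
                         rng=random.Random(0))
print(out.count(0) / len(out))  # close to 0.7: output follows the target, not the draft
```

Note that the draft samples from q but the output frequencies track p, even though the draft proposed tokens 50/50.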

Rejection sampling intuition

For each drafted token x, compare the draft model’s probability q(x) with the target model’s probability p(x). Accept the token with probability min(1, p(x)/q(x)). If the draft model is confident and the target agrees, acceptance is near 100%. If the draft proposes something the target finds unlikely, the token is rejected and we resample from the adjusted distribution norm(max(0, p(x) − q(x))). This modified rejection sampling scheme guarantees the output distribution is exactly p(x).
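The distribution guarantee can be checked empirically on a toy example. The numbers below are hypothetical, picked only so that p and q disagree; the simulation shows the accept/reject rule reproduces p, not q.

```python
import random

# Toy distributions over a 3-token vocabulary (hypothetical numbers).
p = [0.6, 0.3, 0.1]   # target model
q = [0.3, 0.5, 0.2]   # draft model

def speculative_sample(p, q, rng):
    """Draft proposes x ~ q; accept with probability min(1, p[x]/q[x]);
    on rejection, resample from the normalized residual max(0, p - q)."""
    x = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    residual = [max(0.0, a - b) for a, b in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]

rng = random.Random(0)
n = 100_000
counts = [0] * 3
for _ in range(n):
    counts[speculative_sample(p, q, rng)] += 1
freqs = [c / n for c in counts]
print(freqs)  # close to [0.6, 0.3, 0.1]: exactly p, not q
```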

Why it works

  • LLM inference is memory-bandwidth bound, not compute bound
  • Verifying k tokens in one forward pass costs roughly the same wall-clock time as generating 1, because the dominant cost is streaming the weights from memory, not the extra arithmetic
  • Acceptance rate is typically 70-90% for well-matched draft models → 2-3x speedup
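The speedup claim follows from a back-of-envelope model. With per-token acceptance rate alpha and k drafted tokens, one target pass yields on average (1 − alpha^(k+1)) / (1 − alpha) tokens, counting the bonus token when all k are accepted (this is the expectation from Leviathan et al., 2023). The cost model below, with a draft-cost ratio c per token, is a simplifying assumption:

```python
def expected_tokens(alpha, k):
    """Expected tokens generated per target forward pass."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha, k, c):
    """Rough speedup estimate. c = draft cost per token relative to one
    target pass; total cost per iteration assumed to be k*c + 1."""
    return expected_tokens(alpha, k) / (k * c + 1)

print(round(expected_tokens(0.8, 4), 2))  # 3.36 tokens per target pass
print(round(speedup(0.8, 4, 0.05), 2))    # 2.8
```

With 80% acceptance, 4 drafted tokens, and a draft model 20x cheaper, this lands right in the 2-3x range quoted above.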

Draft model selection

The draft model needs to be (a) much faster than the target, and (b) have high token agreement. Good choices:

  • Same architecture family, smaller size (e.g., LLaMA-7B drafts for LLaMA-70B)
  • Distilled version of the target model
  • Models trained on similar data distributions

Bad match = low acceptance rate = no speedup. The draft model’s vocabulary must match the target’s.

Variants

Medusa — instead of a separate draft model, add multiple prediction heads to the target model itself. Each head predicts a different future position. No draft model needed, but requires finetuning the heads.

Self-speculative decoding — the target model drafts for itself using early exit (stop at layer 12 instead of 32 for the draft). Same weights, no extra model to manage.

Staged speculative decoding — chain multiple draft models of increasing size. Tiny model drafts, medium model verifies and extends, large model does final verification. Useful when the gap between draft and target is very large.

When it helps most

Speculative decoding shines in specific scenarios:

  • Batch size 1 — single-user inference where you’re memory-bound. At large batch sizes, you become compute-bound and the free lunch disappears
  • Long context — the KV cache is already loaded, so verification is cheap
  • High-agreement tasks — code completion, translation, and formulaic text have high acceptance rates
  • Streaming — users see tokens appearing faster

It helps less for: large-batch serving (already compute-bound), creative or high-temperature generation (low acceptance rates), and very short outputs.

Key paper

  • Fast Inference from Transformers via Speculative Decoding (Leviathan et al., 2023)