Speculative decoding is an inference acceleration technique that uses a smaller, faster draft model to propose multiple tokens, which a larger target model then verifies in parallel, reducing the number of expensive forward passes required for generation. Standard autoregressive generation runs one forward pass per token: a 100-token response requires 100 sequential forward passes through the target model. Speculative decoding changes this: the draft model rapidly generates several candidate tokens (typically 3-8), then the target model evaluates all of them in a single forward pass. Accepted tokens are kept; the first rejected token triggers regeneration from that point. If the draft model is well aligned with the target model (i.e., it predicts similar distributions), most candidates are accepted and throughput improves sharply.

The technique exploits two key observations: verification is cheaper than generation (checking whether a batch of candidate tokens is acceptable takes a single forward pass regardless of how many tokens are checked), and small models can often predict what large models would generate (simple continuations are predictable). Speedups of 2-3x are common, with output quality identical to standard decoding, since rejected tokens are resampled correctly from the target model's distribution. The draft model can be a separate small model, a subset of the target model's layers, or even a simple n-gram model. Speculative decoding is increasingly important as models grow larger and inference costs dominate.
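The propose-then-verify loop can be sketched with toy stand-in models. This is a minimal illustration, not a production implementation: `draft_dist` and `target_dist` are hypothetical functions returning fixed distributions over a tiny vocabulary, and the target's "parallel" verification pass is simulated with a loop. The acceptance rule (accept with probability min(1, p_target/p_draft), else resample from the renormalized residual max(0, p_target − p_draft)) is what makes the output distribution match standard decoding from the target model.

```python
import random

random.seed(0)

VOCAB = ["a", "b", "c"]

# Hypothetical stand-ins for real models: each returns a probability
# distribution over VOCAB given the context so far.
def draft_dist(context):
    # The small, fast draft model.
    return {"a": 0.5, "b": 0.3, "c": 0.2}

def target_dist(context):
    # The large target model: similar but not identical distribution.
    return {"a": 0.6, "b": 0.3, "c": 0.1}

def sample(dist):
    r = random.random()
    acc = 0.0
    for tok, p in dist.items():
        acc += p
        if r < acc:
            return tok
    return tok  # guard against floating-point rounding

def speculative_step(context, k=4):
    """One round: the draft proposes k tokens, the target verifies them."""
    # 1. Draft model proposes k candidate tokens autoregressively (cheap).
    proposed = []
    ctx = list(context)
    for _ in range(k):
        tok = sample(draft_dist(ctx))
        proposed.append(tok)
        ctx.append(tok)

    # 2. Target model scores all candidate prefixes in one forward pass
    #    (simulated here by a loop), accepting each token with
    #    probability min(1, p_target / p_draft).
    accepted = []
    ctx = list(context)
    for tok in proposed:
        p_t = target_dist(ctx)[tok]
        p_d = draft_dist(ctx)[tok]
        if random.random() < min(1.0, p_t / p_d):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # Rejection: resample from the renormalized residual
            # max(0, p_target - p_draft) so the overall output still
            # follows the target model's distribution exactly.
            residual = {t: max(0.0, target_dist(ctx)[t] - draft_dist(ctx)[t])
                        for t in VOCAB}
            z = sum(residual.values())
            residual = {t: p / z for t, p in residual.items()}
            accepted.append(sample(residual))
            break  # everything after the first rejection is discarded
    return accepted

out = speculative_step([], k=4)
print(out)
```

Each call yields between 1 and k tokens for a single expensive target pass; the closer the draft distribution is to the target's, the more of the k candidates survive verification.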