Speculative decoding is an inference acceleration technique that uses a smaller, faster draft model to propose multiple tokens, which a larger target model then verifies in parallel, reducing the number of expensive forward passes required for generation. Standard autoregressive generation runs one forward pass per token: a 100-token response requires 100 sequential forward passes through the target model. Speculative decoding changes this: the draft model rapidly generates several candidate tokens (typically 3-8), then the target model evaluates all of them in a single forward pass. Accepted tokens are kept; the first rejected token triggers regeneration from that point. If the draft model is well aligned with the target model (i.e., it predicts similar distributions), most candidates are accepted and throughput improves sharply.

The technique exploits two key observations: verification is cheaper than generation (checking whether a batch of candidate tokens is acceptable takes a single forward pass regardless of how many tokens are checked), and small models can often predict what large models would generate (simple continuations are predictable). Speedups of 2-3x are common, with output quality identical to standard decoding, since rejected tokens are resampled correctly from the target model's distribution. The draft model can be a separate small model, a subset of the target model's layers, or even a simple n-gram model. Speculative decoding is increasingly important as models grow larger and inference costs dominate.
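The propose-then-verify loop can be sketched with toy stand-in models. This is a minimal illustration, not a production implementation: `draft_dist` and `target_dist` are hypothetical functions returning fixed distributions over a tiny vocabulary, and the target's "parallel" verification pass is simulated with a loop. The acceptance rule (accept with probability min(1, p_target/p_draft), else resample from the renormalized residual max(0, p_target − p_draft)) is what makes the output distribution match standard decoding from the target model.

```python
import random

random.seed(0)

VOCAB = ["a", "b", "c"]

# Hypothetical stand-ins for real models: each returns a probability
# distribution over VOCAB given the context so far.
def draft_dist(context):
    # The small, fast draft model.
    return {"a": 0.5, "b": 0.3, "c": 0.2}

def target_dist(context):
    # The large target model: similar but not identical distribution.
    return {"a": 0.6, "b": 0.3, "c": 0.1}

def sample(dist):
    r = random.random()
    acc = 0.0
    for tok, p in dist.items():
        acc += p
        if r < acc:
            return tok
    return tok  # guard against floating-point rounding

def speculative_step(context, k=4):
    """One round: the draft proposes k tokens, the target verifies them."""
    # 1. Draft model proposes k candidate tokens autoregressively (cheap).
    proposed = []
    ctx = list(context)
    for _ in range(k):
        tok = sample(draft_dist(ctx))
        proposed.append(tok)
        ctx.append(tok)

    # 2. Target model scores all candidate prefixes in one forward pass
    #    (simulated here by a loop), accepting each token with
    #    probability min(1, p_target / p_draft).
    accepted = []
    ctx = list(context)
    for tok in proposed:
        p_t = target_dist(ctx)[tok]
        p_d = draft_dist(ctx)[tok]
        if random.random() < min(1.0, p_t / p_d):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # Rejection: resample from the renormalized residual
            # max(0, p_target - p_draft) so the overall output still
            # follows the target model's distribution exactly.
            residual = {t: max(0.0, target_dist(ctx)[t] - draft_dist(ctx)[t])
                        for t in VOCAB}
            z = sum(residual.values())
            residual = {t: p / z for t, p in residual.items()}
            accepted.append(sample(residual))
            break  # everything after the first rejection is discarded
    return accepted

out = speculative_step([], k=4)
print(out)
```

Each call yields between 1 and k tokens for a single expensive target pass; the closer the draft distribution is to the target's, the more of the k candidates survive verification.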