Speculative Decoding: The Mathematics of Acceptance Rate

Speculative Decoding is currently the closest thing we have to “cheating physics” in Machine Learning inference. When implemented correctly, it can make Large Language Models (LLMs) 3x faster with zero degradation in output quality.

When implemented poorly, it can actually make your models slower while burning twice the compute budget.

The difference between these two outcomes rests entirely on a single, critical metric: the acceptance rate ($\alpha$).

The Promise of Speculative Decoding

To understand speculative decoding, we must revisit the fundamental bottleneck of LLM generation: it is memory-bound. During autoregressive decoding, the slow part is streaming the model's weights from GPU memory into the compute units; the arithmetic itself is comparatively cheap.

Because modern GPUs are so massively parallel, evaluating a batch of 5 tokens in a single forward pass takes almost exactly as long as evaluating just 1 token.
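A quick back-of-envelope calculation makes the bottleneck concrete. The numbers below are illustrative assumptions (a 70B-parameter target in FP16 on an H100-class GPU, ignoring KV-cache traffic and multi-GPU sharding), not measurements:

```python
# Sketch: why decoding is memory-bound. Every decoding step has to stream
# essentially all of the weights from GPU memory, whether it scores 1 token or 5.
# All numbers are assumptions for illustration, not benchmarks.

PARAMS = 70e9            # assumed 70B-parameter target model
BYTES_PER_PARAM = 2      # FP16 weights
HBM_BANDWIDTH = 3.35e12  # bytes/s, roughly H100-class memory bandwidth

weight_bytes = PARAMS * BYTES_PER_PARAM
step_time_s = weight_bytes / HBM_BANDWIDTH   # time just to read the weights once

print(f"~{step_time_s * 1e3:.0f} ms per decoding step spent moving weights")
```

Whether that forward pass scores 1 token or 5 changes the arithmetic, not the roughly 40 ms of weight movement that dominates the step.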

Speculative decoding exploits this hardware quirk through a two-model architecture:

  1. The Draft Model: A smaller, highly efficient draft model generates a sequence of candidate tokens very quickly.
  2. The Target Model: The massive, slow LLM takes those candidate tokens and efficiently verifies them all in a single forward pass.
  3. The Result: If the target model agrees with the draft model’s candidates, you get multiple tokens generated in the time it usually takes to generate one. This speeds up text generation dramatically without sacrificing any output quality (a minimal sketch of the loop follows this list).
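Here is that draft-and-verify loop as a minimal, runnable sketch. It uses greedy acceptance (a drafted token is kept only if it matches the target's own greedy choice) rather than the full rejection-sampling scheme from the speculative decoding papers, and it stands in toy lookup tables for the two models so the example runs anywhere:

```python
# Toy draft-and-verify loop. `draft_next` and `target_next` stand in for real
# models; the draft deliberately diverges from the target after token 3 so a
# rejection actually happens.

DRAFT_TABLE  = {0: 1, 1: 2, 2: 3, 3: 9, 9: 5}   # toy "small model"
TARGET_TABLE = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}   # toy "large model" (ground truth)

def draft_next(token):
    return DRAFT_TABLE.get(token, 0)

def target_next(token):
    return TARGET_TABLE.get(token, 0)

def speculative_step(prefix, gamma=4):
    """Draft `gamma` tokens cheaply, then verify them with the target."""
    # 1. Draft model proposes gamma candidate tokens autoregressively (cheap).
    candidates = []
    last = prefix[-1]
    for _ in range(gamma):
        last = draft_next(last)
        candidates.append(last)

    # 2. Target model scores every position (in a real system this is a single
    #    batched forward pass); accept the longest agreeing prefix.
    accepted = []
    last = prefix[-1]
    for cand in candidates:
        expected = target_next(last)
        if cand != expected:
            accepted.append(expected)   # the target's correction replaces the reject
            return accepted
        accepted.append(cand)
        last = cand

    # 3. All candidates accepted: the target contributes one bonus token.
    accepted.append(target_next(last))
    return accepted

sequence = [0]
while len(sequence) < 10:
    sequence.extend(speculative_step(sequence))
print(sequence)   # several tokens per target "pass", identical to target-only decoding
```

Two properties matter here: the output is exactly what target-only greedy decoding would have produced, and each call to speculative_step costs one target pass but can emit several tokens.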

The Production Reality: Acceptance Rate

Here is where most production deployments break down: engineering teams spend weeks optimizing the parameter count of the draft model before they optimize the acceptance rate ($\alpha$). This is backwards.

The acceptance rate ($\alpha$) is simply the probability that a token proposed by your draft model is accepted by your target model during verification.

If you use a generic, off-the-shelf 7B parameter model to draft for a 70B target model, and your $\alpha$ is below 40%, you are in trouble. You are paying the compute overhead of running two inference engines simultaneously, but the target model keeps rejecting the draft tokens. At that point, your system is running slower than if you had just used the target model alone.
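To see why low $\alpha$ hurts so much, use the standard simplifying model from the speculative decoding literature: assume each drafted token is accepted independently with probability $\alpha$, and the draft model proposes $\gamma$ tokens per verification pass. The expected number of tokens produced per target forward pass is then

$$E[\text{tokens per pass}] = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}$$

At $\alpha = 0.35$ with $\gamma = 4$, that is only about 1.5 tokens per pass, and the cost of running the draft model eats most of the gain; at $\alpha = 0.75$ it is about 3 tokens per pass.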

The Decision Framework

You should not use speculative decoding blindly. Here is the decision framework I use when evaluating architectures:

  • Predictable Domains = High $\alpha$: If your LLM is generating code, writing structured JSON, or filling out repetitive templates, the next tokens are highly predictable. Your draft model will guess correctly frequently. Result: 2-3x speedup. Use it.
  • Creative Domains = Low $\alpha$: If your LLM is writing poetry, brainstorming novel ideas, or engaging in open-ended creative chat, the variance of possible next tokens is massive. The draft model will guess wrong constantly. Result: Slower inference. Turn it off.

The Real Leverage: Custom Draft Heads

The true unlock in speculative decoding is not finding the perfect open-source draft model; it is training a custom draft head specifically on your production data distribution.

A tiny 100M parameter model fine-tuned on your specific company traffic will drastically outperform a generic 7B model. Why? Because $\alpha$ depends on distribution alignment, not parameter count.

The math is eye-opening (a quick sanity check follows the list):

  • A generic 7B draft model at 35% $\alpha$ yields a ~1.2x speedup. The complexity is barely worth it.
  • A custom 100M draft model at 75% $\alpha$ yields a ~2.8x speedup. This is a game-changer for your cloud bill.
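Here is that sanity check as a toy model, assuming $\gamma = 4$ drafted tokens per verification pass and a draft-to-target cost ratio of roughly 0.1 for the 7B model and 0.002 for the 100M model (assumed ratios, not measurements):

```python
# Toy speedup model: expected tokens per verification pass (see the formula
# above) divided by the cost of that pass, measured in target forward passes.
# alpha, gamma, and the cost ratio c are illustrative assumptions.

def speedup(alpha, gamma=4, c=0.1):
    expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    cost_per_pass = gamma * c + 1     # gamma cheap draft passes + 1 target pass
    return expected_tokens / cost_per_pass

print(f"generic 7B draft,  alpha=0.35: {speedup(0.35, c=0.10):.2f}x")   # ~1.1x
print(f"custom 100M draft, alpha=0.75: {speedup(0.75, c=0.002):.2f}x")  # ~3.0x
```

The exact multipliers depend on $\gamma$, the real cost ratio, and scheduling overhead, but this simple model lands in the same ballpark as the ~1.2x and ~2.8x figures above: the cheap, well-aligned draft head wins on both terms of the fraction.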

I’ve seen custom models trained on just 50,000 examples beat off-the-shelf 7B models by 40 percentage points in $\alpha$. Before you deploy speculative decoding, profile your domain predictability and know your acceptance rate.
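One way to do that profiling offline is to replay a sample of production text through both models and measure greedy top-1 agreement over teacher-forced positions, as a cheap proxy for $\alpha$. A minimal sketch, assuming both models share a tokenizer; the model names are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model names; substitute your own draft and target checkpoints.
# This proxy assumes the two models share a tokenizer/vocabulary.
draft = AutoModelForCausalLM.from_pretrained("your-draft-model")
target = AutoModelForCausalLM.from_pretrained("your-target-model")
tok = AutoTokenizer.from_pretrained("your-target-model")

@torch.no_grad()
def greedy_agreement(texts, max_positions=128):
    """Fraction of positions where draft and target pick the same next token."""
    agree, total = 0, 0
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids[:, :max_positions]
        draft_top1 = draft(ids).logits.argmax(-1)
        target_top1 = target(ids).logits.argmax(-1)
        agree += (draft_top1 == target_top1).sum().item()
        total += draft_top1.numel()
    return agree / total

# Run it on a sample of real production prompts/completions:
# print(greedy_agreement(sampled_production_texts))
```

Greedy top-1 agreement is not identical to the formal acceptance probability under rejection sampling, but it is a useful proxy for deciding whether your domain sits closer to the 35% or the 75% regime before you commit to the deployment.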

Written on February 10, 2026