Speculative Decoding: The Art of Being Good Enough
May 2026
Large language models are slow for a boring reason: they have to write one token at a time. When you ask a model to write an answer, it does not produce the whole response at once. It predicts the next token, appends it to the context, and then predicts the next one conditioned on everything generated so far. This process repeats until the completion is finished.
This process is called decoding. Decoding is where the model turns probabilities into text. It is also where a lot of waiting happens.
Speculative decoding is a trick for making this faster. The idea is simple:
- A small model guesses a few tokens ahead.
- A large model checks those guesses.
- The system keeps the guesses that are good enough.
The small model does not need to be perfect. It only needs to be right often enough to save time.
Why LLMs Are Slow
Large language models are autoregressive. That means they generate text from left to right. Each new token depends on the tokens before it. This seems natural. It is how we read text. It is also how we write text.
But it creates a systems problem. The model cannot safely generate token 20 before it knows token 19. Token 20 depends on token 19. Token 19 depends on token 18. The dependency chain is long. This makes decoding sequential.
So decoding becomes a loop:
- Run the model.
- Pick one token.
- Add that token to the context.
- Run the model again.
- Repeat.
If the answer has 200 tokens, the model needs roughly 200 sequential steps, each one waiting on the step before it. That is why LLMs can feel slow even on powerful hardware.
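The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a real inference engine: `toy_next_token` and `TOY_TABLE` are hypothetical stand-ins for a model, and the `<eos>` stop token is an assumption made for the example.

```python
from typing import Callable, List

def greedy_decode(next_token: Callable[[List[str]], str],
                  prompt: List[str],
                  max_new_tokens: int,
                  stop_token: str = "<eos>") -> List[str]:
    """Generate one token per model call; each call sees all earlier tokens."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(context)  # run the model and pick one token
        if token == stop_token:
            break
        context.append(token)        # add that token to the context, then repeat
    return context[len(prompt):]

# Hypothetical toy "model": the next token depends only on the last token.
TOY_TABLE = {"is": "Paris", "Paris": ",", ",": "a", "a": "city", "city": "<eos>"}

def toy_next_token(context: List[str]) -> str:
    return TOY_TABLE.get(context[-1], "<eos>")

generated = greedy_decode(toy_next_token, ["The", "capital", "of", "France", "is"], 10)
print(generated)  # ['Paris', ',', 'a', 'city']
```

Four generated tokens cost four model calls plus one more call to discover the stop token. Nothing in the loop can start early, because every call needs the token produced by the previous one.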
Normal Decoding in Plain English
Before speculative decoding, we need normal decoding. At each step, the model outputs a list of scores for possible next tokens. These scores are called logits.
A logit is not a probability yet. It is a raw score. The system converts logits into probabilities using a function called softmax, which exponentiates the logits and normalizes them so that all probabilities sum to 1. After that, the model has a probability distribution over the next token.
Softmax converts the model's raw scores into probabilities:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

Here, z_i is the score for a specific token, and the denominator adds up the exponentiated scores for all possible tokens. The result is the probability that the model assigns to token i.
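As a small sketch of that conversion, here is softmax written with only the Python standard library. The three logit values are made up for illustration; subtracting the maximum logit before exponentiating is a common numerical-stability step and does not change the result.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Exponentiate each logit and normalize so the results sum to 1."""
    m = max(logits)  # shifting by the max avoids overflow in exp()
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for "Paris", "Lyon", "banana".
probs = softmax([5.0, 2.0, -3.0])
print([round(p, 4) for p in probs])
```

The ordering of the logits is preserved: the highest score becomes the highest probability.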
A probability distribution tells us how likely each token is. For the prompt:
The capital of France is
The model might assign:
| Token | Probability |
|---|---|
| Paris | high |
| Lyon | lower |
| banana | almost none |
The decoding algorithm then chooses one token. Greedy decoding always picks the most likely token. Sampling picks from the probability distribution. Temperature rescales the logits before softmax: low values sharpen the distribution, high values flatten it and make sampling more random. Top-k sampling limits the choice to the k most likely tokens. Top-p sampling, also called nucleus sampling, limits the choice to the smallest set of most likely tokens whose combined probability reaches a chosen threshold p.
These methods affect style and randomness. But they all share the same bottleneck: they usually generate one token per large model call. Speculative decoding tries to generate more than one useful token per large model call.
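The decoding choices described in this section can be sketched as small functions over a token-to-probability dictionary. This is an illustrative sketch using only the standard library; the function names and the example logits are assumptions, not any particular library's API.

```python
import math
import random
from typing import Dict

Dist = Dict[str, float]

def apply_temperature(logits: Dict[str, float], temperature: float) -> Dist:
    """Softmax over logits divided by the temperature."""
    scaled = {t: z / temperature for t, z in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(z - m) for t, z in scaled.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

def greedy(probs: Dist) -> str:
    """Pick the single most likely token."""
    return max(probs, key=probs.get)

def top_k_filter(probs: Dist, k: int) -> Dist:
    """Keep the k most likely tokens and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[:k])
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

def top_p_filter(probs: Dist, p: float) -> Dist:
    """Keep the smallest set of top tokens whose mass reaches p, then renormalize."""
    kept: Dist = {}
    cumulative = 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {t: q / total for t, q in kept.items()}

def sample(probs: Dist, rng: random.Random) -> str:
    """Draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Hypothetical logits for the prompt "The capital of France is".
logits = {"Paris": 5.0, "Lyon": 2.0, "banana": -3.0}
probs = apply_temperature(logits, temperature=1.0)
sharper = apply_temperature(logits, temperature=0.5)
choice = sample(top_k_filter(probs, 2), random.Random(0))
```

Every one of these functions still consumes the output of one large-model call and yields one token, which is the bottleneck the rest of the article addresses.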
The Core Idea: Guess First, Verify Later
Speculative decoding uses two models. The first model is the draft model. It is small and fast. Its job is to guess the next few tokens. The second model is the target model. It is large and accurate. Its job is to check the draft.
The target model is the model we actually care about. We want the final answer to behave like it came from this model. The draft model is only an assistant.
The draft model writes in pencil. The target model reviews in pen.
If the draft is good, the target model accepts it. If the draft is bad, the target model fixes it.
The speedup comes from a useful property of transformers. A transformer is the neural network architecture behind most modern LLMs. During decoding, a transformer is slow when it has to run again and again for one token at a time. But it can score several proposed tokens in parallel once those tokens are available.
That gives us the opening. Instead of asking the large model for one token, we let the small model propose several tokens. Then the large model checks those tokens in one verification pass. If the large model accepts three or four tokens, we have moved forward several steps while paying for only one large-model verification call.
That is the win.
Speculative Decoding Step by Step
Speculative decoding has two phases:
- Prepare or choose a draft model.
- Use the draft model during inference.
The important part happens during inference. But the draft model matters a lot. A bad draft model wastes time. A slow draft model wastes time. A good draft model saves time because the target model accepts more of its guesses.
Where the Draft Model Comes From
The draft model can come from several places. It can be a smaller model from the same family as the target model. For example, a small code model may draft for a large code model.
It can be a distilled model. Distillation means training a smaller model to imitate a larger model. The large model is often called the teacher. The small model is called the student.
It can be a special lightweight module attached to the target model. This is sometimes called a speculator or draft head. It can even be a simpler method, such as an n-gram guesser.
An n-gram is a sequence of n tokens. For example, "once upon a" is a 3-gram. If a phrase appears often, an n-gram method may guess the next token from repeated patterns.
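A very simple n-gram guesser can be sketched as a lookup over the text seen so far: find the most recent earlier occurrence of the last n tokens and copy what followed it. The function name `ngram_draft` and its parameters are hypothetical; real systems use more careful matching.

```python
from typing import List

def ngram_draft(context: List[str], n: int = 3, draft_length: int = 4) -> List[str]:
    """Propose up to draft_length tokens by copying what followed the
    most recent earlier occurrence of the last n tokens."""
    if len(context) <= n:
        return []
    key = context[-n:]
    # Search backwards, skipping the suffix itself (which starts at len - n).
    for start in range(len(context) - n - 1, -1, -1):
        if context[start:start + n] == key:
            return context[start + n:start + n + draft_length]
    return []  # no repeated pattern found, so no draft

context = "once upon a time there was a cat . once upon a".split()
print(ngram_draft(context))  # ['time', 'there', 'was', 'a']
```

This guesser needs no neural network at all, which makes it nearly free to run. It only helps when the text actually repeats itself.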
The draft model does not need to be a genius. It needs to be aligned with the target model in the places where text is predictable. That is the key: the draft model is useful when it often predicts tokens the target model would have picked anyway.
What Happens During Inference
Assume the target model has already produced this prefix:
The capital of France is
Now the draft model proposes four tokens:
Paris, and, it, is
This proposed sequence is called a draft. The number of proposed tokens is called the draft length.
Now the target model checks the draft. The target model does not just say "yes" or "no" to the whole sequence. It scores each token under the target model's own probability distribution.
A target distribution is the probability distribution produced by the target model. A draft distribution is the probability distribution produced by the draft model. The system compares the draft token against what the target model would allow.
If the draft token is acceptable, the system keeps it. If the draft token is not acceptable, the system rejects it and asks the target model to provide a replacement. Once a token is rejected, the later draft tokens usually get discarded. They were based on a future that no longer exists.
Here is a simple example.
Prompt:
The capital of France is
Draft model proposes:
Paris and it is
The target model verifies:
| Position | Draft token | Target decision | Result |
|---|---|---|---|
| 1 | Paris | Accept | Keep "Paris" |
| 2 | and | Reject | Replace with target token |
| 3 | it | Discard | Not used |
| 4 | is | Discard | Not used |
The system keeps only one draft token, plus the target's replacement at position 2. That is not a huge win. But if the target accepts more tokens, the benefit grows.
Now try a better draft.
Prompt:
The capital of France is
Draft model proposes:
Paris, a city
The target model verifies:
| Position | Draft token | Target decision | Result |
|---|---|---|---|
| 1 | Paris | Accept | Keep |
| 2 | , | Accept | Keep |
| 3 | a | Accept | Keep |
| 4 | city | Accept | Keep |
Now the system moves forward by four tokens after one large-model verification pass. That is a real speedup.
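Both walkthroughs can be reproduced with a short sketch of greedy verification. Everything model-like here is a hypothetical toy: `TOY_TARGET` pretends the target's preferred next token depends only on the last token, and the replacement "," for the rejected "and" is an assumption chosen to match the second example.

```python
from typing import Callable, List

# Hypothetical toy target model: preferred next token given the last token.
TOY_TARGET = {"is": "Paris", "Paris": ",", ",": "a", "a": "city",
              "and": "it", "it": "is"}

def toy_target_next(context: List[str]) -> str:
    return TOY_TARGET.get(context[-1], "<eos>")

def verify_greedy(target_next: Callable[[List[str]], str],
                  prefix: List[str],
                  draft: List[str]) -> List[str]:
    """Return the tokens the system keeps after one verification pass."""
    # The target's preferred token at every draft position, each conditioned
    # on the prefix plus the draft tokens before it. A real transformer
    # computes all of these in one forward pass; the toy loops instead.
    target_choices = [target_next(prefix + draft[:i]) for i in range(len(draft))]
    kept: List[str] = []
    for proposed, wanted in zip(draft, target_choices):
        if proposed == wanted:
            kept.append(proposed)  # accept the draft token
        else:
            kept.append(wanted)    # reject: use the target's token instead
            break                  # later draft tokens are discarded
    return kept

prefix = ["The", "capital", "of", "France", "is"]
first = verify_greedy(toy_target_next, prefix, ["Paris", "and", "it", "is"])
second = verify_greedy(toy_target_next, prefix, ["Paris", ",", "a", "city"])
print(first)   # ['Paris', ',']
print(second)  # ['Paris', ',', 'a', 'city']
```

In the first case the system keeps one draft token and the target's replacement. In the second it keeps all four. Many implementations also take one extra token from the target when the whole draft is accepted; this sketch leaves that out to match the tables above.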
The Technical Acceptance Rule
The exact rule depends on the decoding method. For greedy decoding, the idea is easy. If the draft model proposes the same token the target model would have chosen, accept it. If not, reject it.
Sampling is more subtle. With sampling, the target model does not have just one correct next token. It has a probability distribution. Several tokens may be valid.
A common speculative sampling rule uses rejection sampling. Rejection sampling is a method for accepting samples from one distribution in a way that makes them behave like samples from another distribution.
Let:
- q be the draft model's probability for a token.
- p be the target model's probability for the same token.
If the draft model proposes a token, the system accepts it with probability min(1, p / q). If the target model likes the token at least as much as the draft model does, the ratio is at least 1 and the token is always accepted. If the draft model overestimated the token, it is accepted only with probability p / q and may be rejected.
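When the system rejects a token, it samples a replacement from an adjusted target distribution: one proportional to max(0, p - q), the part of the target's probability that the draft model under-covered. The goal is important: the final output should still behave as if it came from the target model.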
This is why speculative decoding is not just "use a smaller model."
The smaller model proposes. The larger model disposes.
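The sampling rule in this section can be sketched for a single token position. The distributions below are made-up numbers, the function names are hypothetical, and the sketch assumes both models assign nonzero probability to the same small set of tokens. `output_distribution` computes, without any randomness, the exact distribution of the token that comes out, so the claim that the result behaves like the target model can be checked directly.

```python
import random
from typing import Dict

Dist = Dict[str, float]

def acceptance_probability(p: Dist, q: Dist, token: str) -> float:
    """Accept a drafted token with probability min(1, p / q)."""
    return min(1.0, p[token] / q[token])

def residual_distribution(p: Dist, q: Dist) -> Dist:
    """Replacement distribution: max(0, p - q), renormalized."""
    raw = {t: max(0.0, p[t] - q[t]) for t in p}
    total = sum(raw.values())
    if total == 0.0:  # p equals q, so a rejection can never happen
        return dict(p)
    return {t: r / total for t, r in raw.items()}

def speculative_sample(p: Dist, q: Dist, rng: random.Random) -> str:
    """Draft a token from q, then accept it or resample from the residual."""
    tokens = list(q)
    proposed = rng.choices(tokens, weights=[q[t] for t in tokens], k=1)[0]
    if rng.random() < acceptance_probability(p, q, proposed):
        return proposed
    residual = residual_distribution(p, q)
    r_tokens = list(residual)
    return rng.choices(r_tokens, weights=[residual[t] for t in r_tokens], k=1)[0]

def output_distribution(p: Dist, q: Dist) -> Dist:
    """Exact distribution of speculative_sample's result."""
    accepted = {t: q[t] * acceptance_probability(p, q, t) for t in p}
    reject_prob = 1.0 - sum(accepted.values())
    residual = residual_distribution(p, q)
    return {t: accepted[t] + reject_prob * residual[t] for t in p}

# Hypothetical target (p) and draft (q) distributions over three tokens.
p = {"Paris": 0.90, "Lyon": 0.08, "banana": 0.02}
q = {"Paris": 0.70, "Lyon": 0.20, "banana": 0.10}
result = output_distribution(p, q)
token = speculative_sample(p, q, random.Random(0))
```

For these numbers, `result` matches `p` up to floating-point error: the draft's overconfidence in "Lyon" and "banana" is trimmed by rejections, and the rejected mass is handed back to "Paris".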
Why the Target Model Can Check Several Tokens at Once
This part is the main systems trick. The draft model generates several tokens cheaply. The target model then receives the current prefix plus the draft tokens. It can compute probabilities for those draft positions in parallel.
This does not mean the target model ignores token order. It still respects order. Token 4 depends on token 3. But during verification, the draft has already provided token 3. So the target model can score the proposed continuation in a single forward pass.
A forward pass is one run through the neural network. The target model still does expensive work. But it may do less expensive work per accepted token. That is what matters.
The "Good Enough" Tradeoff
Speculative decoding works because the draft model is cheaper than the target model. But cheap is not enough. The draft model must also be good enough.
There are three forces:
| Force | Why it matters |
|---|---|
| Speed | If the draft model is slow, it adds overhead. |
| Accuracy | If the target rejects most draft tokens, the system wastes time. |
| Target cost | If the target model is already small or heavily optimized, speculative decoding may not help much. |
This creates the central tradeoff. A tiny draft model is fast but often wrong. A large draft model is accurate but expensive. The best draft model sits in the middle. It is not the smartest model. It is the model that creates the most accepted tokens per unit of extra compute.
That is the real objective: not accuracy alone, not speed alone, but accepted useful work.
This is why the phrase "good enough" matters. The draft model is allowed to be wrong. The target model will catch it. The draft model just cannot be wrong too often.
A draft length of one is not very useful. The system only guesses one token ahead. A moderate draft length can work well. A very long draft length can waste work. If the target model rejects an early token, the later draft tokens get discarded. The lesson: more guessing is not always better.
The whole method depends on accepted tokens per verification pass. If the target accepts one token per pass, speculative decoding does little. If the target accepts four tokens per pass, speculative decoding can help a lot.
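A rough back-of-the-envelope model makes this tradeoff concrete. The sketch below rests on simplifying assumptions that are not from this article's measurements: each draft token is accepted independently with the same probability `alpha`, the target contributes one token per pass (a replacement, or an extra token when the whole draft is accepted), a verification pass costs about the same as one normal target step, and each draft step costs `cost_ratio` times a target step. All numbers are illustrative.

```python
def expected_tokens_per_pass(alpha: float, draft_length: int) -> float:
    """Expected tokens gained per target pass: accepted draft tokens
    plus the one token the target supplies.
    Closed form of 1 + alpha + alpha**2 + ... + alpha**draft_length."""
    if alpha >= 1.0:
        return float(draft_length + 1)
    return (1.0 - alpha ** (draft_length + 1)) / (1.0 - alpha)

def estimated_speedup(alpha: float, draft_length: int, cost_ratio: float) -> float:
    """Tokens per pass divided by the cost of a pass, measured in target steps."""
    cost = draft_length * cost_ratio + 1.0
    return expected_tokens_per_pass(alpha, draft_length) / cost

# Hypothetical settings: draft model costs 5% of a target step.
for length in (1, 4, 16):
    print(length, round(estimated_speedup(0.8, length, 0.05), 2))

# A draft that is never accepted makes things slower, not faster.
print(round(estimated_speedup(0.0, 4, 0.05), 2))
```

Under these assumptions, the gain from a longer draft flattens out because a later token only counts if every earlier one was accepted, while the drafting cost keeps growing linearly.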
When Speculative Decoding Works Best
Speculative decoding works best when generation has predictable stretches. Examples include:
- boilerplate code,
- common phrases,
- structured formats,
- repetitive text,
- simple factual explanation,
- low-temperature generation.
It also helps more when the target model is large and decoding is limited by memory movement or sequential latency.
A memory-bound workload is limited by how fast the system can move data, not by how many raw computations it can perform. LLM decoding can become memory-bound because each token requires the model to read large amounts of model state and cached context.
A KV cache is stored attention data from previous tokens. It helps the model avoid recomputing everything from scratch. But the cache also uses memory, and reading it can become expensive.
Speculative decoding can reduce the number of target-model decoding steps. That can reduce waiting time.
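To get a feel for why reading cached context is expensive, here is a rough size estimate for a KV cache. The formula counts one key vector and one value vector per token, per layer, per key-value head. The model shape below is hypothetical and does not describe any specific model.

```python
def kv_cache_bytes(num_layers: int,
                   num_kv_heads: int,
                   head_dim: int,
                   seq_len: int,
                   bytes_per_value: int,
                   batch_size: int = 1) -> int:
    """Approximate KV cache size: keys and values for every layer and token."""
    keys_and_values = 2
    return (keys_and_values * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)

# Hypothetical model: 32 layers, 8 KV heads of size 128, 4096 tokens, 16-bit values.
size = kv_cache_bytes(32, 8, 128, 4096, 2)
print(size / 1024**2, "MiB")  # 512.0 MiB
```

In ordinary decoding, data on this scale is read, along with the model weights, for every generated token. Fewer target-model steps means fewer such reads.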
A vector is a list of numbers. Models use vectors to represent tokens, meanings, and internal states. A hidden state is an internal vector produced by a model layer. Cosine similarity measures whether two vectors point in a similar direction. A value near 1 means they are closely aligned. A value near 0 means they are close to orthogonal and share little direction.
Vector similarity can show where the draft and target models seem to agree internally. But do not overstate it. Vector similarity is a diagnostic. It is not the acceptance rule itself. The actual acceptance rule depends on token probabilities.
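Cosine similarity itself is a short calculation. The sketch below uses plain Python lists as stand-ins for hidden states; the example vectors are made up.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(round(same_direction, 6), orthogonal)
```

Two vectors pointing the same way score close to 1 regardless of their lengths, which is why the measure says something about direction and nothing about token probabilities.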
Conclusion: Good Enough Is a Systems Idea
Speculative decoding is powerful because it changes the unit of progress:
| Normal decoding | Speculative decoding |
|---|---|
| What is the next token? | Which guessed tokens can we keep? |
That small change matters. The target model no longer has to invent every token from scratch. It can verify a short future proposed by a cheaper model. When the draft is good, the system moves several tokens forward. When the draft is bad, the target model corrects it.
The whole method rests on a practical insight: you do not always need the best model to make the first guess. You need the best model to make the final call.
That is speculative decoding. It is not about being perfect. It is about being good enough often enough to be fast.