SOTA: When Good Enough Isn't Good Enough
May 2026
Small models and state-of-the-art models can look surprisingly similar.
Give them the same prompt and their logits may be close. They may rank the same tokens near the top. They may agree on most of the easy words. They may both know how to finish common phrases, write fluent sentences, and produce answers that sound reasonable.
But one model feels like autocomplete.
The other feels like intelligence.
That gap is the point of this article.
SOTA stands for state of the art. It usually refers to the strongest model available at a given time. But the gap between a small model and a SOTA model is rarely large on every token.
Most tokens are easy.
The hard part is choosing correctly when several answers look plausible.
What Logits Are
At each step, a language model produces logits.
A logit is a raw score for a possible next token. If the vocabulary has 100,000 tokens, the model produces 100,000 scores. Each score says how much the model likes that token as the next one.
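The logits are then converted into probabilities with softmax:

$$p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

Here, z_i is the logit for token i and p_i is its probability. The sum runs over every token in the vocabulary. Softmax turns all the raw scores into a probability distribution. The probabilities add up to 1.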
For the prompt:
The container crashed because it ran out of
a model should assign high probability to:
memory
It may assign lower probability to:
disk
time
power
All of those words can appear in technical writing. But only one is the obvious completion.
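To make the softmax step concrete, here is a minimal sketch in plain Python. The four logit values are made up for illustration; a real model would produce one score for every token in its vocabulary.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtracting the max logit avoids overflow and does not change the result.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the four candidate tokens (not taken from any real model).
tokens = ["memory", "disk", "time", "power"]
logits = [9.0, 6.5, 5.0, 4.0]
probs = softmax(logits)

for token, p in zip(tokens, probs):
    print(f"{token:>8}: {p:.3f}")
```

With these assumed scores, a gap of a few logit points is enough for memory to take most of the probability mass.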
Small models usually handle obvious completions well. That is why they can look close to large models.
The difference shows up when the completion is not obvious.
Easy Tokens Hide the Gap
Most generated tokens are not hard.
Commas are easy. Common words are easy. Boilerplate is easy. If the prompt says:
The API returned a 404 because the resource was not
many models will choose:
found
There is not much intelligence required. The phrase is common.
This is why small models can seem strong. They get many easy tokens right. Their outputs are fluent. Their logits often resemble larger models on predictable text.
But language is not equally hard at every position.
Some positions are low entropy. The next token is obvious.
Some positions are high entropy. Many tokens could work.
Entropy is a measure of uncertainty. A low-entropy distribution has most of its probability on one token. A high-entropy distribution spreads probability across many tokens.
For example:
The model performed well offline but failed after deployment because the training data was
Possible continuations include:
stale
biased
leaked
synthetic
incomplete
unrepresentative
Now the model has to understand the situation. A generic answer may sound fine. A precise answer depends on the context.
This is where SOTA starts to matter.
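The contrast between low-entropy and high-entropy positions can be sketched with Shannon entropy. Both distributions below are hypothetical: one puts almost everything on a single token, the other spreads probability evenly over six candidates, like the six continuations listed above.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical distributions over six candidate tokens.
low = [0.95, 0.01, 0.01, 0.01, 0.01, 0.01]  # one obvious token
high = [1 / 6] * 6                           # six equally plausible tokens

print(f"low-entropy position:  {entropy(low):.2f} bits")
print(f"high-entropy position: {entropy(high):.2f} bits")
```

A uniform distribution over six tokens has entropy log2(6), about 2.58 bits. That is the highest possible value for six candidates, and it is the situation where the model's choice is least determined by surface statistics.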
Margins Matter
The margin is the gap between competing logits.
If one token has a much higher logit than every other token, the decision is easy. If two tokens have nearly the same logit, the decision is fragile.
Example:
| Token | Logit |
|---|---|
| biased | 14.21 |
| stale | 14.18 |
| incomplete | 13.97 |
The top two tokens are close. A small change can flip the ranking.
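A rough sketch of how fragile that decision is, using the logits from the table. The helper name top_margin and the 0.05 nudge are illustrative assumptions, not measurements from a real model.

```python
import math

def top_margin(logits):
    """Return the top token, the runner-up, and the logit gap between them."""
    ranked = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    (first, z1), (second, z2) = ranked[0], ranked[1]
    return first, second, z1 - z2

# Logits from the table above.
logits = {"biased": 14.21, "stale": 14.18, "incomplete": 13.97}
first, second, margin = top_margin(logits)
print(first, second, round(margin, 2))

# After softmax, the probability ratio between two tokens is exp(logit gap).
print(f"probability ratio: {math.exp(margin):.3f}")

# A hypothetical nudge of 0.05 to "stale" is larger than the margin, so the ranking flips.
nudged = dict(logits, stale=logits["stale"] + 0.05)
print(top_margin(nudged)[0])
```

A margin of 0.03 means the top token is only about 3% more probable than the runner-up, so any shift larger than that in either logit changes the winner.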
A small model might rank biased first. A stronger model might rank stale first because it noticed that the prompt was about data becoming old after deployment.
That looks like a tiny numerical difference.
But it changes the answer.
If the model chooses biased, the explanation heads toward fairness or sampling skew. If it chooses stale, the explanation heads toward distribution shift and changing production data.
Both answers can sound intelligent.
Only one may answer the prompt.
This is the difference between a plausible continuation and the right continuation.
Rank Is Often More Important Than Knowing
A small model may know the correct token exists.
That is not enough.
At inference time, the model has to choose. If the right token is ranked fifth, greedy decoding never picks it and sampling picks it only occasionally. If it is ranked first, the answer moves in the right direction.
This matters because generation is sequential. The model writes one token, adds it to the context, and then writes the next token.
A wrong token changes the future context.
The model is not just choosing a word. It is choosing the path that the rest of the answer will follow.
That is why small rank changes can have large effects.
| Correct token rank | Likely result |
|---|---|
| 1 | The answer stays on track |
| 2-5 | The model may choose a fluent wrong path |
| Lower | The correct idea may never appear |
SOTA models are better at moving the right token up the list when the decision is hard.
That is a small technical change with a large user-facing effect.
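One way to look at this is to compute the rank of the correct token directly. The sketch below assumes two hypothetical sets of logits at the same hard position, and the function names are made up for illustration.

```python
def rank_of(token, logits):
    """1-based rank of `token` when candidates are sorted by logit, highest first."""
    ordered = sorted(logits, key=logits.get, reverse=True)
    return ordered.index(token) + 1

def greedy_pick(logits):
    """Greedy decoding: always emit the highest-logit token."""
    return max(logits, key=logits.get)

# Hypothetical logits from two models; "stale" is assumed to be the correct token.
small = {"biased": 14.21, "stale": 14.18, "incomplete": 13.97}
strong = {"biased": 14.05, "stale": 14.40, "incomplete": 13.90}

for name, logits in (("small", small), ("strong", strong)):
    print(name, "rank of stale:", rank_of("stale", logits),
          "| greedy output:", greedy_pick(logits))
```

Both models score stale highly. Only the one that ranks it first emits it under greedy decoding, and every later token is then conditioned on that choice.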
The Tail Matters
People focus on the top token, but the full distribution matters.
A weaker model often gives too much probability to common answers. It likes safe phrases. It likes common explanations. It likes completions that sound familiar.
A stronger model is better at giving probability to rare but correct tokens.
That matters in technical domains.
In code, the right token might be:
await
instead of:
return
In systems, the right phrase might be:
memory bandwidth
instead of:
compute
In statistics, the right issue might be:
selection bias
instead of:
sample size
The weaker model may understand all of these phrases. The problem is not vocabulary.
The problem is probability allocation.
It knows the right answer is possible. It does not believe in it enough.
Cross-Entropy Is a Better Clue Than One Output
Training usually optimizes cross-entropy loss.
Cross-entropy is the negative log of the probability the model assigns to the correct next token. If the model gives the correct token high probability, the loss is low. If it gives the correct token low probability, the loss is high.
This is important because two models can produce the same visible token but still differ.
Suppose the correct next token is bandwidth.
| Model | Probability on bandwidth |
|---|---|
| Small model | 34% |
| SOTA model | 61% |
If both models use greedy decoding, both may output bandwidth.
The visible output is the same.
But the SOTA model is less uncertain. It has a better-shaped distribution. It is more likely to stay correct when the prompt gets harder or the sampling temperature rises.
This is why raw examples can be misleading. One output does not show the full distribution.
The logits show how close the model was to making a mistake.
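The hidden difference can be put into numbers. This sketch computes the per-token cross-entropy (the negative log probability of the correct token) for the two probabilities in the table. The function name token_loss is just an illustrative choice.

```python
import math

def token_loss(p_correct):
    """Cross-entropy for one position: negative log of the correct token's probability."""
    return -math.log(p_correct)

# Probabilities on "bandwidth" from the table above.
small_loss = token_loss(0.34)
sota_loss = token_loss(0.61)

print(f"small model loss: {small_loss:.3f} nats")
print(f"SOTA model loss:  {sota_loss:.3f} nats")
```

The visible token is the same, but the small model's loss at this position is more than twice as large (roughly 1.08 nats against 0.49). Under sampling at temperature 1, it would also emit bandwidth only about a third of the time.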
Compounding Turns Small Errors Into Big Errors
A model answer is a long chain of token choices.
One weak choice may not matter. Ten weak choices start to matter. Fifty weak choices can change the whole answer.
This is why small models often start strong and then drift.
They open well. They use the right tone. They produce fluent sentences. But over time they lose constraints. They simplify the problem. They merge ideas that should stay separate. They answer a nearby question instead of the actual one.
The answer still sounds good.
That is the danger.
Fluency hides drift.
SOTA models are better because they make fewer small wrong turns. They preserve more context. They maintain the structure of the task for longer.
They are not perfect. They just fail less often in the places where failure compounds.
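A back-of-the-envelope sketch shows why compounding is so unforgiving. It treats each hard decision as independent with a fixed chance of being right. That is a simplifying assumption, since real errors are correlated and some are recoverable, and the accuracy values are hypothetical.

```python
def chance_all_correct(per_decision_accuracy, n_decisions):
    """Probability that every one of n independent decisions is correct."""
    return per_decision_accuracy ** n_decisions

# Hypothetical per-decision accuracies for a weaker and a stronger model.
for accuracy in (0.95, 0.99):
    for n in (1, 10, 50):
        p = chance_all_correct(accuracy, n)
        print(f"accuracy {accuracy:.2f}, {n:>2} decisions: {p:.3f}")
```

Under these assumed numbers, a model that is right 99% of the time gets through 50 hard decisions without a wrong turn about 60% of the time. A model that is right 95% of the time does so less than 10% of the time. A four-point gap per decision becomes a much wider gap per answer.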
A Concrete Example
Prompt:
A cache made the system slower. Why might that happen?
A good-enough model might answer:
Caches usually make systems faster because they avoid recomputing expensive results. But they can use extra memory.
That is true, but weak. It mostly repeats the general purpose of caching.
A stronger model might answer:
The cache may have a low hit rate, so the system pays cache lookup and invalidation overhead without avoiding much work. It may also increase memory pressure, causing more garbage collection or eviction. In distributed systems, cache consistency can add network calls that are slower than recomputing locally.
The better answer is not just longer. It picked a better technical frame.
The important early tokens are things like:
hit rate
invalidation
memory pressure
consistency
Those tokens are not exotic. The small model knows them. But the SOTA model is more likely to put them near the top when the prompt calls for them.
That is the difference.
Similar Logits Do Not Mean Similar Reasoning
The logit vector is the final output of a lot of internal computation.
Before logits, the model builds hidden states. A hidden state is an internal vector that represents information from the current context. Attention moves information between tokens. MLP layers transform features. The final layer maps the hidden state into logits.
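The last step, mapping a hidden state into logits, can be sketched as one dot product per vocabulary token. Everything here is a toy assumption: the sizes, the numbers, and the function name unembed. Real models use learned weights, thousands of hidden dimensions, and usually a normalization step before this projection.

```python
def unembed(hidden, weights):
    """Map a hidden vector to one logit per vocabulary token via dot products."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weights]

# Toy sizes: hidden dimension 3, vocabulary of 4 tokens. All numbers are made up.
vocab = ["found", "allowed", "ready", "valid"]
hidden = [0.5, -1.0, 2.0]
weights = [
    [1.0, 0.0, 2.0],   # row for "found"
    [0.5, 0.5, 0.5],   # row for "allowed"
    [0.0, 1.0, 1.0],   # row for "ready"
    [-1.0, 0.0, 1.0],  # row for "valid"
]

logits = unembed(hidden, weights)
best = vocab[logits.index(max(logits))]
print(dict(zip(vocab, logits)), "->", best)
```

Two very different hidden vectors can yield the same top token after this projection, so matching logits say little about what was computed upstream.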
Two models can produce similar logits for an easy token while using different internal signals.
A small model may choose a word because the phrase is common.
A SOTA model may choose the same word because it tracked the actual dependency in the prompt.
On easy examples, both look the same.
On harder examples, the difference appears.
This is why surface similarity is not enough. The question is not whether the models agree on easy completions. The question is whether they still agree when the prompt requires precision.
Measuring Similarity Can Be Misleading
There are several ways to compare model distributions.
KL divergence measures how far one probability distribution is from another. It is zero when the two match and grows as they diverge.
Top-k overlap checks whether two models place the same tokens in their top k predictions.
These metrics are useful, but they can hide the important cases.
If two models agree on thousands of easy tokens, their average similarity can look high. But the real question is where they disagree.
Do they disagree on commas?
Or do they disagree on the token that determines the answer?
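A small sketch of how averaging hides the cases that matter. The two-token distributions and the 999-to-1 split between easy and hard positions are hypothetical, chosen only to illustrate the effect.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is nonzero wherever p is nonzero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical two-token distributions at an easy position and a hard position.
easy_small, easy_sota = [0.98, 0.02], [0.99, 0.01]
hard_small, hard_sota = [0.70, 0.30], [0.30, 0.70]

easy_kl = kl_divergence(easy_sota, easy_small)
hard_kl = kl_divergence(hard_sota, hard_small)

# 999 easy positions and 1 hard one: the mean is dominated by the easy cases.
mean_kl = (999 * easy_kl + hard_kl) / 1000

print(f"easy position KL: {easy_kl:.4f}")
print(f"hard position KL: {hard_kl:.4f}")
print(f"mean over 1000:   {mean_kl:.4f}")
```

The mean lands close to the easy-position value even though the models flatly disagree at the one position that decides the answer. Reporting divergence per position, or only at high-entropy positions, is one way to keep that disagreement visible.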
Example:
| Rank | Small model | SOTA model |
|---|---|---|
| 1 | latency | bandwidth |
| 2 | bandwidth | latency |
| 3 | cache | cache |
| 4 | throughput | throughput |
All four top tokens overlap, so by top-4 overlap the models look identical.
But if the correct bottleneck is bandwidth, the SOTA model starts in the right place. The small model may frame the whole answer around latency.
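The table can be checked directly. This sketch compares the two rankings three ways: as unordered sets, by top-1 agreement, and position by position. The variable names are illustrative.

```python
# Rankings from the table above, best first.
small = ["latency", "bandwidth", "cache", "throughput"]
sota = ["bandwidth", "latency", "cache", "throughput"]

# Top-4 overlap ignores order: it only asks whether the same tokens appear.
set_overlap = len(set(small) & set(sota)) / len(small)

# Top-1 agreement asks whether greedy decoding would emit the same token.
same_top1 = small[0] == sota[0]

# Position-wise agreement counts ranks where both models place the same token.
position_matches = sum(a == b for a, b in zip(small, sota))

print(f"top-4 overlap: {set_overlap:.0%}")
print(f"same top-1 token: {same_top1}")
print(f"matching positions: {position_matches} of {len(small)}")
```

The set overlap is perfect, yet the two models would emit different tokens under greedy decoding. An order-insensitive metric cannot see the one difference that changes the output.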
High overlap does not mean equal quality.
Order matters.
Context matters.
The hard positions matter most.
Conclusion: The Last Few Percent Matter
Small models and SOTA models can have similar logits. That should not surprise us. Most tokens are easy. Most text is predictable.
The difference is concentrated in the hard positions.
Those positions have crowded distributions, small margins, rare correct tokens, and long-range constraints. A tiny change in logits can move the right token from rank three to rank one. That changes the next token. The next token changes the next context. The context changes the rest of the answer.
This is why a small numerical difference can become a large intelligence difference.
For easy work, good enough is good enough.
For hard work, good enough breaks.
SOTA is what you want when the answer depends on the edge.