A deeper look
To fully understand and appreciate the speculative cascades approach, we first compare cascades and speculative decoding with a simple example. Imagine you ask an LLM a straightforward question:
Prompt: “Who is Buzz Aldrin?”
Let’s say we have two models available to answer this: a small, fast “drafter” model and a large, powerful “expert” model.
Here’s how they might respond:
- Small Model: Buzz Aldrin is an American former astronaut, engineer, and fighter pilot, best known as the second person to walk on the Moon.
- Large Model: Edwin “Buzz” Aldrin, a pivotal figure in the history of space exploration, is an American former astronaut, engineer, and fighter pilot who is best known for being the second human to walk on the Moon.
Both models provide excellent, factually correct answers, but they interpret the user’s intent slightly differently. The small model delivers a quick, factual summary, while the large model provides a more formal, encyclopedic-style entry. Depending on the user’s need — be it a fast fact or a detailed overview — either response could be considered ideal. The key is that they represent two distinct, equally valid styles.
Now, let’s see how the two main speed-up techniques handle this scenario.
With cascades, the small “drafter” model gets the prompt first. If it’s confident in its answer, it replies. If not, it defers the entire task to the large “expert” model.
In our example:
- The small model generates its concise and correct answer.
- It checks its confidence and, finding it high, sends the response to the user.
This works! We get a great answer quickly. But the process is sequential. If the small model hadn’t been confident, we would have wasted time waiting for it to finish, only to then start the large model from scratch. This sequential “wait-and-see” approach is a fundamental bottleneck.
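To make the deferral rule concrete, here is a minimal Python sketch of a confidence-based cascade. The model calls and the confidence score are toy stand-ins (hypothetical helpers, not the paper's implementation); the point is the sequential "answer or defer" control flow.

```python
def small_model_generate(prompt: str) -> tuple[str, float]:
    """Stand-in for the small drafter: returns (answer, confidence score)."""
    answer = ("Buzz Aldrin is an American former astronaut, engineer, and "
              "fighter pilot, best known as the second person to walk on the Moon.")
    return answer, 0.92


def large_model_generate(prompt: str) -> str:
    """Stand-in for the large expert model."""
    return ('Edwin "Buzz" Aldrin, a pivotal figure in the history of space '
            "exploration, is an American former astronaut, engineer, and fighter "
            "pilot who is best known for being the second human to walk on the Moon.")


def cascade(prompt: str, threshold: float = 0.8) -> str:
    """Reply with the small model if it is confident; otherwise defer entirely."""
    draft, confidence = small_model_generate(prompt)  # run the small model first
    if confidence >= threshold:
        return draft                                  # confident: reply immediately
    # Not confident: only now does the large model start, from scratch --
    # this is the sequential "wait-and-see" bottleneck described above.
    return large_model_generate(prompt)


print(cascade("Who is Buzz Aldrin?"))
```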
With speculative decoding, the small model quickly drafts the first few tokens of the answer, and the large model verifies the draft in parallel, correcting the first mistake it finds.
In our example:
- The small model drafts the beginning of its answer: [Buzz, Aldrin, is, an, …]
- The large model verifies this draft. Its own preferred first token is Edwin.
- Since Buzz ≠ Edwin, the very first token is a mismatch.
- The entire draft is rejected and the first token is replaced with Edwin. The process then repeats from this corrected point to generate the rest of the answer, but the initial speed advantage has been lost.
Even though the small model produced a good answer, the requirement to match the large model token by token forces a rejection. We lose the speed benefit and end up with an answer that is not necessarily better. The example above uses a simple exact-match rejection rule; in the full paper, we also consider a “probabilistic match” that allows greater flexibility in the token-by-token comparison.
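For concreteness, here is a minimal Python sketch of the verification step under the exact-match rule, again with toy stand-ins for the models (real implementations score the whole draft with the large model in a single parallel forward pass). A “probabilistic match” would relax the `token == target` test to a lossy acceptance rule; in standard speculative sampling, for example, a drafted token is kept with probability min(1, p_large(token) / p_small(token)).

```python
def large_model_next_token(prefix: list[str]) -> str:
    """Stand-in: the large model's preferred next token given a prefix."""
    preferred = {(): "Edwin", ("Edwin",): '"Buzz"'}
    return preferred.get(tuple(prefix), "...")


def verify_draft(prefix: list[str], draft: list[str]) -> list[str]:
    """Accept drafted tokens until the first mismatch with the large model."""
    accepted = list(prefix)
    for token in draft:
        target = large_model_next_token(accepted)  # checked in parallel in practice
        if token == target:
            accepted.append(token)                 # exact match: keep the drafted token
        else:
            accepted.append(target)                # mismatch: discard the rest of the draft
            break                                  # and continue from the corrected token
    return accepted


# The drafter proposes "Buzz Aldrin is an ...", but the large model prefers
# "Edwin", so the very first token is rejected and the speed advantage is lost.
print(verify_draft([], ["Buzz", "Aldrin", "is", "an"]))  # ['Edwin']
```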