Use speculative decoding for inference#
What is Speculative Decoding?#
Decoding is often a time-consuming step in large autoregressive models like Transformers. Common decoding strategies, such as greedy decoding or beam search, require running the model once per generated token: producing K tokens can take n*K serial runs, where n is the beam size (n = 1 for greedy decoding). Given the high cost of running large models, speculative decoding (also called assisted decoding) offers a more efficient alternative.
The key idea behind speculative decoding is to use a smaller, approximate model (called the draft or assistant model) to generate candidate tokens. These tokens are then validated in a single forward pass by the larger model, speeding up the overall decoding process. This approach achieves the same sampling quality as autoregressive decoding but with significantly reduced computation time — up to 2x faster for large models.
How Speculative Decoding Works#
Draft Model: A smaller, more efficient model proposes tokens one at a time.
Target Model Verification: The larger model verifies these tokens in a single forward pass. It confirms correct tokens and corrects any incorrect ones.
Multiple Tokens Per Pass: Instead of generating one token per pass, speculative decoding processes multiple tokens simultaneously, reducing overall latency.
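To make these three steps concrete, here is a minimal sketch of the greedy variant of the draft-and-verify loop. Note that draft_model and target_model are hypothetical stand-ins for causal language models that map a batch of token IDs to per-position next-token logits; the sampled variant described in the literature adds an acceptance/rejection step so the output distribution exactly matches the target model's.

```python
import torch

def speculative_decode(draft_model, target_model, tokens, k=5, max_new_tokens=64):
    """Greedy speculative decoding sketch: draft k tokens, verify in one pass.

    `tokens` must be non-empty (e.g. start with a BOS token).
    """
    tokens = list(tokens)
    start = len(tokens)
    while len(tokens) - start < max_new_tokens:
        L = len(tokens)
        # 1. Draft model proposes k candidate tokens autoregressively (cheap).
        draft = list(tokens)
        for _ in range(k):
            logits = draft_model(torch.tensor([draft]))    # (1, len(draft), vocab)
            draft.append(int(logits[0, -1].argmax()))
        # 2. Target model scores the entire draft in a single forward pass.
        logits = target_model(torch.tensor([draft]))       # (1, L + k, vocab)
        # 3. Accept draft tokens while they match the target's greedy choice;
        #    on the first mismatch, keep the target's token as the correction.
        for i in range(k):
            target_tok = int(logits[0, L + i - 1].argmax())
            tokens.append(target_tok)
            if target_tok != draft[L + i]:
                break
        else:
            # All k drafts accepted: the same pass yields one extra token for free.
            tokens.append(int(logits[0, L + k - 1].argmax()))
    return tokens[:start + max_new_tokens]
```

Because a draft token is kept only when the target model would have produced it anyway, the result is identical to plain greedy decoding with the target model; the speedup comes from committing several tokens per expensive target forward pass.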
For more algorithmic details, check out the following papers:
Fast Inference from Transformers via Speculative Decoding (Leviathan et al., 2023)
Accelerating Large Language Model Decoding with Speculative Sampling (Chen et al., 2023)
Using Speculative Decoding in AutoGluon-RAG#
In AutoGluon-RAG, speculative decoding can be enabled with a few configuration lines. This is supported for both Huggingface models and vLLM-based models. By default, the assistant model and the target LLM should share the same tokenizer, so that tokens do not have to be re-encoded and re-decoded between the two models (see Universal Assisted Decoding below for the exception).
Speculative Decoding with Huggingface Models#
In Huggingface, the draft model is referred to as the “assistant model”. In the Huggingface transformers framework, the generate() parameter assistant_model is used to specify the draft model.
To use speculative decoding in AutoGluon-RAG with Huggingface models, configure the assistant model like this:
generator_model_name: meta-llama/Llama-3.1-8B
generator_model_platform: huggingface
generator_model_platform_args:
  hf_generate_params:
    assistant_model: meta-llama/Llama-3.2-1B
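Under the hood, this corresponds to the assistant-model path of Huggingface's generate() API. For reference, here is a minimal sketch in plain transformers, using the model names from the config above (the prompt is an arbitrary placeholder):

```python
# Minimal sketch of assisted generation in plain Huggingface transformers,
# equivalent to the AutoGluon-RAG config above.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
assistant = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

inputs = tokenizer("What is speculative decoding?", return_tensors="pt")
# Passing assistant_model enables speculative (assisted) decoding.
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```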
Universal Assisted Decoding#
You may also use an assistant model with a different tokenizer from the target model. All you need to do is explicitly specify the assistant tokenizer:
generator_model_name: google/gemma-2-9b
generator_model_platform: huggingface
generator_model_platform_args:
  hf_generate_params:
    assistant_model: double7/vicuna-68m
    assistant_tokenizer: double7/vicuna-68m
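In plain transformers, universal assisted decoding corresponds to passing both tokenizers to generate(). A minimal sketch under that assumption, mirroring the config above (the prompt is a placeholder):

```python
# Sketch of universal assisted decoding in plain transformers (v4.46.0+):
# both the target and assistant tokenizers are passed to generate().
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b")
assistant = AutoModelForCausalLM.from_pretrained("double7/vicuna-68m")
assistant_tokenizer = AutoTokenizer.from_pretrained("double7/vicuna-68m")

inputs = tokenizer("What is speculative decoding?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    assistant_model=assistant,
    tokenizer=tokenizer,
    assistant_tokenizer=assistant_tokenizer,
    max_new_tokens=64,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```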
Note: Transformers v4.46.0 or above is required to support universal assisted decoding.
Speculative Decoding with vLLM Models#
Speculative decoding with vLLM is also straightforward. Here is an example configuration that sets up vLLM in offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time:
generator_model_name: facebook/opt-6.7b
generator_model_platform: vllm
generator_model_platform_args:
  vllm_init_params:
    speculative_model: facebook/opt-125m
    num_speculative_tokens: 5
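For reference, the same setup in vLLM's offline Python API might look like the sketch below. The constructor arguments mirror the vllm_init_params keys above; note that newer vLLM releases have reorganized their speculative-decoding options, so the exact signature depends on your version (the prompt is a placeholder):

```python
# Sketch of the equivalent offline vLLM setup: the LLM constructor takes the
# draft model and the number of tokens to speculate per step.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-6.7b",
    speculative_model="facebook/opt-125m",
    num_speculative_tokens=5,
)
outputs = llm.generate(
    ["What is speculative decoding?"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```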
With these configurations, AutoGluon-RAG provides an efficient way to speed up text generation while preserving the quality of the output.
Summary: Speculative decoding is a technique used to speed up the decoding process of large autoregressive models, such as Transformers. By using a smaller, approximate model (draft or assistant model) to propose candidate tokens and then verifying them with the larger model in a single forward pass, this method improves text generation speed while maintaining similar sampling quality.