Support

FAQs

Answers to common questions about nano-PEARL.

⚡ Performance & Configuration

Why do I see different token counts when testing the same dataset?

Short answer: This is expected behavior in PEARL benchmarking.

Detailed explanation:

For PEARL benchmarking, we run a fixed number of steps rather than a fixed token output. This design choice ensures that PEARL maintains a consistent batch size throughout the entire execution.

ℹ️
Why this matters: If we used fixed token output, early request completions would reduce the batch size, which would negatively impact throughput measurements. Running fixed steps prevents this issue and gives more accurate performance metrics.
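
A minimal, self-contained sketch of this behavior (toy code, not nano-PEARL's actual benchmark loop; the per-step token counts are simulated):

# Toy illustration: run a fixed number of decode steps with a constant batch.
import random

def run_fixed_steps(num_steps: int, batch_size: int) -> int:
    """Return the total number of tokens produced over a fixed number of steps."""
    total_tokens = 0
    for _ in range(num_steps):
        # Every sequence advances on every step, so the batch never shrinks and
        # throughput is measured at a constant batch size. The tokens emitted
        # per step depend on how many draft tokens are accepted, which varies
        # between runs -- hence different token totals on the same dataset.
        total_tokens += sum(random.randint(1, 4) for _ in range(batch_size))
    return total_tokens

print(run_fixed_steps(num_steps=100, batch_size=32))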
Getting assertion errors at batch-size=32 on CNN/DM datasets?

Solution: Increase the max_num_batched_tokens parameter in your configuration.

Steps to fix:

  1. Open pearl_config.py
  2. Set max_num_batched_tokens = 65536
  3. Restart your benchmark
# In pearl_config.py
config = PEARLConfig(
    draft_model_path="...",
    target_model_path="...",
    max_num_batched_tokens=65536,  # Increase this value
    ...
)
Why is my speedup only 0.95x at batch-size=32 while smaller batches show 1.36x-2.8x?

Root cause: This is typically due to compute-bound limitations at higher batch sizes, especially on certain GPU architectures like H20.

Recommended solutions:

  • Use larger draft models: Switch from 1.5B to 3B-8B parameter draft models
  • Reduce gamma values: A stronger draft model produces higher-quality drafts, so fewer speculated tokens are needed per round, which lowers the drafting overhead
  • Improve acceptance rates: Better draft models lead to higher acceptance rates and better overall throughput
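
As an illustrative sketch of the first two suggestions (hypothetical values; the gamma field name is an assumption, so check pearl_config.py for the actual parameter name):

# In pearl_config.py -- illustrative example, not a verified configuration
config = PEARLConfig(
    draft_model_path="path/to/3B-draft-model",  # larger draft model
    target_model_path="path/to/target-model",
    gamma=2,  # fewer speculated tokens per round (assumed field name)
    max_num_batched_tokens=65536,
)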
⚠️
Hardware consideration: Some GPUs become compute-bound at higher batch sizes. If you consistently see degraded performance, consider adjusting your batch size or upgrading to GPUs with higher compute capacity.

📊 Terminology & Metrics

What does "AR" mean in the benchmark results?

AR stands for Auto-Regressive decoding.

In the context of nano-PEARL benchmarks, AR is used as a comparison baseline to calculate PEARL's speedup ratio. It represents the standard sequential token generation approach where each token is generated one after another, without speculation.

The speedup ratio is calculated as: Speedup = PEARL_throughput / AR_throughput
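
For example, with made-up throughput numbers:

# Illustrative numbers only
pearl_throughput = 3400.0  # tokens/s measured with PEARL
ar_throughput = 2500.0     # tokens/s measured with plain auto-regressive decoding
print(f"Speedup: {pearl_throughput / ar_throughput:.2f}x")  # Speedup: 1.36x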

What is the MAT metric?

MAT stands for Mean Accept Tokens.

This metric originates from the original PEARL paper and reflects the performance an equivalent vanilla speculative decoding setup would achieve. It measures the average number of draft tokens accepted per speculation round.

  • Higher MAT: Better alignment between draft and target models
  • Lower MAT: More rejections, indicating poor draft quality

MAT is a key indicator of how well your draft model is performing and whether your speculation strategy is effective.
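
A small sketch of the definition (not nano-PEARL's metrics code):

# MAT = mean number of draft tokens accepted per speculation round
def mean_accept_tokens(accepted_per_round: list[int]) -> float:
    return sum(accepted_per_round) / len(accepted_per_round)

# Rounds that accepted 4, 2, 3, and 1 draft tokens give MAT = 2.5
print(mean_accept_tokens([4, 2, 3, 1]))  # 2.5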

🏗️ Architecture & Design

Can DT Disaggregation and Parallel Inference work independently?

No, they are the core of PEARL and cannot operate independently.

Why they're interdependent:

PEARL's architecture fundamentally relies on overlapping draft model computation with target model verification through "pre-verify" and "post-verify" mechanisms. These features work together to:

  • Reduce draft model runtime overhead
  • Maintain throughput at larger batch sizes
  • Enable adaptive draft length adjustments
💡
Design insight: "DT disaggregation" is the mechanism that enables the parallel implementation; the adaptive draft length emerges naturally from the combined effects of the pre-verify and post-verify operations.
What are the "pre-verify" and "post-verify" mechanisms?

These are PEARL's core coordination mechanisms between the draft and target models:

  • Pre-verify: The target model can interrupt the draft model when alignment is poor, preventing the generation of low-quality draft tokens
  • Post-verify: When alignment is good, the draft model continues generating without interruption, maximizing speculation efficiency

Combined effect: This creates the adaptive draft length feature - automatically adjusting how many tokens to speculate based on real-time alignment quality.
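
A toy simulation of that combined effect (not nano-PEARL code; alignment is modeled here as a simple random accept/reject, whereas in PEARL the target's verification runs in parallel with drafting):

# Toy model of adaptive draft length via pre-verify / post-verify
import random

def speculate_one_round(max_draft_len: int = 8, accept_prob: float = 0.8) -> int:
    """Return how many draft tokens survive one speculation round."""
    kept = 0
    for _ in range(max_draft_len):
        if random.random() < accept_prob:
            # Post-verify path: alignment is good, drafting continues
            # uninterrupted, so the draft grows longer.
            kept += 1
        else:
            # Pre-verify path: the target flags poor alignment and interrupts
            # the draft model, cutting the draft short.
            break
    return kept

# Well-aligned model pairs keep long drafts; poorly aligned ones stop early.
print([speculate_one_round() for _ in range(5)])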

Where can I find detailed ablation studies?

Primary source: The original PEARL paper contains comprehensive ablation studies.

📄 Paper reference: PEARL: Parallel Speculative Decoding with Adaptive Draft Length

Why nano-PEARL focuses on high batch sizes:

The nano-PEARL implementation was specifically motivated by the lack of high batch-size evaluation in baseline approaches. Our focus is on demonstrating PEARL's advantages at scale (batch_size=32 and above) where traditional methods struggle.

How can I compare with vanilla Speculative Decoding?

Recommended approach: Use the external nano-vllm-spec repository for vanilla SpS benchmarking.

🔗 Repository: nano-vllm-spec

ℹ️
Note: nano-PEARL does not implement traditional Speculative Sampling (SpS) as it uses a fundamentally different parallel architecture. For apples-to-apples comparison with vanilla methods, please use the dedicated nano-vllm-spec implementation.

📚 Additional Resources

I have more questions. Where can I get help?

Community support:

When reporting issues, please include:

  1. Your hardware configuration (GPU model, number of GPUs)
  2. Model sizes (draft and target)
  3. Batch size and other relevant parameters
  4. Error messages or unexpected behavior description