⚡ Performance & Configuration
Why do I see different token counts when testing the same dataset? Performance
Short answer: This is expected behavior in PEARL benchmarking.
Detailed explanation:
For PEARL benchmarking, we run a fixed number of steps rather than targeting a fixed token output. This design choice keeps the batch size constant throughout the entire execution; because the number of tokens produced per step depends on how many draft tokens are accepted, the total token count naturally varies between runs.
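As a rough illustration, the sketch below mimics a fixed-step loop with a stand-in engine (the class and its step() method are assumptions for illustration, not nano-PEARL's actual API): the batch size is constant across all steps, but the token total varies with acceptance.

```python
import random

class FakeEngine:
    """Stand-in for the benchmark engine; nano-PEARL's real API differs."""
    def step(self):
        # One speculation round: a variable number of draft tokens is accepted.
        return ["tok"] * random.randint(1, 6)

engine = FakeEngine()
NUM_STEPS = 100  # fixed step budget, identical across runs

total_tokens = sum(len(engine.step()) for _ in range(NUM_STEPS))
# The total differs run to run, even on the same dataset, while the
# batch size stays constant for the full step budget.
print(total_tokens)
```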
Getting assertion errors at batch-size=32 on CNN/DM datasets? Config
Solution: Increase the max_num_batched_tokens parameter in your configuration. CNN/DM prompts are long, so a batch of 32 of them can exceed the default batched-token budget and trip the assertion.
Steps to fix:
- Open pearl_config.py
- Set max_num_batched_tokens = 65536
- Restart your benchmark
```python
# In pearl_config.py
config = PEARLConfig(
    draft_model_path="...",
    target_model_path="...",
    max_num_batched_tokens=65536,  # increase this value
    # ... other parameters unchanged
)
```
Why is my speedup only 0.95x at batch-size=32 while smaller batches show 1.36x-2.8x? Performance
Root cause: This is typically a compute-bound limitation at higher batch sizes, where draft-model steps are no longer close to free, especially on GPU architectures like the H20 (the rough cost model below the list illustrates the effect).
Recommended solutions:
- Use larger draft models: switch from a 1.5B draft model to one in the 3B-8B range
- Reduce gamma values: larger draft models typically produce better-quality drafts, so fewer speculative tokens are needed per round and the gamma overhead drops
- Improve acceptance rates: better draft models lead to higher acceptance rates and better overall throughput
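For intuition, here is a back-of-envelope cost model for vanilla speculative decoding (a deliberate simplification: PEARL overlaps draft and target work, but the same compute-bound trend applies, and the numbers are illustrative, not measurements):

```python
def estimated_speedup(mat: float, gamma: int, c: float) -> float:
    """Rough speculative-decoding speedup estimate (illustrative only).

    mat:   mean accepted tokens per round (the MAT metric below)
    gamma: draft tokens generated per round
    c:     cost of one draft step relative to one target step
    """
    # AR needs `mat` target steps for the same tokens; one speculative
    # round costs `gamma` draft steps plus one target verification step.
    return mat / (gamma * c + 1)

# Small batch: the target is memory-bound and draft steps are nearly free.
print(estimated_speedup(mat=4.0, gamma=5, c=0.2))  # ~2.0x
# Large batch on a compute-bound GPU (e.g. H20): relative draft cost rises.
print(estimated_speedup(mat=4.0, gamma=5, c=0.6))  # ~1.0x
```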
📊 Terminology & Metrics
What does "AR" mean in the benchmark results?
AR stands for Auto-Regressive decoding.
In the context of nano-PEARL benchmarks, AR is used as a comparison baseline to calculate PEARL's speedup ratio. It represents the standard sequential token generation approach where each token is generated one after another, without speculation.
The speedup ratio is calculated as: Speedup = PEARL_throughput / AR_throughput
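For example (illustrative numbers, not benchmark results): if PEARL sustains 1,400 tokens/s where the AR baseline sustains 1,000 tokens/s on the same workload, the reported speedup is 1,400 / 1,000 = 1.4x.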
What is the MAT metric?
MAT stands for Mean Accept Tokens.
This metric originates from the original PEARL paper and indicates the performance an equivalent vanilla speculative decoding setup would achieve. It measures the average number of tokens accepted per speculation round.
- Higher MAT: Better alignment between draft and target models
- Lower MAT: More rejections, indicating poor draft quality
MAT is a key indicator of how well your draft model is performing and whether your speculation strategy is effective.
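As a minimal illustration of how MAT falls out of per-round acceptance counts (the numbers are made up, not nano-PEARL output):

```python
# Tokens accepted in each speculation round of a hypothetical run.
accepted_per_round = [5, 3, 6, 2, 4]

mat = sum(accepted_per_round) / len(accepted_per_round)
print(f"MAT = {mat:.1f}")  # MAT = 4.0 -> each round advances ~4 tokens
```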
🏗️ Architecture & Design
Can DT Disaggregation and Parallel Inference work independently? Architecture
No, they are the core of PEARL and cannot operate independently.
Why they're interdependent:
PEARL's architecture fundamentally relies on overlapping draft model computation with target model verification through "pre-verify" and "post-verify" mechanisms. These features work together to:
- Reduce draft model runtime overhead
- Maintain throughput at larger batch sizes
- Enable adaptive draft length adjustments
What are the "pre-verify" and "post-verify" mechanisms? Architecture
These are PEARL's core coordination mechanisms between the draft and target models:
- Pre-verify: The target model can interrupt the draft model when alignment is poor, preventing the generation of low-quality draft tokens
- Post-verify: When alignment is good, the draft model continues generating without interruption, maximizing speculation efficiency
Combined effect: together these create the adaptive draft length feature, which automatically adjusts how many tokens to speculate based on real-time alignment quality.
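A highly simplified control-flow sketch of the idea (paper-level intuition only; every name here is an assumption rather than nano-PEARL code, and the two phases are sequentialized for readability even though PEARL overlaps them in parallel):

```python
def speculation_round(draft_model, target_model, context, gamma):
    """Illustrative pre-/post-verify flow; not PEARL's actual implementation."""
    first = draft_model.generate(context, n=1)

    # Pre-verify: the target checks the first draft token early. On poor
    # alignment it interrupts the round, so no further low-quality draft
    # tokens are generated.
    if not target_model.accepts(context, first):
        return target_model.generate(context, n=1)

    # Post-verify: alignment looks good, so the draft model keeps drafting
    # without interruption while the target verifies the full window.
    drafts = first + draft_model.generate(context + first, n=gamma - 1)
    return target_model.verify(context, drafts)
```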
Where can I find detailed ablation studies? Architecture
Primary source: The original PEARL paper contains comprehensive ablation studies.
📄 Paper reference: PEARL: Parallel Speculative Decoding with Adaptive Draft Length
Why nano-PEARL focuses on high batch sizes:
The nano-PEARL implementation was specifically motivated by the lack of high batch-size evaluation in baseline approaches. Our focus is on demonstrating PEARL's advantages at scale (batch_size=32 and above) where traditional methods struggle.
How can I compare with vanilla Speculative Decoding?
Recommended approach: Use the external nano-vllm-spec repository for vanilla SpS benchmarking.
🔗 Repository: nano-vllm-spec
📚 Additional Resources
I have more questions. Where can I get help?
Community support:
- 🐛 Open an issue: GitHub Issues
- 💬 Join discussions: GitHub Discussions
- 📖 Read the paper: PEARL on arXiv
When reporting issues, please include:
- Your hardware configuration (GPU model, number of GPUs)
- Model sizes (draft and target)
- Batch size and other relevant parameters
- Error messages or unexpected behavior description