
nano-PEARL: Unleashing Batch Throughput

Reimagined Speculative Decoding with Parallelism and Adaptive Draft Length.

🚀Overview

nano-PEARL logo and overview diagram

Speculative decoding (SD) is a powerful technique for accelerating Large Language Model (LLM) inference. We are excited to introduce nano-PEARL, a single-node, multi-GPU parallel speculative decoding engine built for practical, high-throughput deployment and fast research iteration.

nano-PEARL's core design revolves around Draft-Target Disaggregation (DT): it separates Draft and Target models onto dedicated device groups and runs them concurrently. By integrating production-grade accelerators—Prefix KV Cache, Paged/Flash Attention, CUDA Graphs, and Tensor Parallelism—it delivers exceptional throughput under heavy batch workloads without sacrificing output fidelity.

🎉Latest News

  • [2025.11] 🔥 We release comparisons between EAGLE-3 and nano-PEARL! Check them out on our benchmark page.
  • [2025.11] 🔥 We now support non-power-of-2 TP sizes for better GPU utilization!
  • [2025.11] We release more nano-PEARL benchmark results on NVIDIA L40S!
  • [2025.11] The nano-PEARL web page is now live!
  • [2025.10] We release the source code of nano-PEARL. PRs are warmly welcome!
  • Paper Collection: We are collecting papers that follow the parallel speculative decoding paradigm to build a more comprehensive overview of the field!
  • Coming Soon: More updates and features are in development!

Why a New Parallel SD Engine?

While speculative decoding has emerged as a breakthrough for accelerating LLM inference, the lack of robust, deployment-ready implementations has significantly hindered its adoption for real-world workloads. Many existing projects and papers demonstrate impressive speedups, but often suffer from critical limitations: they rely on idealized `bs=1` (batch size 1) demos that fail to represent production scenarios, they create new performance bottlenecks under large batches, or they force engineers to abandon essential system optimizations like Paged Attention. This creates a significant gap between algorithmic research and practical, high-throughput deployment. To bridge this gap, we built nano-PEARL—a purpose-built engine designed from the ground up to make parallel speculative decoding a first-class citizen in a modern, production-grade inference stack, ensuring that throughput scales with batch size.

Key Capabilities of nano-PEARL

  • High-Throughput Batch Inference: Engineered to excel at large batch sizes (bs=32+) and long contexts, not just bs=1 demos.
  • Native System Integration: Coexists perfectly with modern inference accelerators like Paged Attention, Flash Attention, and CUDA Graphs.
  • Scalable Parallelism: Leverages Draft-Target Disaggregation and Tensor Parallelism to efficiently scale across multiple GPUs on a single node.
  • Resource-Efficient: Avoids resource contention and optimizes VRAM usage through Prefix KV Caching and independent model device placement.

Key Features of nano-PEARL

PEARL architecture diagram

Draft-Target Disaggregation

The draft and target models are loaded onto separate, dedicated device groups. This fundamental design choice eliminates resource contention for VRAM and compute. It also allows for independent configuration, such as assigning a different tensor-parallel (TP) size to the large target model (e.g., TP=2) and the small draft model (e.g., TP=1), optimizing resource usage across all available GPUs.
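As a rough illustration, the placement could be described like the hypothetical sketch below; the class, field, and model names are illustrative assumptions, not nano-PEARL's actual configuration API:

```python
# Hypothetical placement sketch: the class and field names are illustrative,
# not nano-PEARL's actual configuration API.
from dataclasses import dataclass

@dataclass
class ModelPlacement:
    model_path: str
    device_ids: list[int]   # dedicated GPUs for this model
    tp_size: int            # tensor-parallel degree within the group

# Large target model on its own TP=2 group; small draft model on a single GPU.
target = ModelPlacement("meta-llama/Llama-3.1-70B-Instruct", device_ids=[0, 1], tp_size=2)
draft = ModelPlacement("meta-llama/Llama-3.1-8B-Instruct", device_ids=[2], tp_size=1)

for m in (target, draft):
    print(f"{m.model_path}: GPUs {m.device_ids}, TP={m.tp_size}")
```

Because each model owns its device group, the draft model never competes with the target model for VRAM or compute.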

Parallel Inference

Both the draft model and the target model run inference concurrently. The target model performs "on-the-fly" verification of the tokens being generated by the draft model. This parallel execution minimizes serial dependency, keeps all assigned GPUs highly utilized, and ensures the system is always making forward progress rather than waiting for a long draft to complete before verification.
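The toy sketch below (plain Python threads, not nano-PEARL's scheduler) illustrates the idea: a drafter streams speculative tokens while a verifier consumes and checks them concurrently, signalling the drafter to stop as soon as a mismatch appears.

```python
# Toy illustration of draft/target concurrency (not nano-PEARL's scheduler):
# a drafter thread streams speculative tokens into a queue while the verifier
# consumes and checks them "on the fly", stopping the drafter on a mismatch.
import queue
import threading

token_q: "queue.Queue[int]" = queue.Queue()
stop_draft = threading.Event()

def drafter(max_tokens: int = 16) -> None:
    for t in range(max_tokens):
        if stop_draft.is_set():          # target found a mismatch: halt early
            break
        token_q.put(t)                   # stand-in for a drafted token id
    token_q.put(None)                    # sentinel: draft finished or was halted

def verifier(reference: list) -> None:
    idx = 0
    while True:
        tok = token_q.get()
        if tok is None:
            break
        if idx >= len(reference) or tok != reference[idx]:
            stop_draft.set()             # reject: stop the drafter immediately
            break
        idx += 1                         # accept and keep verifying in parallel
    print(f"accepted {idx} drafted tokens")

d = threading.Thread(target=drafter)
v = threading.Thread(target=verifier, args=([0, 1, 2, 5, 6],))
d.start(); v.start(); d.join(); v.join()
```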

Dynamic TP Size

We support loading the draft/target models with non-power-of-2 TP sizes (e.g., 3, 6, 7). To the best of our knowledge, we are the first to implement this feature for LLM inference!
⚠️ This is currently an experimental feature. We implement dynamic TP by padding the parameters, which introduces additional computation and reduces overall throughput.
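The sketch below shows the padding idea in isolation (NumPy only, not nano-PEARL's actual sharding code): the sharded dimension is rounded up to a multiple of the TP size so every rank gets an equal slice, and the zero rows added by padding are the extra computation mentioned above.

```python
# Minimal sketch of the padding idea behind non-power-of-2 TP (illustrative,
# not nano-PEARL's sharding code): pad the sharded dimension up to a multiple
# of tp_size, then split evenly; the padded rows carry zeros.
import numpy as np

def pad_and_shard(weight: np.ndarray, tp_size: int) -> list:
    out_dim = weight.shape[0]
    padded = ((out_dim + tp_size - 1) // tp_size) * tp_size   # round up
    pad_rows = padded - out_dim
    if pad_rows:
        weight = np.concatenate([weight, np.zeros((pad_rows, weight.shape[1]))], axis=0)
    return np.split(weight, tp_size, axis=0)

# e.g. a 4096 x 4096 projection sharded across TP=6 GPUs
shards = pad_and_shard(np.random.randn(4096, 4096), tp_size=6)
print(len(shards), shards[0].shape)   # 6 shards of (683, 4096); 2 rows are padding
```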

Adaptive Draft Length

nano-PEARL supports adaptive lookahead (gamma) to dynamically balance speculation length and acceptance rates. When alignment between the models is good, the draft model is allowed to generate a longer sequence of tokens without interruption. When alignment is poor (i.e., a mismatch is detected), the target model immediately halts the draft model, preventing it from generating "trash tokens" and wasting valuable compute cycles.
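A minimal sketch of one possible update rule is shown below; the exact heuristic inside nano-PEARL may differ, so treat the growth/backoff policy and the numbers as assumptions:

```python
# Hedged sketch of adaptive lookahead (gamma): grow gamma when a draft is
# fully accepted, back off toward shorter drafts when the target rejects early.
def update_gamma(gamma: int, n_accepted: int, gamma_min: int = 1, gamma_max: int = 16) -> int:
    if n_accepted >= gamma:          # every drafted token accepted: draft longer next time
        return min(gamma + 2, gamma_max)
    return max(n_accepted + 1, gamma_min)   # early rejection: shrink the lookahead

gamma = 4
for accepted in [4, 6, 3, 4, 0]:     # accepted counts from successive rounds
    gamma = update_gamma(gamma, accepted)
    print(f"accepted={accepted} -> next gamma={gamma}")
```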

Auto-Set Hyper-parameters

To simplify deployment and tuning, the engine can automatically configure optimal operational parameters based on your specific hardware setup. This lowers the barrier to entry, allowing researchers and practitioners to achieve high performance out-of-the-box without extensive manual profiling and configuration.
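For illustration only, an auto-configuration step could look like the sketch below; the heuristics, thresholds, and parameter names are assumptions, not nano-PEARL's actual rules:

```python
# Hypothetical auto-configuration sketch: derive a KV-cache budget, batch size,
# and starting gamma from the detected hardware (all heuristics are assumed).
def auto_config(gpu_mem_gib: float, model_mem_gib: float, num_gpus: int) -> dict:
    free_gib = gpu_mem_gib * num_gpus - model_mem_gib
    kv_budget_gib = max(free_gib * 0.9, 0.0)         # reserve ~10% headroom
    return {
        "kv_cache_gib": round(kv_budget_gib, 1),
        "max_batch_size": 32 if kv_budget_gib > 40 else 8,
        "initial_gamma": 4,                           # conservative starting lookahead
    }

print(auto_config(gpu_mem_gib=141.0, model_mem_gib=140.0, num_gpus=2))
```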

High Performance

nano-PEARL is built directly on top of a production-grade inference stack. It natively integrates state-of-the-art acceleration kernels and techniques, including CUDA Graphs, Paged Attention, and Flash Attention. Users do not have to trade system performance for an advanced decoding algorithm—they get both simultaneously.

Memory Efficient

By leveraging Prefix KV Caching, nano-PEARL efficiently reuses the KV cache of shared prefixes between requests. This significantly reduces the memory footprint and redundant computation, enabling higher batch sizes, support for longer context lengths, and overall improved system throughput.
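The sketch below illustrates prefix reuse with block hashing (a common paged-KV technique; the data structures are simplified assumptions, not nano-PEARL internals): requests that share a prompt prefix hash to the same blocks, so their KV entries are computed once and reused.

```python
# Illustrative sketch of prefix KV-cache reuse via block hashing (simplified,
# not nano-PEARL's internal data structures).
import hashlib

BLOCK_SIZE = 4  # tokens per KV block (real engines use e.g. 16 or 32)
cache = {}      # block hash -> "KV block" (here just a placeholder string)

def get_or_compute_blocks(token_ids: list) -> int:
    hits = 0
    prefix = ()
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        prefix = prefix + tuple(token_ids[i:i + BLOCK_SIZE])
        h = hashlib.sha256(repr(prefix).encode()).hexdigest()
        if h in cache:
            hits += 1                                 # shared prefix: reuse the block
        else:
            cache[h] = f"kv-block-{len(cache)}"       # otherwise compute and store it
    return hits

system_prompt = list(range(12))                       # 12 shared prompt tokens = 3 full blocks
print(get_or_compute_blocks(system_prompt + [101, 102, 103, 104]))  # 0 hits (cold cache)
print(get_or_compute_blocks(system_prompt + [201, 202, 203, 204]))  # 3 hits (prefix reused)
```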

📊Experiments & Results

Our evaluations demonstrate nano-PEARL's strength in practical, high-batch scenarios, moving beyond idealized bs=1 metrics. The charts below contrast performance at bs=32 (a realistic production load) with bs=1 (a common academic baseline) on NVIDIA H200 hardware.

nano-PEARL benchmark results at batch size 32
Fig 1: Performance at Batch Size 32 (HumanEval)
nano-PEARL benchmark results at batch size 1
Fig 2: Performance at Batch Size 1 (GSM8K)

Beyond `bs=1`

In the bs=1 scenario (Fig 2), nano-PEARL achieves massive relative speedups of up to 6.27x for Llama-3.1-70B. For bs=32, Fig 1 highlights the most critical advantage for production: while standard AR decoding gains some throughput from batching (1,159.17 tok/s), it still faces a fundamental ceiling. nano-PEARL's parallel architecture shatters this ceiling by unlocking the true potential of batched inference. By running the draft and target models concurrently, it achieves a massive 3,546.72 tok/s (3.06x) on the Llama-3.1-70B pair.

This demonstrates nano-PEARL's core design goal: it successfully overcomes the limitations of "toy demos" and delivers substantial, practical throughput gains under the high-batch, high-concurrency loads required by real-world services.

For more results and comparisons on different hardware (like L40S) and datasets, please see the Benchmarks page.

📋Roadmap

We welcome community proposals to push advanced SD to larger clusters, embrace MoE targets, or integrate training-time tools.

  • Dynamic TP Size: Support for dynamic TP sizes (e.g., TP=6/7).
  • Draft Model Temperature: Support for non-zero temperature.
  • Continuous Batching: Integration and chunked prefill.
  • Advanced Adaptive Gamma: Dynamic gamma tuning.
  • PEARL-2: Support for fine-tuning a PEARL-specific drafter.

🙏Acknowledgements

This project is built on the excellent foundations of nano-vllm and inspired by the PEARL paper. The nano-PEARL logo was designed by Veo 3.