Chunked Prefill Processing
Chunked prefill solves the latency spikes caused by mixing compute-heavy prefills with memory-bound decodes on the same GPU.
Source: mortalapps.com
- Chunked prefill splits long input prompts into fixed-size chunks (e.g., 2,048 tokens).
- Each chunk is batched with ongoing decode requests ("decode-maximal batching") so that per-iteration compute time stays roughly even.
- The result is "stall-free scheduling," which stabilizes Time-Between-Tokens (TBT).
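As a rough illustration of the splitting step, the sketch below cuts a prompt into fixed-size token ranges. The function name split_into_chunks and its parameters are hypothetical and are not taken from Sarathi-Serve or any serving framework; the 2,048-token chunk size is the example value from the list above.

```python
def split_into_chunks(prompt_len: int, chunk_size: int = 2048) -> list[tuple[int, int]]:
    """Return (start, end) token ranges covering the prompt.

    Hypothetical helper for illustration only. Every chunk has
    chunk_size tokens except possibly the last, which holds the remainder.
    """
    if prompt_len < 0 or chunk_size <= 0:
        raise ValueError("prompt_len must be >= 0 and chunk_size must be > 0")
    return [
        (start, min(start + chunk_size, prompt_len))
        for start in range(0, prompt_len, chunk_size)
    ]


chunks = split_into_chunks(10_000)
print(len(chunks))   # 5 chunks
print(chunks[-1])    # (8192, 10000): the last chunk is shorter
```

A 10,000-token prompt therefore takes five scheduler iterations to prefill instead of one long one, and ongoing decodes can make progress in each of those iterations.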
Why This Matters
If a cluster lacks the hardware scale to run a fully disaggregated prefill-decode architecture, prefill and decode must coexist on the same GPUs. Without chunking, processing a 10k-token prompt can block the GPU for hundreds of milliseconds, stalling token generation for every concurrent stream. Chunked prefill provides software-level phase separation instead.
Core Intuition
Instead of forcing the entire factory to halt while you process a single massive shipment (Prefill), you break the shipment into smaller pallets. You process one pallet per cycle alongside your normal lightweight background tasks (Decode). The factory flow remains smooth, and no background task is ever paused for long.
Technical Deep Dive
Formulated in Sarathi-Serve, chunked prefill enforces a strict per-iteration token budget (e.g., max_num_batched_tokens = 8192). If a prompt much longer than the chunk size arrives, it is split into chunks of ~2,000 tokens. The scheduler creates a "hybrid batch": it takes exactly one chunk (providing enough arithmetic intensity to keep the Tensor Cores busy) and fills the remaining batch capacity with decode requests, keeping the total within the token budget. Because the model weights are already being read from GPU memory to compute the prefill chunk, the decode sequences essentially "piggyback" on those same weight loads, generating tokens at little marginal latency cost.
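The hybrid-batch construction described above can be sketched as follows. This is a minimal illustration, not the Sarathi-Serve or vLLM implementation: Request, build_hybrid_batch, token_budget, and chunk_size are hypothetical names, each decode is assumed to cost one token of budget, and the sketch admits ongoing decodes first and then gives one prefill chunk whatever budget is left.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    prompt_len: int
    prefilled: int = 0  # prompt tokens already processed


def build_hybrid_batch(decoding, waiting, token_budget=8192, chunk_size=2048):
    """Build one iteration's batch: ongoing decodes plus at most one prefill chunk.

    Hypothetical sketch. `decoding` is a list of requests in the decode phase,
    `waiting` is a deque of requests whose prompts are not fully prefilled.
    Returns (decode_ids, chunk), where chunk is (rid, start, end) or None.
    """
    # Assumption: every decode request contributes exactly one token.
    decodes = decoding[:token_budget]
    remaining = token_budget - len(decodes)

    chunk = None
    if waiting and remaining > 0:
        req = waiting[0]
        # The chunk is capped by the chunk size, the leftover budget,
        # and the number of prompt tokens still to be prefilled.
        n = min(chunk_size, remaining, req.prompt_len - req.prefilled)
        chunk = (req.rid, req.prefilled, req.prefilled + n)
        req.prefilled += n
        if req.prefilled >= req.prompt_len:
            # Prefill finished: the request decodes from the next iteration on.
            decoding.append(waiting.popleft())

    return [r.rid for r in decodes], chunk


decoding = [Request(f"d{i}", prompt_len=16, prefilled=16) for i in range(3)]
waiting = deque([Request("long", prompt_len=5000)])
for step in range(4):
    decode_ids, chunk = build_hybrid_batch(decoding, waiting)
    print(step, decode_ids, chunk)
```

In this run the three decode requests appear in every iteration while the 5,000-token prompt is prefilled over three of them; in the fourth iteration the new request joins the decode set. A real scheduler would also have to account for KV-cache capacity and preemption, which are omitted here.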