LLM Inference Systems

Chunked Prefill Processing

Chunked prefill solves the latency spikes caused by mixing compute-heavy prefills with memory-bound decodes on the same GPU.

Published June 1, 2026 · By MortalApps · 3 min read · ~549 words

TL;DR

Running compute-heavy prefills and memory-bound decodes on the same GPU causes latency spikes; chunked prefill is a scheduling technique that mitigates them.
It splits long input prompts into fixed-size chunks (e.g., 2,048 tokens).
Each chunk is batched together with ongoing decode requests ("decode-maximal batching"), which evens out the compute time per iteration.
The result is "stall-free scheduling," which stabilizes Time-Between-Tokens (TBT).

Why This Matters

If a cluster lacks the hardware scale to run a fully disaggregated prefill-decode architecture, prefill and decode must coexist on the same GPUs. Without chunking, processing a 10k-token prompt blocks the GPU for hundreds of milliseconds, and every concurrent decode stream stutters while it waits. Chunked prefill provides software-level phase separation instead.

Core Intuition

Instead of halting the entire factory while you process one massive shipment (Prefill), you break the shipment into smaller pallets. You process one pallet per cycle alongside your normal lightweight background tasks (Decode). The factory flow stays smooth, and no background task is paused for long.
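In token terms, the "pallets" are contiguous slices of the prompt. The following is a minimal sketch of that split; the function name `split_into_chunks` and the example sizes are illustrative assumptions, not part of any particular serving framework.

```python
def split_into_chunks(prompt_len: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return half-open (start, end) token ranges covering a prompt.

    Hypothetical helper for illustration: every chunk has `chunk_size`
    tokens except possibly the last, which holds the remainder.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, prompt_len))
        for start in range(0, prompt_len, chunk_size)
    ]


# Example: a 10,000-token prompt with 2,048-token chunks.
for start, end in split_into_chunks(10_000, 2_048):
    print(f"tokens [{start}, {end}) -> {end - start} tokens")
```

Each range would be processed in a separate scheduler iteration, so later chunks attend to the KV cache written by earlier ones.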

Technical Deep Dive

As formulated in Sarathi-Serve, chunked prefill enforces a strict per-iteration token budget (e.g., max_num_batched_tokens = 8192). When a prompt of several thousand tokens arrives, it is split into chunks of ~2,000 tokens. The scheduler builds a "hybrid batch": it takes one prefill chunk (which supplies enough arithmetic intensity to keep the Tensor Cores busy) and fills the rest of the batch with decode requests, as far as the token budget and KV-cache memory allow. Because the model weights are already being read from GPU memory to compute the prefill chunk, the decode sequences "piggyback" on those same weight reads and generate their tokens at little additional latency cost.
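The hybrid-batch construction described above can be sketched roughly as follows. This is a simplified illustration, not the Sarathi-Serve implementation: the `Request` class, the `build_hybrid_batch` function, and the decode-first ordering are assumptions made for the example, and KV-cache memory limits are ignored.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative."""
    req_id: str
    prompt_len: int
    prefilled: int = 0  # prompt tokens already processed

    @property
    def in_prefill(self) -> bool:
        return self.prefilled < self.prompt_len


def build_hybrid_batch(requests, token_budget, chunk_size):
    """Plan one iteration as a list of (req_id, num_tokens) pairs.

    Every ongoing decode contributes one token; then at most one
    prefill chunk is added, capped by the chunk size and by whatever
    token budget is left. Prefill progress is advanced in place.
    """
    batch = []
    budget = token_budget

    # Decodes first, so running streams are never skipped.
    for r in requests:
        if not r.in_prefill and budget > 0:
            batch.append((r.req_id, 1))
            budget -= 1

    # Exactly one prefill chunk per iteration (assumption of this sketch).
    for r in requests:
        if r.in_prefill and budget > 0:
            n = min(chunk_size, budget, r.prompt_len - r.prefilled)
            batch.append((r.req_id, n))
            r.prefilled += n
            break

    return batch


# Two requests already decoding, plus one new 5,000-token prompt.
reqs = [Request("chat-1", 100, 100), Request("chat-2", 50, 50), Request("long", 5_000)]
for step in range(4):
    print(step, build_hybrid_batch(reqs, token_budget=8_192, chunk_size=2_048))
```

In this sketch the long prompt finishes prefill over three iterations and then joins the decode set, while the two existing streams receive a token in every iteration.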

Key Takeaways

Colocating massive prefills and decodes causes severe latency spikes.

Chunked prefill splits the prompt across the token dimension.

It creates stall-free schedules by coalescing prefill chunks with ongoing decodes.

It trades a small increase in Time-To-First-Token (TTFT) for much more stable Time-Between-Tokens (TBT) and higher total serving capacity.
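The TTFT-for-TBT trade can be illustrated with a toy cost model. Everything here is an assumption made for the example: the linear cost function, the two constants, and the function names are hypothetical, and real prefill cost also grows with attention over earlier tokens, which this sketch ignores.

```python
# Toy cost model. Both constants are illustrative assumptions,
# not measurements from any GPU or serving system.
FIXED_MS = 10.0       # assumed fixed cost per iteration (weight reads, launches)
PER_TOKEN_MS = 0.03   # assumed incremental cost per prefill token


def iteration_ms(prefill_tokens: int) -> float:
    """Assumed duration of one iteration containing `prefill_tokens` tokens."""
    return FIXED_MS + PER_TOKEN_MS * prefill_tokens


def prefill_profile(prompt_len: int, chunk_size: int) -> tuple[float, float]:
    """Return (longest_iteration_ms, total_prefill_ms) for one prompt.

    The longest iteration bounds how long concurrent decodes wait (TBT);
    the total approximates when the prompt's first token can appear (TTFT).
    """
    full, rest = divmod(prompt_len, chunk_size)
    chunks = [chunk_size] * full + ([rest] if rest else [])
    times = [iteration_ms(c) for c in chunks]
    return max(times), sum(times)


stall_whole, ttft_whole = prefill_profile(10_000, chunk_size=10_000)
stall_chunked, ttft_chunked = prefill_profile(10_000, chunk_size=2_048)
print(f"unchunked: longest stall {stall_whole:.1f} ms, prefill total {ttft_whole:.1f} ms")
print(f"chunked:   longest stall {stall_chunked:.1f} ms, prefill total {ttft_chunked:.1f} ms")
```

Under these assumed constants, chunking shortens the longest single iteration substantially, while the total prefill time grows because the fixed per-iteration cost is paid once per chunk.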
