Long-Context Inference Scaling

Long-context inference scaling separates and optimizes the intensely compute-bound Prefill phase from the memory-bound Decode phase.

Source: mortalapps.com
TL;DR
  • Long-context inference scaling separates and optimizes the intensely compute-bound Prefill phase from the memory-bound Decode phase.
  • Chunked Prefill splits massive prompts into smaller segments and batches them with decoding requests, so the compute-heavy prefill chunks use the Tensor Core capacity that memory-bound decoding leaves idle.
  • Disaggregated Prefill/Decode places prefill and decoding on entirely separate GPU instances, allowing independent optimization for TTFT (Time-To-First-Token) and TPOT (Time-Per-Output-Token).
  • The KV Cache is transferred from the prefill instances to the decode instances over high-speed interconnects (RDMA/InfiniBand) via NVIDIA NIXL, stitching the two phases back together efficiently.

Why This Matters

As contexts scale past 100K tokens, the prefill phase takes seconds to compute. In a traditional unified serving system, a single massive prefill request monopolizes the GPU, pausing all other concurrent decoding requests and causing unacceptable stutters in generation (high tail latency). Scaling inference means breaking this interference, maximizing throughput (via large decode batches) without sacrificing interactivity (Time-To-First-Token).

Core Intuition

Imagine a restaurant kitchen. Prefilling is like roasting a 20-pound turkey (takes a long time, massive compute). Decoding is like plating salads (fast, but wait-staff bound).

Chunked Prefill means the chef roasts the turkey in slices, plating a salad between roasting each slice. The salads keep moving, and the turkey finishes smoothly.

Disaggregated Inference means you build two separate kitchens. Kitchen A only roasts turkeys. Kitchen B only plates salads. When Kitchen A finishes a turkey, it slides it over a high-speed conveyor belt (NIXL KV Transfer) to Kitchen B for plating.

Technical Deep Dive

Chunked Prefill divides a large prompt (say, 12K tokens) into manageable chunks (e.g., 2K tokens). The scheduler creates a 1D query layout, batching the 2K prefill tokens alongside several single-token decode requests in the same forward pass. Because decoding underutilizes Tensor Cores (it spends most of its time waiting on memory bandwidth), the compute-heavy prefill chunk can use the otherwise idle Tensor Cores. This raises hardware utilization substantially without causing generation stalls.

Disaggregated Inference takes this further by running the two phases on separate instances. P-Heavy (Prefill) instances optimize for TTFT using high Tensor Parallelism. D-Heavy (Decode) instances optimize for TPOT using Data Parallelism and large PagedAttention batch sizes. The computed KV Cache must then move from P to D. NVIDIA NIXL operates over RDMA/InfiniBand at wire speed (e.g., 400 Gbps) and uses asynchronous send/recv operations, which keeps the transfer off the critical path of model execution.
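The chunked prefill scheduling described above can be sketched in a few lines of Python. This is a minimal illustration, not the scheduler of any particular serving engine: the `Request` class, the `schedule_step` function, and the decode-first policy are all assumptions made for the example. Each step returns a flat list of `(request id, query tokens)` pairs, which stands in for the 1D query layout.

```python
from dataclasses import dataclass


@dataclass
class Request:  # hypothetical request record, for illustration only
    rid: str
    prompt_len: int      # prompt tokens that need prefill
    prefilled: int = 0   # prompt tokens already processed

    @property
    def decoding(self) -> bool:
        return self.prefilled >= self.prompt_len


def schedule_step(requests, chunk_size=2048):
    """Build one forward-pass batch as a flat list of (rid, n_query_tokens)."""
    batch = []
    # Assumed policy: decode first. Every request that is already generating
    # gets exactly one token, so a long prompt cannot stall it.
    for r in requests:
        if r.decoding:
            batch.append((r.rid, 1))
    # Then spend up to chunk_size prefill tokens on requests still in prefill.
    budget = chunk_size
    for r in requests:
        if budget == 0:
            break
        if not r.decoding:
            n = min(budget, r.prompt_len - r.prefilled)
            batch.append((r.rid, n))
            r.prefilled += n
            budget -= n
    return batch


# Two ongoing chats plus one 12K-token prompt, prefilled in 2K chunks.
reqs = [
    Request("chat-a", 8, prefilled=8),
    Request("chat-b", 8, prefilled=8),
    Request("long", 12 * 1024),
]
steps = 0
while not reqs[2].decoding:
    batch = schedule_step(reqs)
    steps += 1
print(steps, batch)
```

In this toy run the 12K prompt finishes prefill in six steps, and both chat requests receive one token in every one of those steps. A real scheduler would also account for KV cache capacity, preemption, and newly arriving requests.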

Key Takeaways

  • Massive prompts cause "generation stalls" in traditional unified serving engines.
  • Chunked prefill batches small segments of prompts with decodes, so prefill compute fills the Tensor Core capacity that memory-bound decoding leaves idle.
  • Disaggregated inference physically splits Prefill (TTFT optimization) and Decode (TPOT optimization) across network boundaries.
  • NVIDIA NIXL provides the RDMA transport layer to move massive KV caches between GPU instances at wire speed.
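
To get a feel for how large a KV cache transfer is, the back-of-the-envelope calculation below estimates its size and an idealized transfer time. The model shape (32 layers, 8 KV heads, head dimension 128, FP16) is an assumed example and does not come from the article. The functions `kv_cache_bytes` and `transfer_seconds` are hypothetical helpers, not part of NIXL. The time estimate assumes a fully saturated link with no protocol overhead, so it is a lower bound.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache for one sequence.

    The factor 2 covers the separate K and V tensors stored per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


def transfer_seconds(n_bytes, link_gbps):
    """Idealized transfer time: bits to send divided by link rate in bits/s."""
    return n_bytes * 8 / (link_gbps * 1e9)


# Assumed example model shape (not taken from the article).
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

per_token = kv_cache_bytes(1, **shape)
total = kv_cache_bytes(100_000, **shape)
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache for 100K tokens: {total / 1e9:.1f} GB")
print(f"Ideal transfer at 400 Gbps: {transfer_seconds(total, 400):.3f} s")
```

Under these assumptions a 100K-token context carries roughly 13 GB of KV cache, which takes about a quarter of a second to move at 400 Gbps. That is why the transfer needs to be asynchronous and overlapped with model execution rather than placed on the critical path.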