Prefill vs Decode Architecture
LLM inference is mathematically divided into a compute-bound Prefill phase and a memory-bound Decode phase.
Source: mortalapps.com
- Prefill utilizes dense matrix multiplication (GEMM); Decode utilizes low-intensity matrix-vector multiplication (GEMV).
- Colocating these phases on the same GPU causes catastrophic resource interference.
- Disaggregated architectures isolate these phases onto specialized hardware pools.
Why This Matters
When a large batch of ongoing decode requests, each generating one token per iteration, is suddenly joined by a new request requiring a prefill of several thousand tokens, the GPU is tied up computing the prefill GEMM. The decode sequences stall and miss their inter-token latency SLAs. Understanding this phase imbalance is the foundation of large-scale serving architecture.
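The stall described above can be illustrated with a toy timeline. This is a minimal sketch, not a model of any real scheduler: the function name `inter_token_gaps`, the 25 ms decode step, the 400 ms prefill, and the 50 ms latency target are all hypothetical values chosen for illustration, and the prefill is assumed to run to completion with no chunking or preemption.

```python
def inter_token_gaps(num_steps, decode_step_ms, prefill_stalls):
    """Return the gap (ms) before each decode token on a colocated GPU.

    prefill_stalls maps a decode step index to the duration (ms) of a
    prefill that runs to completion immediately before that step.
    Assumption: no chunked prefill and no preemption.
    """
    gaps = []
    for step in range(num_steps):
        stall = prefill_stalls.get(step, 0.0)  # 0 ms when no prefill arrives
        gaps.append(stall + decode_step_ms)
    return gaps


# Hypothetical numbers: 25 ms per decode step, one 400 ms prefill at step 3.
gaps = inter_token_gaps(num_steps=6, decode_step_ms=25.0, prefill_stalls={3: 400.0})

slo_ms = 50.0  # assumed inter-token latency target
violations = [step for step, gap in enumerate(gaps) if gap > slo_ms]

print(gaps)        # [25.0, 25.0, 25.0, 425.0, 25.0, 25.0]
print(violations)  # [3]
```

Every decode sequence in the batch sees the same 425 ms gap at step 3, so a single long prefill violates the latency target for all of them at once.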
Core Intuition
Prefill is like reading an entire book to understand the context (reading all input tokens at once). Decode is like writing the sequel one word at a time, having to recall the entire context for every single new word. Reading is fast and parallelized; writing is slow, sequential, and heavily bottlenecked by memory retrieval.
Technical Deep Dive
During Prefill, the attention mechanism computes across all input tokens simultaneously. The arithmetic intensity is high, saturating the Tensor Cores. During Decode, each step processes only one new token per sequence, so the dense GEMMs degenerate into low-intensity matrix-vector products (GEMV). The entire multi-gigabyte weight matrix must be fetched from HBM to compute just one token, leaving the SMs mostly idle. Frameworks like DistServe [6] and Splitwise [7] address this by physically separating the engine: Prefill-only nodes crunch the heavy GEMMs and transmit the resulting KV cache over the network to Decode-only nodes that handle the GEMVs.
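The gap between the two phases can be made concrete with a back-of-the-envelope calculation. The sketch below estimates the arithmetic intensity (FLOPs per byte of memory traffic) of a single linear layer for a prefill-sized and a decode-sized input, and the size of the KV cache a prefill node would have to ship to a decode node. The helper names and the model shape (hidden size 4096, 32 layers, 32 KV heads of dimension 128, FP16) are illustrative assumptions, not figures from DistServe or Splitwise, and the traffic model ignores caching and kernel fusion.

```python
def linear_layer_intensity(num_tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte for Y = X @ W, with X of shape (num_tokens, d_in).

    Assumes weights, inputs, and outputs each cross the memory bus once.
    """
    flops = 2 * num_tokens * d_in * d_out  # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (
        d_in * d_out            # weight matrix W
        + num_tokens * d_in     # input activations X
        + num_tokens * d_out    # output activations Y
    )
    return flops / bytes_moved


def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of keys plus values stored for one sequence."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


# Hypothetical shape: 4096 x 4096 projection, FP16 (2 bytes per element).
prefill_ai = linear_layer_intensity(num_tokens=2048, d_in=4096, d_out=4096)
decode_ai = linear_layer_intensity(num_tokens=1, d_in=4096, d_out=4096)

# KV cache produced by a 2048-token prefill on the assumed model shape.
kv_bytes = kv_cache_bytes(num_tokens=2048, num_layers=32, num_kv_heads=32, head_dim=128)

print(prefill_ai)             # 1024.0
print(f"{decode_ai:.4f}")     # 0.9995
print(kv_bytes / 1024**3)     # 1.0 (GiB)
```

Under these assumptions the prefill GEMM performs roughly a thousand times more work per byte fetched than the decode GEMV, which is why one phase is limited by compute and the other by memory bandwidth. Batching B decode sequences raises the decode figure to roughly B FLOPs per byte, but it stays far below the prefill figure for typical batch sizes. The 1 GiB KV cache for a single request shows why the prefill-to-decode transfer is a design concern in disaggregated serving.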