Prefill vs Decode Architecture

LLM inference splits into two phases with opposite bottlenecks: a compute-bound Prefill phase and a memory-bandwidth-bound Decode phase.

Source: mortalapps.com
TL;DR
  • LLM inference splits into a compute-bound Prefill phase and a memory-bandwidth-bound Decode phase.
  • Prefill utilizes dense matrix multiplication (GEMM); Decode utilizes low-intensity matrix-vector multiplication (GEMV).
  • Colocating these phases on the same GPU causes severe resource interference.
  • Disaggregated architectures isolate these phases onto specialized hardware pools.

Why This Matters

When a large batch of ongoing decode requests (each generating one token per iteration) is suddenly joined by a new request requiring a multi-thousand-token prefill, the GPU is tied up computing the prefill GEMMs. The decode sequences stall and miss their inter-token latency SLAs. Understanding this phase imbalance is the foundation of large-scale serving architecture.
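
The stall can be made concrete with a toy timeline. This is a minimal sketch, not a real scheduler: the function name `inter_token_gaps` and the timings (30 ms per decode step, 400 ms for the prefill, a 50 ms inter-token SLA) are assumptions chosen purely for illustration.

```python
def inter_token_gaps(num_steps, decode_ms, prefill_ms, prefill_at_step):
    """Gap (in ms) before each decode token when a single prefill is run
    on the same GPU immediately before step `prefill_at_step`."""
    gaps = []
    for step in range(num_steps):
        gap = decode_ms
        if step == prefill_at_step:
            # The decode batch has to wait for the whole prefill to finish.
            gap += prefill_ms
        gaps.append(gap)
    return gaps


# Hypothetical timings, for illustration only.
SLA_MS = 50.0
gaps = inter_token_gaps(num_steps=5, decode_ms=30.0, prefill_ms=400.0, prefill_at_step=2)
violations = [g for g in gaps if g > SLA_MS]

print(gaps)        # [30.0, 30.0, 430.0, 30.0, 30.0]
print(violations)  # [430.0]
```

Every sequence in the decode batch sees the same 430 ms gap, so a single colocated prefill breaks the SLA for all of them at once.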

Core Intuition

Prefill is like reading an entire book to understand the context (reading all input tokens at once). Decode is like writing the sequel one word at a time, having to recall the entire context for every single new word. Reading is fast and parallelized; writing is slow, sequential, and heavily bottlenecked by memory retrieval.

Technical Deep Dive

During Prefill, the attention mechanism computes across all input tokens simultaneously. The arithmetic intensity is high, which can saturate the Tensor Cores. During Decode, each sequence contributes only one new token per step, so the large matrix multiplications collapse into matrix-vector products (GEMV). The entire multi-gigabyte set of weight matrices must be fetched from HBM to compute just one token, leaving the SMs mostly idle. Systems like DistServe [6] and Splitwise [7] address this by physically splitting the engine: Prefill-only nodes crunch the heavy GEMMs and transmit the resulting KV cache over the network to Decode-only nodes that handle the GEMVs.
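
The GEMM/GEMV gap can be estimated with a back-of-the-envelope arithmetic-intensity calculation for a single weight matrix. The sketch below assumes fp16 storage (2 bytes per element), a 4096 x 4096 layer, and a hypothetical accelerator with 300 TFLOP/s of compute and 2 TB/s of memory bandwidth; none of these figures come from the article, and the helper names are illustrative.

```python
def arithmetic_intensity(num_tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte moved for Y = X @ W,
    with X of shape (num_tokens, d_in) and W of shape (d_in, d_out)."""
    # One multiply and one add per weight, per token.
    flops = 2 * num_tokens * d_in * d_out
    bytes_moved = bytes_per_elem * (
        d_in * d_out            # read the weights once
        + num_tokens * d_in     # read the input activations
        + num_tokens * d_out    # write the output activations
    )
    return flops / bytes_moved


# Hypothetical accelerator (illustrative numbers, not a real part).
PEAK_FLOPS = 300e12       # FLOP/s
PEAK_BANDWIDTH = 2e12     # bytes/s
RIDGE = PEAK_FLOPS / PEAK_BANDWIDTH  # FLOPs/byte needed to keep compute busy


def bound(intensity):
    """Classify a kernel using the simple roofline rule."""
    return "compute-bound" if intensity >= RIDGE else "memory-bound"


prefill_ai = arithmetic_intensity(num_tokens=2048, d_in=4096, d_out=4096)
decode_ai = arithmetic_intensity(num_tokens=1, d_in=4096, d_out=4096)

print(f"prefill: {prefill_ai:.1f} FLOPs/byte -> {bound(prefill_ai)}")
print(f"decode:  {decode_ai:.2f} FLOPs/byte -> {bound(decode_ai)}")
```

Under these assumptions the 2048-token prefill lands far above the ridge point while the single-token decode sits near 1 FLOP per byte. Batching more decode sequences raises `num_tokens` and therefore the intensity, which is why decode pools favor large batches.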

Key Takeaways

  • Prefill is compute-bound (GEMM); Decode is memory-bandwidth bound (GEMV).
  • Mixing them on a single GPU causes severe latency spikes for decode sequences.
  • Disaggregated architectures physically separate the tasks onto different GPUs.
  • KV Cache transfer speed is the limiting factor of disaggregated architectures.
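
To see why KV cache transfer matters, it helps to estimate how much data a prefill node must ship per request. The sketch below assumes a hypothetical model (32 layers, 32 KV heads, head dimension 128, fp16 cache) and a hypothetical 25 GB/s link between the pools; the function names and all numbers are illustrative assumptions rather than measurements.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache for one sequence.
    The leading factor of 2 accounts for storing both keys and values."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * num_tokens


def transfer_seconds(num_bytes, link_bytes_per_sec):
    """Idealized transfer time, ignoring protocol overhead and any overlap
    of communication with compute."""
    return num_bytes / link_bytes_per_sec


# Hypothetical model and interconnect, for illustration only.
size = kv_cache_bytes(num_tokens=4096, num_layers=32, num_kv_heads=32, head_dim=128)
seconds = transfer_seconds(size, link_bytes_per_sec=25e9)

print(size)                      # 2147483648 bytes (2 GiB)
print(f"{seconds * 1e3:.1f} ms")
```

With these assumed figures a single 4096-token prompt produces 2 GiB of cache, so the link bandwidth between the pools directly sets how quickly a decode node can start generating.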