LLM Inference Systems

Prefill vs Decode Architecture

LLM inference splits into a compute-bound Prefill phase and a memory-bandwidth-bound Decode phase.

Published June 1, 2026 · By MortalApps · 3 min read · ~532 words

TL;DR

LLM inference splits into a compute-bound Prefill phase and a memory-bandwidth-bound Decode phase.
Prefill uses dense matrix-matrix multiplication (GEMM); Decode uses low-intensity matrix-vector multiplication (GEMV).
Colocating these phases on the same GPU causes severe resource interference: a long prefill stalls every decode sequence sharing the device.
Disaggregated architectures isolate these phases onto specialized hardware pools.


Why This Matters

When a large batch of ongoing decode requests (each generating one token per iteration) is suddenly joined by a new request requiring a multi-thousand-token prefill, the GPU is tied up computing the prefill GEMM. The decode sequences stall until it finishes, missing their inter-token latency SLAs. Understanding this phase imbalance is the foundation of large-scale serving architecture.
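The stall described above can be illustrated with a toy timeline. This is a minimal sketch, not a model of any real scheduler: the function name `inter_token_gaps`, the 25 ms decode step, the 400 ms prefill, and the 50 ms latency target are all illustrative assumptions, and it assumes the prefill runs to completion before decoding resumes.

```python
def inter_token_gaps(decode_step_ms, prefill_ms, prefill_at_step, num_steps):
    """Gap (ms) before each decode token when one prefill runs in-line
    on the same GPU at iteration `prefill_at_step`."""
    gaps = []
    for step in range(num_steps):
        gap = decode_step_ms
        if step == prefill_at_step:
            # Assumption: decode waits for the whole prefill to finish.
            gap += prefill_ms
        gaps.append(gap)
    return gaps


# Illustrative numbers only (not measurements).
gaps = inter_token_gaps(decode_step_ms=25.0, prefill_ms=400.0,
                        prefill_at_step=3, num_steps=6)
slo_ms = 50.0  # hypothetical inter-token latency target
violations = [i for i, g in enumerate(gaps) if g > slo_ms]

print(gaps)        # [25.0, 25.0, 25.0, 425.0, 25.0, 25.0]
print(violations)  # [3]
```

Every decode sequence in the batch sees the same 425 ms gap at step 3, which is why a single long prompt can break the latency target for all of them at once.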

Core Intuition

Prefill is like reading an entire book to understand the context (reading all input tokens at once). Decode is like writing the sequel one word at a time, having to recall the entire context for every single new word. Reading is fast and parallelized; writing is slow, sequential, and heavily bottlenecked by memory retrieval.

Technical Deep Dive

During Prefill, the attention mechanism computes across all input tokens simultaneously. The arithmetic intensity is high, which keeps the Tensor Cores saturated. During Decode, each sequence contributes only one new token per step, so the dense GEMMs collapse into low-intensity matrix-vector products (GEMV). The entire multi-gigabyte set of weight matrices must be fetched from HBM to compute just one token, leaving the SMs mostly idle. Frameworks like DistServe [6] and Splitwise [7] address this by physically splitting the engine: Prefill-only nodes crunch the heavy GEMMs and transmit the resulting KV cache over the network to Decode-only nodes that handle the GEMVs.
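The gap in arithmetic intensity can be estimated with a back-of-the-envelope calculation for a single linear layer. The sketch below assumes a simplified cost model (each weight and activation element is moved exactly once, fp16 storage, no caching effects); the function name and the 4096-wide layer are illustrative choices, not figures from any particular model.

```python
def arithmetic_intensity(num_tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte moved for Y = X @ W, with X of shape
    (num_tokens, d_in) and W of shape (d_in, d_out)."""
    # One multiply and one add per weight element per token.
    flops = 2 * num_tokens * d_in * d_out
    # Simplified traffic: read W, read X, write Y, once each.
    elems_moved = d_in * d_out + num_tokens * d_in + num_tokens * d_out
    return flops / (bytes_per_elem * elems_moved)


prefill_ai = arithmetic_intensity(num_tokens=2048, d_in=4096, d_out=4096)
decode_ai = arithmetic_intensity(num_tokens=1, d_in=4096, d_out=4096)

print(prefill_ai)  # 1024.0 FLOPs per byte (GEMM)
print(decode_ai)   # just under 1 FLOP per byte (GEMV)
```

Under these assumptions the same layer does roughly a thousand times more arithmetic per byte fetched during prefill than during single-sequence decode. Batching decode requests raises `num_tokens` and recovers some intensity, but each sequence still needs its own KV cache read.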

Key Takeaways

Prefill is compute-bound (GEMM); Decode is memory-bandwidth bound (GEMV).

Mixing them on a single GPU causes severe latency spikes for decode sequences.

Disaggregated architectures physically separate the tasks onto different GPUs.

KV cache transfer speed between the Prefill and Decode pools is the limiting factor of disaggregated architectures.
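To get a feel for why the KV cache transfer matters, its size can be estimated from the model shape. This is a rough sketch: the model dimensions (32 layers, 8 KV heads, head dimension 128), the 4096-token prompt, and the 25 GB/s link are hypothetical values chosen for illustration, and the transfer estimate ignores protocol overhead and any overlap with compute.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Size of the KV cache for one sequence (fp16 by default)."""
    # Factor 2: one key tensor plus one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * num_tokens * bytes_per_elem)


def transfer_ms(num_bytes, link_gbytes_per_s):
    """Idealized transfer time in milliseconds over a link of the given
    bandwidth (GB/s, decimal), ignoring latency and overhead."""
    return num_bytes / (link_gbytes_per_s * 1e9) * 1e3


# Hypothetical model shape and link speed, for illustration only.
size = kv_cache_bytes(num_tokens=4096, num_layers=32,
                      num_kv_heads=8, head_dim=128)
print(size)                              # 536870912 bytes (512 MiB)
print(round(transfer_ms(size, 25.0), 2)) # 21.47 ms
```

With these assumed numbers, moving one prompt's cache costs about as much time as a decode step, so the interconnect between the two pools deserves the same attention as the GPUs themselves.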
