KV Cache Memory Management
The Key-Value (KV) cache stores past token representations during autoregressive LLM generation to prevent redundant sequence recalculation.
Source: mortalapps.com
- The core purpose is optimizing inference speed, trading a massive memory footprint for a dramatic reduction in required FLOPs.
- The primary optimization idea is PagedAttention, which partitions the KV cache into fixed-size blocks (pages), eliminating massive fragmentation waste.
- The most important engineering insight is that treating GPU memory like an operating system's virtual memory paging system pays off in serving throughput: vLLM's reported benchmarks show 14x-24x higher throughput than HuggingFace Transformers.
Why This Matters
In production LLM serving, the KV cache grows dynamically and unpredictably per request. A single sequence in a 13B parameter model can consume up to 1.7GB of VRAM. Traditional static allocation reserves space for the maximum possible sequence length upfront, wasting 60% to 80% of KV cache memory through internal fragmentation and over-reservation. How efficiently this memory is managed sets the maximum concurrent batch size a node can handle, and with it serving throughput and profitability.
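The 1.7GB figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes a LLaMA-13B-like shape (40 layers, hidden size 5120, FP16 values, 2048-token context); these dimensions, the function names, and the 500-token example request are illustrative assumptions, not values stated above.

```python
def kv_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: one key and one value vector per layer."""
    return 2 * num_layers * hidden_size * bytes_per_value


def static_waste_fraction(reserved_tokens: int, used_tokens: int) -> float:
    """Fraction of a max-length reservation that is never written."""
    return 1.0 - used_tokens / reserved_tokens


# Assumed LLaMA-13B-like dimensions (not taken from the text above).
NUM_LAYERS = 40
HIDDEN_SIZE = 5120
MAX_SEQ_LEN = 2048

per_token = kv_bytes_per_token(NUM_LAYERS, HIDDEN_SIZE)
per_sequence = per_token * MAX_SEQ_LEN

print(f"KV cache per token:    {per_token / 1e3:.1f} KB")
print(f"KV cache per sequence: {per_sequence / 1e9:.2f} GB")
# Hypothetical request that stops after 500 tokens but had 2048 reserved.
print(f"Wasted reservation:    {static_waste_fraction(MAX_SEQ_LEN, 500):.0%}")
```

Under these assumptions a full-length sequence comes to roughly 1.68GB, and a request that ends after 500 tokens leaves about three quarters of its static reservation unused, which is consistent with the 60% to 80% range quoted above.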
Core Intuition
Without paging, a restaurant reserves a massive 100-seat table for every customer who walks in, just in case they bring 99 friends. The restaurant (GPU) quickly fills up with only 4 people, turning away business. PagedAttention puts customers at efficient 4-seat tables (blocks). If more friends arrive, the waiter simply points them to another 4-seat table across the room. The system tracks who belongs to which party via a master ledger (the block table).
Technical Deep Dive
vLLM introduced PagedAttention to address this waste. The KV cache is divided into fixed-size physical blocks (typically 16 tokens each; for a 13B model in FP16, at roughly 800KB of KV cache per token, that is about 12.8MB per block summed across all layers). The design has three components:
Logical Blocks: Represent the linear, contiguous sequence of tokens for a request.
Physical Blocks: Distributed non-contiguously in the GPU's physical VRAM.
Block Table: A CPU-managed software mapping that translates logical blocks to physical blocks.
During generation, memory is allocated on demand, block by block. This virtually eliminates over-reservation and limits internal fragmentation to less than the size of a single block (e.g., at most 15 tokens of waste per sequence with 16-token blocks).
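The relationship between logical blocks, physical blocks, and the block table can be sketched in a few lines of plain Python. This is a minimal illustration of the bookkeeping only, not vLLM's implementation: the names `BlockAllocator` and `Sequence` and their methods are hypothetical, and no GPU memory or attention kernel is involved.

```python
from typing import List, Tuple

BLOCK_SIZE = 16  # tokens per block, matching the typical size quoted above


class BlockAllocator:
    """Hands out physical block ids from a free pool (hypothetical helper)."""

    def __init__(self, num_physical_blocks: int) -> None:
        self.free_blocks: List[int] = list(range(num_physical_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV cache blocks")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """One request: its token count and its logical-to-physical block table."""

    def __init__(self) -> None:
        self.num_tokens = 0
        self.block_table: List[int] = []  # index = logical block, value = physical block

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_index: int) -> Tuple[int, int]:
        """Translate a logical token position to (physical block id, offset)."""
        logical_block, offset = divmod(token_index, BLOCK_SIZE)
        return self.block_table[logical_block], offset

    def wasted_slots(self) -> int:
        """Reserved but unused token slots; always less than BLOCK_SIZE."""
        return len(self.block_table) * BLOCK_SIZE - self.num_tokens

    def release(self, allocator: BlockAllocator) -> None:
        """Return every block to the pool once the request finishes."""
        for block_id in self.block_table:
            allocator.free(block_id)
        self.block_table = []
        self.num_tokens = 0


# Two requests generating in lockstep share one pool of 8 physical blocks.
allocator = BlockAllocator(num_physical_blocks=8)
a, b = Sequence(), Sequence()
for _ in range(20):
    a.append_token(allocator)
    b.append_token(allocator)

print("a block table:", a.block_table)
print("b block table:", b.block_table)
print("a token 17 lives at:", a.physical_slot(17))
print("a wasted slots:", a.wasted_slots())
```

Because the two sequences grow in interleaved steps, each one ends up with physical blocks that are not adjacent to each other, yet `physical_slot` still resolves any logical position through the block table. Each 20-token sequence holds two blocks (32 slots), so its waste is 12 slots, under the one-block bound described above.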