KV Cache Memory Management

The Key-Value (KV) cache stores past token representations during autoregressive LLM generation to prevent redundant sequence recalculation.

Source: mortalapps.com
TL;DR
  • The Key-Value (KV) cache stores past token representations during autoregressive LLM generation to prevent redundant sequence recalculation.
  • Its core purpose is faster inference: it trades a large memory footprint for far fewer FLOPs per generated token.
  • The primary optimization idea is PagedAttention, which partitions the KV cache into fixed-size blocks (pages) and so removes nearly all fragmentation waste.
  • The most important engineering insight is that treating GPU memory like an operating system's virtual-memory paging system lets far more requests fit on one GPU; the vLLM project reports 14x-24x higher serving throughput than a HuggingFace Transformers baseline.

Why This Matters

In production LLM serving, the KV cache grows dynamically and unpredictably per request. A single sequence in a 13B parameter model can consume up to 1.7GB of VRAM at full context length. Traditional static allocation reserves space for the maximum possible sequence length upfront, wasting 60% to 80% of the memory set aside for the KV cache through internal fragmentation and over-reservation. Reducing that waste raises the maximum concurrent batch size a node can handle, which in turn drives serving profitability.
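The figures above can be checked with a little arithmetic. The sketch below assumes a 13B-class model shape (40 layers, hidden size 5120) stored in FP16 with a 2048-token maximum context; those values, and the 500-token request, are illustrative assumptions rather than numbers taken from this article.

```python
def kv_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: one key and one value vector per layer."""
    return 2 * num_layers * hidden_size * bytes_per_value


# Assumed 13B-class shape (40 layers, hidden size 5120) stored in FP16.
per_token = kv_bytes_per_token(num_layers=40, hidden_size=5120)
max_len = 2048     # assumed maximum context length
actual_len = 500   # hypothetical request that stops early

reserved = per_token * max_len   # static allocation reserves the maximum upfront
used = per_token * actual_len    # what the request actually needed
wasted_fraction = 1 - used / reserved

print(f"per token: {per_token / 1024:.0f} KiB")
print(f"full sequence: {reserved / 1e9:.2f} GB")
print(f"wasted by static reservation: {wasted_fraction:.0%}")
```

Under these assumptions each token costs 800 KiB, a full-length sequence costs about 1.68 GB, and a request that stops at 500 tokens leaves roughly three quarters of its reservation unused.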

Core Intuition

Without paging, a restaurant reserves a massive 100-seat table for every customer who walks in, just in case they bring 99 friends. The restaurant (GPU) quickly fills up with only 4 people, turning away business. PagedAttention puts customers at efficient 4-seat tables (blocks). If more friends arrive, the waiter simply points them to another 4-seat table across the room. The system tracks who belongs to which party via a master ledger (the block table).

Technical Deep Dive

vLLM introduced PagedAttention to resolve this fundamental architectural flaw. The KV cache is divided into fixed-size physical blocks (typically 16 tokens, which equates to roughly 12.5 MiB across all layers for a 13B model in FP16, at about 800 KiB per token). The architecture uses three components:

Logical Blocks: Represent the linear, contiguous sequence of tokens for a request.

Physical Blocks: Distributed non-contiguously in the GPU's physical VRAM.

Block Table: A CPU-managed software mapping that translates logical blocks to physical blocks.

During generation, memory is allocated on demand, block by block. This virtually eliminates over-reservation and limits internal fragmentation to less than one block per sequence (e.g., at most 15 tokens of waste with 16-token blocks).
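The three components can be sketched as a toy allocator. This is a minimal illustration of the idea, not vLLM's actual implementation: the class name BlockAllocator, its methods, and the request ids are all hypothetical, and real systems store tensors in the blocks rather than just counting tokens.

```python
BLOCK_SIZE = 16  # tokens per block, as in the text


class BlockAllocator:
    """Toy pool of physical blocks; names are illustrative, not vLLM's API."""

    def __init__(self, num_physical_blocks: int) -> None:
        self.free_blocks = list(range(num_physical_blocks))
        # Block table: request id -> physical block ids, in logical order.
        self.block_tables: dict[str, list[int]] = {}
        self.num_tokens: dict[str, int] = {}

    def append_token(self, request_id: str) -> tuple[int, int]:
        """Reserve room for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(request_id, [])
        n = self.num_tokens.get(request_id, 0)
        if n % BLOCK_SIZE == 0:  # last block is full, or no block exists yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = n + 1
        # Logical block index n // BLOCK_SIZE is translated via the block table.
        return table[n // BLOCK_SIZE], n % BLOCK_SIZE

    def free(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)

    def wasted_slots(self, request_id: str) -> int:
        """Unused token slots in the last block (internal fragmentation)."""
        return len(self.block_tables[request_id]) * BLOCK_SIZE - self.num_tokens[request_id]


alloc = BlockAllocator(num_physical_blocks=8)
for _ in range(18):
    alloc.append_token("req-a")  # 18 tokens need two blocks
for _ in range(5):
    alloc.append_token("req-b")  # 5 tokens need one block

print(alloc.block_tables)
print(alloc.wasted_slots("req-a"), alloc.wasted_slots("req-b"))
```

Each request only ever holds the blocks it has started to fill, so the waste per sequence stays below BLOCK_SIZE tokens regardless of how long the sequence eventually becomes.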

Key Takeaways

  • The KV cache grows with every generated token, causing massive fragmentation under legacy contiguous allocation schemes.
  • PagedAttention partitions the cache into blocks mapped via a Block Table, enabling OS-style virtual memory paging in software.
  • Block-based caching enables complex memory sharing patterns, drastically reducing the cost of system prompts.
  • CUDA VMM APIs offer a hardware-native alternative to software PagedAttention by mapping scattered physical 2MB pages into contiguous virtual arrays.
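The memory-sharing takeaway can be illustrated with reference counting: several requests point their block tables at the same physical blocks for a common system prompt. The sketch below is a simplified assumption of how such sharing might work; SharedBlockPool and its methods are hypothetical names, and it only covers read-only sharing of full blocks (no copy-on-write).

```python
BLOCK_SIZE = 16  # tokens per block


class SharedBlockPool:
    """Toy reference-counted block pool; illustrative only."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))
        self.ref_count: dict[int, int] = {}

    def allocate(self) -> int:
        """Take one free block and give it a single owner."""
        block = self.free_blocks.pop()
        self.ref_count[block] = 1
        return block

    def share(self, blocks: list[int]) -> list[int]:
        """Let another request reference existing blocks instead of copying them."""
        for b in blocks:
            self.ref_count[b] += 1
        return list(blocks)

    def release(self, blocks: list[int]) -> None:
        """Drop one reference per block; free a block when nobody uses it."""
        for b in blocks:
            self.ref_count[b] -= 1
            if self.ref_count[b] == 0:
                del self.ref_count[b]
                self.free_blocks.append(b)


pool = SharedBlockPool(num_blocks=64)

# A hypothetical 64-token system prompt fills exactly four blocks.
prompt_blocks = [pool.allocate() for _ in range(64 // BLOCK_SIZE)]

# Nine more requests reuse the same prompt blocks in their own block tables.
tables = [prompt_blocks] + [pool.share(prompt_blocks) for _ in range(9)]

blocks_in_use = len(pool.ref_count)
blocks_without_sharing = len(tables) * len(prompt_blocks)
print(blocks_in_use, blocks_without_sharing)
```

In this example ten requests hold the system prompt in 4 physical blocks instead of 40; blocks return to the free list only once every request that references them has released them.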