KV Cache Architecture
The Key-Value (KV) Cache stores the K and V tensors of previous tokens to prevent recomputing them entirely during autoregressive generation.
Source: mortalapps.com- The Key-Value (KV) Cache stores the K and V tensors of previous tokens to prevent recomputing them entirely during autoregressive generation.
- KV Cache memory scales linearly with sequence length but accumulates aggressively, often exceeding model weight footprints at extreme scales.
- Precise memory calculations rely on Layers, Batch Size, KV Heads, Head Dimension, Sequence Length, and Precision bytes.
- Advanced scaling relies heavily on structural compression, low-precision quantization (FP8/INT8), and dynamic block paging.
Why This Matters
Autoregressive decoding generates text strictly one token at a time. Without caching, generating token requires recalculating the attention matrices for all preceding
tokens, resulting in a crippling
computational time complexity for the whole generation sequence. The KV cache trades memory for compute, dropping the time complexity to
. However, for modern workloads extending to 128,000 tokens, this trade-off hits hard physical VRAM limits rapidly, necessitating intense infrastructural optimization.
Core Intuition
Think of decoding a sentence as writing a book chapter by chapter. Instead of rereading all previous chapters every time you want to write a new word (which takes forever), you write a detailed, standardized summary (KV tensors) of each chapter and place it in a binder. To write the next word, you only look at your current thought (Q) and reference your binder (K, V). The binder, however, gets incredibly thick. Managing the physical space of that binder, optimizing the font size (quantization), and tearing out blank pages (paging) is the essence of KV cache engineering.
Technical Deep Dive
The size of the KV cache is dictated by a strict mathematical formula: KV_Size = 2 * L * B * G * D * S * bytes_per_element. For a model like Llama 3 70B (80 layers, 8 KV heads, 128 head dimension) operating at BF16 precision (2 bytes), a single token consumes 2 * 80 * 8 * 128 * 2 = 327,680 bytes (roughly 0.32 MB). At 131,072 tokens (128K context) for a single user, this demands ~42.9 GB. This static allocation is further complicated by the decoding process. As the sequence grows, the KV cache tensors are fundamentally dynamic. In standard PyTorch, this leads to fragmentation and over-allocation, as dynamic tensor growth requires contiguous memory blocks.