← Infrastructure Transformer Systems
Infrastructure

KV Cache Architecture

The Key-Value (KV) Cache stores the K and V tensors of previous tokens to prevent recomputing them entirely during autoregressive generation.

Source: mortalapps.com
TL;DR
  • The Key-Value (KV) Cache stores the K and V tensors of previous tokens to prevent recomputing them entirely during autoregressive generation.
  • KV Cache memory scales linearly with sequence length but accumulates aggressively, often exceeding model weight footprints at extreme scales.
  • Precise memory calculations rely on Layers, Batch Size, KV Heads, Head Dimension, Sequence Length, and Precision bytes.
  • Advanced scaling relies heavily on structural compression, low-precision quantization (FP8/INT8), and dynamic block paging.

Why This Matters

Autoregressive decoding generates text strictly one token at a time. Without caching, generating token requires recalculating the attention matrices for all preceding tokens, resulting in a crippling computational time complexity for the whole generation sequence. The KV cache trades memory for compute, dropping the time complexity to . However, for modern workloads extending to 128,000 tokens, this trade-off hits hard physical VRAM limits rapidly, necessitating intense infrastructural optimization.

Core Intuition

Think of decoding a sentence as writing a book chapter by chapter. Instead of rereading all previous chapters every time you want to write a new word (which takes forever), you write a detailed, standardized summary (KV tensors) of each chapter and place it in a binder. To write the next word, you only look at your current thought (Q) and reference your binder (K, V). The binder, however, gets incredibly thick. Managing the physical space of that binder, optimizing the font size (quantization), and tearing out blank pages (paging) is the essence of KV cache engineering.

Technical Deep Dive

The size of the KV cache is dictated by a strict mathematical formula: KV_Size = 2 * L * B * G * D * S * bytes_per_element. For a model like Llama 3 70B (80 layers, 8 KV heads, 128 head dimension) operating at BF16 precision (2 bytes), a single token consumes 2 * 80 * 8 * 128 * 2 = 327,680 bytes (roughly 0.32 MB). At 131,072 tokens (128K context) for a single user, this demands ~42.9 GB. This static allocation is further complicated by the decoding process. As the sequence grows, the KV cache tensors are fundamentally dynamic. In standard PyTorch, this leads to fragmentation and over-allocation, as dynamic tensor growth requires contiguous memory blocks.

Key Takeaways

KV caching prevents an computational explosion during autoregressive decoding.
KV size scales linearly with tokens, demanding aggressive optimization like GQA, MLA, and Quantization.
Memory bandwidth when reading the cache is the primary bottleneck during decode phases.
FP8 cache quantization is practically "free money" in production, halving VRAM requirements with sub-1% accuracy costs.