← Infrastructure Transformer Systems
Infrastructure

RadixAttention and Prefix Reuse

RadixAttention utilizes a prefix tree (radix tree) data structure to automatically detect, store, and reuse common KV cache prefixes across different

Source: mortalapps.com
TL;DR
  • RadixAttention utilizes a prefix tree (radix tree) data structure to automatically detect, store, and reuse common KV cache prefixes across different inference requests.
  • It eliminates massive redundant prefill recomputation for overlapping system prompts, chat histories, or few-shot examples.
  • Memory sharing happens seamlessly at the page granularity; overlapping branches in the tree point to shared physical memory blocks managed by PagedAttention.
  • Pioneered by SGLang, this mechanism drastically improves "goodput" and Time-To-First-Token (TTFT) for complex agentic workflows.

Why This Matters

In real-world LLM deployments (such as customer support bots, coding assistants, and agentic frameworks), a massive percentage of input tokens are identical across queries. System prompts, retrieved document contexts, and multi-turn chat histories are resubmitted repeatedly. Without RadixAttention, the server wastes enormous compute resources performing the prefill computation on the exact same text millions of times. Prefix reuse turns compute-bound redundant work into a near-instant cache lookup, saving massive compute costs.

Core Intuition

A standard cache maps a full request hash to a response. But language is incremental. If Request A is "What is the capital of France?" and Request B is "What is the capital of Germany?", they share the prefix "What is the capital of ". A Radix Tree stores these sequences as branching paths. The server traverses the tree with the new prompt, finds the longest matching path, and immediately pulls the pre-computed KV cache for those tokens. The engine only computes attention for the differing suffix ("Germany?").

Technical Deep Dive

Unlike a standard trie where edges represent single tokens, a radix tree compresses paths where nodes have only one child, labeling edges with token sequences of varying lengths. Each node in the tree maps directly to physical KV cache tensors managed by underlying systems like PagedAttention. To maintain efficiency and prevent fragmentation, prefix matching typically operates at the page granularity (e.g., page_size = 16). If a prefix is matched, the system snaps to the nearest 16-token boundary. Partial pages at the end of sequences are usually not cached to ensure efficient physical memory allocation.

Key Takeaways

RadixAttention uses a radix tree to track and reuse overlapping token sequences automatically.
It maps tree nodes directly to PagedAttention physical blocks, enabling memory sharing.
It eliminates massive redundant prefill compute, drastically lowering Time-To-First-Token.
Schedule ordering (e.g., LPM vs FCFS) fundamentally dictates the cache hit rate and overall system goodput.