Multi-turn Context Sharing

Recomputing the KV cache for identical system prompts across multiple user sessions wastes massive compute.

Source: mortalapps.com
TL;DR
  • Recomputing the KV cache for identical system prompts across multiple user sessions wastes massive compute.
  • RadixAttention indexes the KV cache with a globally managed radix tree whose nodes point at KV tensors residing in GPU memory.
  • Yields reported cache hit rates of 75-95% on multi-turn conversations, eliminating most redundant prefill operations.

Why This Matters

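Complex AI agents rely on large system prompts and tool definitions, often around 10k tokens. In a 50-turn conversation, recomputing that static 10k-token prefix on every turn adds up to 500,000 tokens of redundant prefill. Because the conversation history is also re-prefilled as it grows, total prefill work without caching grows quadratically with the number of turns, and throughput collapses accordingly. Context sharing cuts this repeated compute to nearly zero, multiplying overall serving capacity by up to 5x.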
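The arithmetic behind that claim can be sketched in a few lines. This is a back-of-the-envelope model, not a benchmark: the 200 tokens added per turn is an assumed figure, the function name prefill_tokens is hypothetical, and the cached case assumes an ideal cache that never evicts.

```python
def prefill_tokens(prefix_len: int, turn_len: int, turns: int, cached: bool) -> int:
    """Total tokens pushed through prefill over one whole conversation."""
    total = 0
    history = 0  # tokens from earlier turns accumulated so far
    for turn in range(turns):
        if cached and turn > 0:
            # Prefix and earlier turns are reused; only the new turn is prefilled.
            total += turn_len
        else:
            # Everything is recomputed: static prefix + full history + new turn.
            total += prefix_len + history + turn_len
        history += turn_len
    return total


# Assumed workload: 10k-token prefix, 50 turns, 200 new tokens per turn.
no_cache = prefill_tokens(10_000, 200, 50, cached=False)
with_cache = prefill_tokens(10_000, 200, 50, cached=True)
print(no_cache, with_cache)
```

Under these assumptions the uncached conversation prefills 755,000 tokens, of which 500,000 are the repeated static prefix, while the cached one prefills 20,000.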

Core Intuition

If thousands of customers ask a question starting with "According to the company policy document...", the inference engine should read and memorize the policy document exactly once, store those memories in a public library, and let all of those customers borrow the exact same memory state simultaneously.

Technical Deep Dive

Implemented natively in SGLang, RadixAttention manages the KV cache not as a per-request linear buffer but as a radix tree keyed by token sequences. When a request arrives, match_prefix() runs a longest-prefix match against the tree. On a hit, the engine retrieves the physical device indices of the cached tokens, and new KV memory is provisioned only for the uncached suffix, with ReqToTokenPool recording which token slots the request uses. The attention backend (e.g., FlashInfer) computes attention for the new suffix while reading the shared prefix through those indices, treating it as read-only.
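The lookup-then-allocate flow described above can be illustrated with a toy prefix index. This is a minimal sketch, not SGLang's implementation: it uses an uncompressed token trie (one token per edge) instead of a true radix tree with compressed edges, the class names TrieNode and PrefixCache are hypothetical, and an integer counter stands in for a real KV slot allocator. Only the method name match_prefix is borrowed from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TrieNode:
    slot: int = -1  # KV slot index of the token that leads to this node
    children: Dict[int, "TrieNode"] = field(default_factory=dict)


class PrefixCache:
    """Toy prefix index mapping token sequences to KV slot indices."""

    def __init__(self) -> None:
        self.root = TrieNode()
        self.next_slot = 0  # stand-in for a real KV slot allocator

    def match_prefix(self, tokens: List[int]) -> List[int]:
        """Return the slot indices of the longest cached prefix of `tokens`."""
        node, slots = self.root, []
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            slots.append(child.slot)
            node = child
        return slots

    def insert(self, tokens: List[int]) -> Tuple[List[int], List[int]]:
        """Cache `tokens`; return (reused_slots, newly_allocated_slots)."""
        reused = self.match_prefix(tokens)
        node = self.root
        for tok in tokens[: len(reused)]:
            node = node.children[tok]  # walk down to the end of the cached prefix
        new_slots = []
        for tok in tokens[len(reused):]:
            child = TrieNode(slot=self.next_slot)  # allocate only for the suffix
            self.next_slot += 1
            node.children[tok] = child
            new_slots.append(child.slot)
            node = child
        return reused, new_slots


cache = PrefixCache()
system_prompt = [101, 102, 103, 104]  # made-up token ids
reused_a, new_a = cache.insert(system_prompt + [7, 8])
reused_b, new_b = cache.insert(system_prompt + [9])
print(reused_b, new_b)
```

The second request reuses the four slots of the shared system prompt and allocates a single new slot for its own suffix, which is the behavior the paragraph above attributes to the real engine.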

Key Takeaways

  • Without prefix sharing, serving engines recompute identical prefixes for every request, which sharply reduces throughput.
  • RadixAttention uses a global radix tree to share physical KV memory across separate requests.
  • Attention backends such as FlashInfer compute only the new suffix and read the shared prefix through its cached indices.
  • Reference counting and LRU eviction are needed so that in-use prefixes are never freed and unused ones are reclaimed before HBM runs out of memory.
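How reference counting and LRU eviction interact can be shown with a small sketch. It is deliberately simplified and makes several assumptions: cached prefixes are modeled as flat, named blocks rather than tree nodes, capacity is counted in blocks rather than bytes, and the class EvictableKVBlocks with its methods is hypothetical rather than any real engine's API.

```python
from collections import OrderedDict


class EvictableKVBlocks:
    """Toy pool of cached prefix blocks with ref counts and LRU eviction."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_id -> ref count, least recent first

    def acquire(self, block_id: str) -> bool:
        """Pin a block for a running request; return False if no room is left."""
        if block_id in self.blocks:
            self.blocks[block_id] += 1
            self.blocks.move_to_end(block_id)  # mark as most recently used
            return True
        if len(self.blocks) >= self.capacity and not self._evict_one():
            return False  # every block is pinned, so nothing can be freed
        self.blocks[block_id] = 1
        return True

    def release(self, block_id: str) -> None:
        """Unpin a block; it stays cached and becomes evictable at zero refs."""
        self.blocks[block_id] -= 1

    def _evict_one(self) -> bool:
        """Free the least recently used block that no request is using."""
        for block_id, refs in self.blocks.items():
            if refs == 0:
                del self.blocks[block_id]
                return True  # return immediately after mutating the dict
        return False


pool = EvictableKVBlocks(capacity=2)
pool.acquire("system_prompt")
pool.acquire("chat_a")
blocked = pool.acquire("chat_b")   # both blocks are pinned, so this fails
pool.release("chat_a")
admitted = pool.acquire("chat_b")  # evicts the unpinned "chat_a"
print(blocked, admitted, list(pool.blocks))
```

The pinned system prompt survives eviction while the released block is reclaimed. In a tree-structured cache, eviction would additionally need to remove leaves before their ancestors so that no surviving entry loses its prefix.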