Multi-turn Context Sharing
Recomputing the KV cache for identical system prompts across multiple user sessions wastes massive compute.
Source: mortalapps.com
- RadixAttention maps the KV cache to a globally managed Radix Tree structure residing in GPU memory.
- Yields 75-95% cache hit rates on multi-turn conversations, eliminating redundant prefill operations.
Why This Matters
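Complex AI agents use large system prompts and tool definitions, often around 10k tokens. In a 50-turn conversation, recomputing that static 10k-token prefix on every turn adds up to 500k tokens of redundant prefill. Because the growing conversation history is also re-prefilled each turn, total prefill work grows quadratically with the number of turns, not exponentially. Context sharing removes almost all of this repeated compute, which can multiply overall serving capacity by up to 5x.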
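The arithmetic behind that claim can be checked with a small sketch. It counts prefill tokens only, not end-to-end throughput, and it assumes a hypothetical 200 new tokens appended per turn (user message plus previous reply); that figure and the function name `prefill_tokens` are illustrative, not taken from the source.

```python
def prefill_tokens(prefix_len: int, turns: int, new_per_turn: int, cached: bool) -> int:
    """Count tokens that must be prefilled over a whole conversation."""
    total = 0
    context = prefix_len  # static system prompt and tool definitions
    for _ in range(turns):
        context += new_per_turn
        if cached:
            # Only the uncached suffix is prefilled on each turn.
            total += new_per_turn
        else:
            # The full context (prefix + history) is recomputed on each turn.
            total += context
    if cached:
        total += prefix_len  # the shared prefix is computed exactly once
    return total


# Assumed workload: 10k-token prefix, 50 turns, 200 new tokens per turn.
without_cache = prefill_tokens(10_000, 50, 200, cached=False)
with_cache = prefill_tokens(10_000, 50, 200, cached=True)
print(without_cache, with_cache)  # 755000 20000
```

Under these assumptions the uncached total is dominated by the 50 recomputations of the prefix (500k tokens), while the quadratic term comes from re-reading the growing history.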
Core Intuition
If thousands of customers ask a question starting with "According to the company policy document...", the inference engine should read and memorize the policy document exactly once, store those memories in a public library, and let all of those customers borrow the exact same memory state simultaneously.
Technical Deep Dive
Implemented natively in SGLang, RadixAttention treats the KV cache not as a linear buffer, but as a heavily optimized Radix Tree. When a request arrives, match_prefix() executes a longest-prefix-match against the tree. If found, the engine retrieves the exact physical device indices of the cached tokens. The ReqToTokenPool allocator strictly provisions memory only for the uncached suffix. The attention backend (e.g., FlashInfer) processes the new suffix while securely referencing the read-only shared prefix indices.
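The lookup-then-allocate flow described above can be illustrated with a toy model. This is a minimal sketch, not SGLang's implementation: it uses a plain per-token trie instead of a compressed radix tree, integer counters instead of real GPU memory, and no eviction. The class `ToyPrefixCache` and its method signatures are hypothetical; only the name `match_prefix` is borrowed from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    slot: int = -1  # pretend index of this token's KV entry on the device
    children: Dict[int, "Node"] = field(default_factory=dict)


class ToyPrefixCache:
    """Uncompressed token trie standing in for a radix tree (assumption)."""

    def __init__(self) -> None:
        self.root = Node()
        self.next_slot = 0  # stand-in for a real KV memory allocator

    def match_prefix(self, tokens: List[int]) -> List[int]:
        """Return the slot indices of the longest cached prefix of `tokens`."""
        node, slots = self.root, []
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            slots.append(child.slot)
            node = child
        return slots

    def insert(self, tokens: List[int]) -> List[int]:
        """Cache `tokens`, allocating slots only for the uncached suffix."""
        cached = self.match_prefix(tokens)
        node = self.root
        for tok in tokens[: len(cached)]:
            node = node.children[tok]  # walk the shared, read-only prefix
        new_slots = []
        for tok in tokens[len(cached):]:
            child = Node(slot=self.next_slot)
            self.next_slot += 1
            node.children[tok] = child
            new_slots.append(child.slot)
            node = child
        return new_slots


cache = ToyPrefixCache()
system_prompt = [101, 7, 7, 42]                # made-up token ids
first = cache.insert(system_prompt + [5, 6])   # cold start: 6 new slots
hit = cache.match_prefix(system_prompt + [9])  # reuses the 4 prefix slots
second = cache.insert(system_prompt + [9])     # only 1 new slot is allocated
print(first, hit, second)
```

The second request shares the four prefix slots with the first and pays only for its one-token suffix. A production radix tree would additionally merge single-child chains into one edge and evict unused branches when memory runs low.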