Prefix Tree KV Cache Eviction
When the GPU memory fills up, the Radix Tree managing the KV Cache must evict nodes to make room for new tokens without destroying valuable context.
Source: mortalapps.com- When the GPU memory fills up, the Radix Tree managing the KV Cache must evict nodes to make room for new tokens without destroying valuable context.
- SGLang tracks active usage via a lock_ref reference counter on each tree node, protecting in-use memory from being pruned.
- Default Least Recently Used (LRU) policies can accidentally evict highly valuable system prompts, causing severe latency spikes upon recomputation.
- Advanced implementations use Priority-based eviction, protecting shared prefixes (high priority) while aggressively pruning ephemeral user queries and dead reasoning branches.
Why This Matters
Under massive API load, KV cache memory is entirely saturated almost instantly. The decision of what to delete dictates the entire latency profile of the inference engine. If an engine evicts a,000-token system prompt used by 100 concurrent agents just because it wasn't the most recently used block, the next request requires an massive, unnecessary prefill compute penalty. Intelligent eviction transforms the KV cache from a dumb FIFO memory buffer into a highly tuned hierarchical state machine.
Core Intuition
Imagine a library running out of shelf space. LRU eviction is like throwing away the book that hasn't been checked out recently. However, if the library throws away the master index catalog just because no one looked at it in the last five minutes, the entire library halts when someone needs it. Reference counting acts as a "currently in use" tag, and Priority scheduling acts as a "do not destroy" tag for the master catalog. By prioritizing the eviction of user-specific temporary thoughts over shared foundational instructions, the system maintains high throughput.
Technical Deep Dive
SGLang maintains a lock_ref counter on each TreeNode. The method inc_lock_ref increments along the path to the root. Any node with lock_ref > 0 is actively being queried or generated against and is completely protected from eviction. Conceptually, nodes are grouped into zones: Zone 1 (shared, lock_ref > 1) represents system prompts, while Zone 2 (private, lock_ref <= 1) represents user-specific context. When serving reasoning models (like DeepSeek-R1), generated <think> tokens are inserted into the radix tree. In multi-turn chats, the user drops the thinking tokens in subsequent prompts. This strands the thinking KV entries as "dead branches" that will never be matched again, permanently wasting gigabytes of VRAM until an eviction policy finally cleans them up.