← Infrastructure Transformer Systems
Infrastructure

Prefix Tree KV Cache Eviction

When the GPU memory fills up, the Radix Tree managing the KV Cache must evict nodes to make room for new tokens without destroying valuable context.

Source: mortalapps.com
TL;DR
  • When the GPU memory fills up, the Radix Tree managing the KV Cache must evict nodes to make room for new tokens without destroying valuable context.
  • SGLang tracks active usage via a lock_ref reference counter on each tree node, protecting in-use memory from being pruned.
  • Default Least Recently Used (LRU) policies can accidentally evict highly valuable system prompts, causing severe latency spikes upon recomputation.
  • Advanced implementations use Priority-based eviction, protecting shared prefixes (high priority) while aggressively pruning ephemeral user queries and dead reasoning branches.

Why This Matters

Under massive API load, KV cache memory is entirely saturated almost instantly. The decision of what to delete dictates the entire latency profile of the inference engine. If an engine evicts a,000-token system prompt used by 100 concurrent agents just because it wasn't the most recently used block, the next request requires an massive, unnecessary prefill compute penalty. Intelligent eviction transforms the KV cache from a dumb FIFO memory buffer into a highly tuned hierarchical state machine.

Core Intuition

Imagine a library running out of shelf space. LRU eviction is like throwing away the book that hasn't been checked out recently. However, if the library throws away the master index catalog just because no one looked at it in the last five minutes, the entire library halts when someone needs it. Reference counting acts as a "currently in use" tag, and Priority scheduling acts as a "do not destroy" tag for the master catalog. By prioritizing the eviction of user-specific temporary thoughts over shared foundational instructions, the system maintains high throughput.

Technical Deep Dive

SGLang maintains a lock_ref counter on each TreeNode. The method inc_lock_ref increments along the path to the root. Any node with lock_ref > 0 is actively being queried or generated against and is completely protected from eviction. Conceptually, nodes are grouped into zones: Zone 1 (shared, lock_ref > 1) represents system prompts, while Zone 2 (private, lock_ref <= 1) represents user-specific context. When serving reasoning models (like DeepSeek-R1), generated <think> tokens are inserted into the radix tree. In multi-turn chats, the user drops the thinking tokens in subsequent prompts. This strands the thinking KV entries as "dead branches" that will never be matched again, permanently wasting gigabytes of VRAM until an eviction policy finally cleans them up.

Key Takeaways

SGLang RadixCache utilizes reference counting (lock_ref) to protect active nodes from eviction.
Default LRU policies are insufficient for agentic workloads and multi-turn reasoning workflows.
Unused reasoning tokens (<think>) create "dead branches" that rapidly and permanently exhaust VRAM.
Priority-based eviction protects high-value system prompts and accelerates the destruction of ephemeral dead branches.