Cache-Aware Scheduling Policies

Standard round-robin load balancers undermine prefix caching by scattering requests that share a prompt prefix across different GPUs.

Source: mortalapps.com
TL;DR
  • Round-robin load balancing spreads requests with identical prefixes across workers, so each worker recomputes the same KV cache.
  • Cache-aware routers intercept queries, hash their prefixes, and route them to the specific GPU node holding the cached state.
  • The gateway keeps an approximate radix tree that mirrors what each worker has cached.
  • Higher cache hit rates mean fewer repeated prefill computations and higher cluster throughput.

Why This Matters

Suppose the same large agentic system prompt arrives 10 times and a standard load balancer sends each request to a different GPU instance. The cluster then computes and stores the same KV cache 10 separate times; at 10 GB per copy, that is 100 GB of GPU memory and 10 prefill passes for work one instance could have done once. Routing those requests to the instance that already holds the cache removes the redundancy, which can as much as double overall cluster throughput without additional hardware.
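
The arithmetic behind this example can be written out as a short sketch. The function name `prefix_cost` and its parameters are illustrative, and the model is deliberately simple: it assumes every worker that receives the prompt computes the prefix exactly once and then reuses it.

```python
def prefix_cost(requests: int, workers_touched: int, prefix_gb: float) -> dict:
    """Cost of serving `requests` identical prefixes spread over
    `workers_touched` distinct workers (illustrative model only)."""
    prefills = workers_touched        # each worker computes the prefix once
    hits = requests - prefills        # every other request reuses a cached copy
    return {
        "kv_memory_gb": prefix_gb * workers_touched,
        "prefix_prefills": prefills,
        "cache_hit_rate": hits / requests,
    }


# Round-robin: 10 requests land on 10 different workers.
scattered = prefix_cost(requests=10, workers_touched=10, prefix_gb=10.0)
# Cache-aware: all 10 requests land on the one worker holding the cache.
routed = prefix_cost(requests=10, workers_touched=1, prefix_gb=10.0)

print(scattered)  # 100 GB of KV cache, 10 prefills, hit rate 0.0
print(routed)     # 10 GB of KV cache, 1 prefill, hit rate 0.9
```

Under these assumptions the routed case uses a tenth of the memory and performs a tenth of the prefix prefills.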

Core Intuition

If a customer calls a helpline and asks for "Agent Smith," the switchboard shouldn't route them to a random agent who has to spend 10 minutes reading their file. The switchboard identifies the request and routes it directly to Agent Smith, who already has the file open (cached) on their desk, allowing instant resolution.

Technical Deep Dive

The gateway router maintains a lightweight "approximate replica" of the radix trees that index the KV cache on each downstream worker node. The replica is updated asynchronously, so it can lag behind the workers' true state, but routing never blocks on a synchronous lookup. When a request arrives, the router hashes the prompt prefix with a stable cryptographic hash, one that yields the same value across processes and restarts (e.g., blake2b over a window of the first 256 or 4000 tokens). It then looks the result up in its replica to estimate how much of the prefix each worker already holds. Instead of taking the next worker in round-robin order, it sends the request to the data-parallel (DP) rank with the highest estimated KV cache overlap.
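
The routing step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any particular router: it replaces the radix tree with a flat set of chained block hashes per worker, which gives the same longest-prefix answer at block granularity. The names `CacheAwareRouter`, `prefix_block_hashes`, and `BLOCK_TOKENS`, the 16-token block size, and the round-robin fallback for prompts no worker has seen are all hypothetical choices made for the example.

```python
import hashlib
from itertools import count

BLOCK_TOKENS = 16  # hypothetical block size, chosen for illustration


def prefix_block_hashes(token_ids, block_tokens=BLOCK_TOKENS, max_tokens=256):
    """Return chained blake2b digests, one per full block of the prompt prefix.

    Each digest covers its own block plus the previous digest, so a digest
    identifies the whole prefix up to and including that block.
    """
    window = token_ids[:max_tokens]
    full = len(window) - len(window) % block_tokens  # drop a trailing partial block
    hashes, parent = [], b""
    for start in range(0, full, block_tokens):
        h = hashlib.blake2b(digest_size=16)
        h.update(parent)
        h.update(b"".join(t.to_bytes(4, "little") for t in window[start:start + block_tokens]))
        parent = h.digest()
        hashes.append(parent)
    return hashes


class CacheAwareRouter:
    """Gateway-side approximate view of which prefix blocks each worker holds."""

    def __init__(self, workers):
        self.workers = list(workers)
        self.cached = {w: set() for w in self.workers}
        self._rr = count()  # round-robin counter for cold prompts

    def record_cached(self, worker, block_hashes):
        self.cached[worker].update(block_hashes)

    def record_evicted(self, worker, block_hashes):
        # In a real system, workers would report evictions asynchronously.
        self.cached[worker].difference_update(block_hashes)

    def overlap(self, worker, block_hashes):
        """Number of leading blocks the worker is believed to have cached."""
        n = 0
        for h in block_hashes:
            if h not in self.cached[worker]:
                break
            n += 1
        return n

    def route(self, token_ids):
        hashes = prefix_block_hashes(token_ids)
        best = max(self.workers, key=lambda w: self.overlap(w, hashes))
        if self.overlap(best, hashes) == 0:
            # No worker has any of this prefix: fall back to round-robin.
            best = self.workers[next(self._rr) % len(self.workers)]
        # Optimistically assume the chosen worker will now cache these blocks.
        self.record_cached(best, hashes)
        return best


router = CacheAwareRouter(["gpu-0", "gpu-1", "gpu-2"])
system_prompt = list(range(64))  # stand-in token IDs for a shared system prompt
first = router.route(system_prompt + [1001, 1002])
second = router.route(system_prompt + [2001, 2002, 2003])
print(first, second)  # both requests go to the same worker
```

Because the router updates its view optimistically and only learns about evictions later, its picture of each worker can be stale. That is the sense in which the replica is approximate: a wrong guess costs one recomputed prefill but never blocks routing.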

Key Takeaways

  • Standard load balancers destroy prefix cache hit rates in LLM clusters.
  • Cache-aware routers use stable hashing to identify prompts and direct traffic to the hardware already holding the cache.
  • Routers maintain lightweight, approximate proxy trees to prevent synchronous blocking.
  • Advanced pooling (TokenLake) is required to prevent "heavy hitter" prompts from overloading single nodes.