Cache-Aware Scheduling Policies
Standard round-robin load balancers destroy prefix caching efficiency by scattering identical requests across random hardware.
Source: mortalapps.com
- Cache-aware routers intercept each query, hash its prompt prefix, and route it to the GPU node that already holds the matching cached state.
- The gateway keeps an approximate radix tree of what each worker has cached.
- Higher cache hit rates mean less repeated prefill work and higher cluster throughput.
Why This Matters
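If a user sends 10 requests that share the same massive agentic system prompt, and a standard load balancer spreads them across 10 different GPU instances, the cluster computes and stores the exact same KV cache 10 times. For a 10 GB cache, that is 100 GB of GPU memory holding a single prefix. This wastes both memory and prefill compute across the cluster. Routing those requests to the instance that already holds the cache avoids the repeated work without adding hardware, and overall cluster throughput can be as much as doubled.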
Core Intuition
If a customer calls a helpline and asks for "Agent Smith," the switchboard shouldn't route them to a random agent who has to spend 10 minutes reading their file. The switchboard identifies the request and routes it directly to Agent Smith, who already has the file open (cached) on their desk, allowing instant resolution.
Technical Deep Dive
The gateway router maintains a lightweight "approximate replica" of the radix trees that live on the downstream worker nodes. The replica is updated asynchronously, so it can lag behind what each worker actually has cached. When a request arrives, the router inspects the prompt and computes a stable cryptographic hash of its prefix (e.g., blake2b over a window of the leading tokens, such as the first 256 or 4,000). The router then queries its approximate tree to estimate the prefix match rate on each worker. Instead of round-robin distribution, it sends the request to the Data Parallel (DP) worker rank with the highest KV cache overlap.
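The routing decision described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any particular serving framework: `block_hashes`, `CacheAwareRouter`, the worker names, and the 4-token block size are all hypothetical. It also swaps the radix tree for a flat map from chained prefix hashes to workers. Because each hash covers everything before it, that map behaves like a trie lookup. The router falls back to round-robin only when no worker has any overlap.

```python
import hashlib
from typing import Dict, List, Sequence, Set

BLOCK = 4  # tokens per hashed block (hypothetical; tiny so the demo is readable)


def block_hashes(tokens: Sequence[int], block: int = BLOCK) -> List[bytes]:
    """Return one chained blake2b digest per full block of tokens.

    Each digest also covers the previous digest, so digest i identifies
    the entire prefix up to and including block i.
    """
    hashes: List[bytes] = []
    prev = b""
    full = len(tokens) - len(tokens) % block  # ignore a trailing partial block
    for i in range(0, full, block):
        h = hashlib.blake2b(digest_size=16)
        h.update(prev)
        h.update(b",".join(str(t).encode() for t in tokens[i:i + block]))
        prev = h.digest()
        hashes.append(prev)
    return hashes


class CacheAwareRouter:
    """Gateway-side approximation of which worker has cached which prefix."""

    def __init__(self, workers: List[str]) -> None:
        self.workers = list(workers)
        self.index: Dict[bytes, Set[str]] = {}  # prefix hash -> workers
        self._rr = 0  # round-robin cursor for the fallback path

    def match_lengths(self, tokens: Sequence[int]) -> Dict[str, int]:
        """Number of leading blocks each worker is believed to have cached."""
        matched = {w: 0 for w in self.workers}
        alive = set(self.workers)
        for depth, h in enumerate(block_hashes(tokens), start=1):
            alive &= self.index.get(h, set())
            if not alive:
                break
            for w in alive:
                matched[w] = depth
        return matched

    def record(self, worker: str, tokens: Sequence[int]) -> None:
        """Optimistically assume the worker now caches this prompt's prefix."""
        for h in block_hashes(tokens):
            self.index.setdefault(h, set()).add(worker)

    def route(self, tokens: Sequence[int]) -> str:
        matched = self.match_lengths(tokens)
        if max(matched.values()) == 0:
            # No overlap anywhere: fall back to plain round-robin.
            worker = self.workers[self._rr % len(self.workers)]
            self._rr += 1
        else:
            # Pick the worker with the longest believed prefix match.
            worker = max(self.workers, key=lambda w: matched[w])
        self.record(worker, tokens)
        return worker


router = CacheAwareRouter(["gpu-0", "gpu-1", "gpu-2"])
system_prompt = list(range(16))  # stand-in for a long shared system prompt
first = router.route(system_prompt + [100, 101, 102, 103])
second = router.route(system_prompt + [200, 201, 202, 203])
unrelated = router.route(list(range(900, 908)))
print(first, second, unrelated)
```

The two requests that share the system prompt land on the same worker, while the unrelated prompt takes the round-robin path. A production router would also need to handle cache eviction on the workers and weigh cache overlap against current load, both of which this sketch ignores.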