SLA-Aware Request Scheduling
Replaces First-Come-First-Served (FCFS) with probabilistic, SLA-targeted scheduling mechanisms.
Source: mortalapps.com
- Replaces First-Come-First-Served (FCFS) with probabilistic, SLA-targeted scheduling mechanisms.
- Decouples Time-To-First-Token (TTFT) urgency logic (Prefill) from Time-Per-Output-Token (TPOT) packing logic (Decode).
- Implements "slack-guided" batching: packs extra requests into the decode batch right up to the strict latency limit.
Why This Matters
In commercial deployments, raw throughput (total tokens per second) means little if 20% of requests violate their Service Level Agreement (SLA) timeouts and are dropped by the client API. The metric that matters is goodput: the portion of that work delivered within the SLA. Commercial engines are therefore tuned to hold 99th-percentile (P99) latency bounds, trading some peak theoretical throughput for predictable latency.
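As a rough illustration of why the headline tokens-per-second number can mislead, the sketch below computes raw throughput next to SLA attainment and P99 latency. The latencies, the 2-second SLA, the constant output length, and the helper names `p99` and `sla_attainment` are all made-up values for the example, not measurements from any engine.

```python
import math

def p99(latencies):
    """Nearest-rank 99th percentile of a list of latencies (seconds)."""
    ordered = sorted(latencies)
    rank = math.ceil(0.99 * len(ordered))  # 1-based nearest-rank index
    return ordered[rank - 1]

def sla_attainment(latencies, sla_seconds):
    """Fraction of requests that finished within the SLA."""
    met = sum(1 for t in latencies if t <= sla_seconds)
    return met / len(latencies)

# Hypothetical end-to-end latencies for 10 requests (seconds).
latencies = [0.8, 0.9, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 3.5, 4.0]
tokens_per_request = 200   # assumed: every request produces the same output length
wall_clock_seconds = 5.0   # assumed measurement window

raw_throughput = len(latencies) * tokens_per_request / wall_clock_seconds
attainment = sla_attainment(latencies, sla_seconds=2.0)
# Valid only because output length is constant across requests in this toy setup.
useful_throughput = raw_throughput * attainment

print(f"raw throughput:    {raw_throughput:.0f} tokens/s")
print(f"SLA attainment:    {attainment:.0%}")
print(f"useful throughput: {useful_throughput:.0f} tokens/s")
print(f"P99 latency:       {p99(latencies):.1f} s")
```

In this toy run the engine reports 400 tokens/s, but two of the ten requests miss the 2-second SLA, so only 80% of that work is usable and the P99 latency sits at 4.0 s.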
Core Intuition
If you have to deliver 5 packages by 5:00 PM, and it's 4:00 PM, you don't simply deliver them in the order they were loaded onto the truck (FCFS). You estimate the drive time to each destination and order the route so that the most "in-danger" package still arrives by 4:59 PM. Whatever "slack" (extra time) remains is what you spend on optimizing the rest of the route.
Technical Deep Dive
Frameworks like Kairos split SLO handling into two mechanisms, one per inference phase. On the prefill side, Kairos uses Urgency-Based Priority Scheduling: a regression model continuously predicts each request's prefill completion time from its input length, and requests are ranked by urgency to maximize TTFT attainment. On the decode side, it uses Slack-Guided Adaptive Batching. "Slack" is the gap between the actual per-step execution time on the hardware and the maximum allowed by the TPOT SLO. The engine packs as many additional short requests into the decode batch as that slack allows, letting the step time grow up to, but not past, the TPOT limit. This raises throughput while keeping requests within their TPOT constraint.
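The two mechanisms can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Kairos's actual implementation: the linear prefill-time predictor stands in for the learned regression model, the linear decode cost model is invented, and every name and constant (`predict_prefill_ms`, `ttft_slack_ms`, `order_by_urgency`, `decode_step_ms`, `max_batch_within_slo`) is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Request:
    req_id: str
    prompt_tokens: int
    arrival_ms: float
    ttft_slo_ms: float

def predict_prefill_ms(prompt_tokens: int) -> float:
    """Stand-in for a learned regression model (assumed linear in prompt length)."""
    return 5.0 + 0.05 * prompt_tokens

def ttft_slack_ms(req: Request, now_ms: float) -> float:
    """Time left before the TTFT deadline once predicted prefill time is subtracted."""
    deadline_ms = req.arrival_ms + req.ttft_slo_ms
    return deadline_ms - now_ms - predict_prefill_ms(req.prompt_tokens)

def order_by_urgency(queue, now_ms):
    """Prefill side: least slack first, instead of arrival order (FCFS)."""
    return sorted(queue, key=lambda r: ttft_slack_ms(r, now_ms))

def decode_step_ms(batch_size: int) -> float:
    """Assumed cost model: fixed overhead plus a per-request increment."""
    return 10.0 + 0.5 * batch_size

def max_batch_within_slo(tpot_slo_ms: float, hard_cap: int = 256) -> int:
    """Decode side: grow the batch while the predicted step time fits the TPOT SLO."""
    batch_size = 0
    while batch_size < hard_cap and decode_step_ms(batch_size + 1) <= tpot_slo_ms:
        batch_size += 1
    return batch_size

# Three queued requests, listed in arrival (FCFS) order.
now_ms = 100.0
queue = [
    Request("a", prompt_tokens=200, arrival_ms=0.0, ttft_slo_ms=500.0),
    Request("b", prompt_tokens=4000, arrival_ms=50.0, ttft_slo_ms=300.0),
    Request("c", prompt_tokens=1000, arrival_ms=90.0, ttft_slo_ms=200.0),
]

prefill_order = [r.req_id for r in order_by_urgency(queue, now_ms)]
print(prefill_order)                # ['b', 'c', 'a']
print(max_batch_within_slo(50.0))   # 80
```

Request "b" arrived second but has only 45 ms of slack (long prompt, tight deadline), so it is scheduled ahead of "a", which FCFS would have served first. On the decode side, with a 50 ms TPOT limit the assumed cost model admits 80 requests per step; an 81st would push the step to 50.5 ms and break the SLO.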