LLM Inference Systems

SLA-Aware Request Scheduling

Replaces First-Come-First-Served (FCFS) with probabilistic, SLA-targeted scheduling mechanisms.

Published June 1, 2026 · By MortalApps · 3 min read · ~559 words

TL;DR

Replaces First-Come-First-Served (FCFS) with probabilistic, SLA-targeted scheduling mechanisms.
Decouples Time-To-First-Token (TTFT) urgency logic (Prefill) from Time-Per-Output-Token (TPOT) packing logic (Decode).
Implements "slack-guided" batching: packing extra requests into the decode batch right up to the strict latency limit.


Why This Matters

In commercial deployments, raw throughput (total tokens per second) says little if 20% of requests miss their Service Level Agreement (SLA) deadlines and are dropped by the client API. What counts is goodput: the work that completes within its latency target. Commercial engines are therefore tuned to hold 99th-percentile (P99) latency bounds, trading some peak theoretical throughput for predictable latency.

Core Intuition

If you have to deliver 5 packages by 5:00 PM, and it's 4:00 PM, you don't simply deliver them in the order they were loaded onto the truck (FCFS). You estimate the drive time to each destination and plan the route so that even the most "in-danger" package arrives by 4:59 PM. The "slack" (spare time before each deadline) is what you spend to optimize the rest of the route.
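The analogy maps onto a least-slack-first ordering. The sketch below is a minimal illustration, not any engine's actual scheduler: the `Request` fields, the `slack` helper, and the fixed `predicted_work_s` estimates are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    deadline_s: float        # absolute time by which the request must be served
    predicted_work_s: float  # estimated service time (assumed given, e.g. by a predictor)


def slack(req: Request, now_s: float) -> float:
    """Time to spare if this request were started right now."""
    return req.deadline_s - now_s - req.predicted_work_s


def order_by_urgency(queue: list[Request], now_s: float) -> list[Request]:
    """Least slack first: the most endangered request is served next."""
    return sorted(queue, key=lambda r: slack(r, now_s))


queue = [
    Request("a", deadline_s=10.0, predicted_work_s=1.0),  # slack 9.0
    Request("b", deadline_s=4.0, predicted_work_s=3.0),   # slack 1.0
    Request("c", deadline_s=6.0, predicted_work_s=2.0),   # slack 4.0
]
ordered = [r.name for r in order_by_urgency(queue, now_s=0.0)]
print(ordered)  # ['b', 'c', 'a']
```

Under FCFS the arrival order `a, b, c` would be used, and `b`, which has only one second to spare, would wait behind `a`.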

Technical Deep Dive

Frameworks like Kairos split SLO handling into two mechanisms, one per inference phase.

On the prefill side, Kairos uses Urgency-Based Priority Scheduling. An ML regression model continuously predicts each request's prefill completion time from its input length, and requests are ranked by urgency to maximize TTFT attainment.

On the decode side, it uses Slack-Guided Adaptive Batching. "Slack" is the gap between the actual hardware execution time of a decode step and the maximum that the TPOT SLO allows. The engine packs as many short requests into the batch as that slack permits, letting the step time grow right up to the threshold, so throughput rises while TPOT stays within its SLO.
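The decode-side packing described above can be sketched as a greedy admission loop. This is an illustration under stated assumptions, not Kairos's implementation: `DecodeRequest`, `predicted_step_ms`, `pack_decode_batch`, and the linear cost model with its `base_ms` and `per_req_ms` constants are all hypothetical. A real engine would use a measured or learned step-time predictor.

```python
from dataclasses import dataclass


@dataclass
class DecodeRequest:
    req_id: str
    tpot_slo_ms: float  # maximum allowed time per output token


def predicted_step_ms(batch_size: int, base_ms: float = 8.0,
                      per_req_ms: float = 0.5) -> float:
    """Hypothetical linear cost model for one decode step (an assumption)."""
    return base_ms + per_req_ms * batch_size


def pack_decode_batch(running: list[DecodeRequest],
                      waiting: list[DecodeRequest]) -> list[DecodeRequest]:
    """Admit waiting requests while the predicted step time fits every SLO."""
    batch = list(running)
    for req in waiting:
        candidate = batch + [req]
        # The tightest TPOT SLO in the batch bounds the step time for everyone.
        tightest_ms = min(r.tpot_slo_ms for r in candidate)
        slack_ms = tightest_ms - predicted_step_ms(len(candidate))
        if slack_ms < 0:
            continue  # admitting this request would break an SLO; try the next one
        batch = candidate
    return batch


running = [DecodeRequest(f"r{i}", tpot_slo_ms=20.0) for i in range(2)]
waiting = [DecodeRequest(f"w{i}", tpot_slo_ms=12.0) for i in range(10)]
batch = pack_decode_batch(running, waiting)
print(len(batch), predicted_step_ms(len(batch)))  # 8 12.0
```

With these numbers the batch grows from 2 to 8 requests: the predicted step time reaches exactly the 12 ms limit, and a ninth request would push it to 12.5 ms, so the remaining requests stay queued.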

Key Takeaways

Commercial viability depends on strict P99 latency bounds, not just raw throughput.

Kairos splits logic into TTFT urgency scheduling and TPOT slack packing.

Slack-guided batching grows the decode batch until the step time reaches the TPOT SLA limit.

Hardware integration (GH200 NVLink-C2C) allows instant rotation of low-priority tasks.
