Token-Level Runtime Scheduling
Token-level scheduling allows inference engines to preempt, swap, and reprioritize requests after any generated token.
Source: mortalapps.com
- Token-level scheduling allows inference engines to preempt, swap, and reprioritize requests after any generated token.
- It utilizes Multi-Level Feedback Queues (MLFQ) with skip-join mechanics to prevent head-of-line blocking.
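- It relies heavily on moving KV cache state between GPU HBM and CPU DRAM so that preempted requests can be paused and resumed according to priority.
- It addresses the indeterminate generation length of autoregressive models: the output length of a request is not known until generation finishes.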
Why This Matters
First-Come-First-Served (FCFS) scheduling causes severe head-of-line blocking. If they share a queue, a long sequence requiring thousands of tokens will delay a small 10-token "yes/no" query that arrives behind it. Because the output length of an LLM request cannot be known in advance, schedulers such as Shortest Remaining Processing Time (SRPT), which depend on that information, cannot be applied directly. Token-level preemption keeps latency low for short queries without permanently halting long computations.
Core Intuition
Think of token-level scheduling like a modern operating system context-switching CPU threads. Instead of running a whole process to completion, the engine runs a request for a "time slice" (a set token budget). If the request exceeds the budget, it is preempted, its memory state is saved, and a shorter job is given execution time.
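The time-slice idea can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions, not an engine implementation: one generated token costs one time unit, prompt processing and swap overhead are ignored, and the names `fcfs`, `token_sliced`, and `budget` are hypothetical.

```python
from collections import deque

def fcfs(jobs):
    """Run each job to completion in arrival order; return finish time per job."""
    clock, finish = 0, {}
    for name, tokens in jobs:
        clock += tokens
        finish[name] = clock
    return finish

def token_sliced(jobs, budget):
    """Round-robin: a job runs for at most `budget` tokens, then is preempted."""
    queue = deque(jobs)
    clock, finish = 0, {}
    while queue:
        name, remaining = queue.popleft()
        run = min(budget, remaining)
        clock += run
        remaining -= run
        if remaining:
            # Preempted: a real engine would save the KV cache state here.
            queue.append((name, remaining))
        else:
            finish[name] = clock
    return finish

# The long job arrives first, the short one right behind it.
jobs = [("long", 1000), ("short", 10)]
print(fcfs(jobs))              # {'long': 1000, 'short': 1010}
print(token_sliced(jobs, 32))  # {'short': 42, 'long': 1010}
```

In this toy setup the short query finishes at time 42 instead of 1010, while the long job finishes at the same time as before because swap costs are not modeled.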
Technical Deep Dive
Frameworks like FastServe implement a skip-join MLFQ. Because the input prompt length is known when a request arrives, the scheduler "skips" the highest-priority queues for requests with very long prompts, anticipating their heavy compute footprint, and places them directly in a lower-priority queue. During execution, if a request generates more tokens than its current queue allows, it is demoted to a lower-priority queue. The KV cache of the preempted request is swapped from GPU HBM to CPU memory over PCIe/NVLink, and its execution is paused until the scheduler selects it again.
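The mechanics described above can be sketched as a toy scheduler. This is an illustration under stated assumptions, not FastServe's actual code: the `LEVELS` thresholds and quanta, the `Request` fields, and the `SkipJoinMLFQ` class are all hypothetical, the KV cache move is modeled as a simple location flag, and every preemption swaps the cache out for simplicity.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical levels: (max prompt length allowed to join, token quantum).
LEVELS = [(128, 4), (1024, 8), (float("inf"), 16)]

@dataclass
class Request:
    name: str
    prompt_len: int
    remaining: int             # output tokens left; a real engine only learns this at EOS
    kv_location: str = "none"  # "none", "gpu", or "cpu"

class SkipJoinMLFQ:
    def __init__(self, levels=LEVELS):
        self.levels = levels
        self.queues = [deque() for _ in levels]

    def submit(self, req):
        # Skip-join: long prompts bypass the highest-priority queues.
        for level, (max_prompt, _) in enumerate(self.levels):
            if req.prompt_len <= max_prompt:
                self.queues[level].append(req)
                return level

    def step(self):
        """Run the highest-priority waiting request for one quantum."""
        for level, queue in enumerate(self.queues):
            if not queue:
                continue
            req = queue.popleft()
            req.kv_location = "gpu"        # swap in (or allocate) the KV cache
            quantum = self.levels[level][1]
            generated = min(quantum, req.remaining)
            req.remaining -= generated
            if req.remaining == 0:
                req.kv_location = "none"   # finished, cache freed
                return req.name, generated, True
            # Quantum exhausted: swap the KV cache out and demote one level.
            req.kv_location = "cpu"
            lower = min(level + 1, len(self.queues) - 1)
            self.queues[lower].append(req)
            return req.name, generated, False
        return None

sched = SkipJoinMLFQ()
joined = [
    sched.submit(Request("chat", prompt_len=20, remaining=3)),
    sched.submit(Request("summary", prompt_len=4000, remaining=40)),
    sched.submit(Request("story", prompt_len=50, remaining=30)),
]
trace = []
while (result := sched.step()) is not None:
    trace.append(result)
finished = [name for name, _, done in trace if done]
print(joined)    # [0, 2, 0]
print(finished)  # ['chat', 'story', 'summary']
```

The short "chat" request completes within its first quantum, "story" is demoted as it keeps generating, and "summary" starts in the lowest-priority queue because of its long prompt. A production scheduler would also need starvation prevention, which this sketch omits.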