Token-Level Runtime Scheduling

Token-level scheduling allows inference engines to preempt, swap, and reprioritize requests after any generated token.

Source: mortalapps.com
TL;DR
  • Token-level scheduling allows inference engines to preempt, swap, and reprioritize requests after any generated token.
  • It utilizes Multi-Level Feedback Queues (MLFQ) with skip-join mechanics to prevent head-of-line blocking.
  • It relies heavily on moving KV cache state between GPU HBM and CPU DRAM so that preempted requests release GPU memory.
  • It addresses the problem that the generation length of an autoregressive model is not known in advance.

Why This Matters

First-Come-First-Served (FCFS) scheduling causes severe head-of-line blocking. If they share a queue, a long sequence requiring thousands of tokens will delay a small 10-token "yes/no" query until the long sequence finishes. Shortest Remaining Processing Time (SRPT) scheduling would avoid this, but it needs to know each job's remaining work, and an LLM's output length is not known in advance, so standard SRPT schedulers cannot be applied directly. Token-level preemption keeps latency low for short queries without permanently halting long computations.
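To make the blocking concrete, here is a minimal sketch that computes finish times under FCFS. It assumes one request decodes at a time and every token costs one step, which ignores batching in real engines. The function name `fcfs_finish_times` and the 5,000-token length are illustrative choices, not values from the source.

```python
def fcfs_finish_times(output_lengths):
    """Return the decoding step at which each request finishes under FCFS.

    Simplifying assumption: one request runs at a time, one step per token.
    """
    finish, clock = [], 0
    for n_tokens in output_lengths:
        clock += n_tokens          # the request runs to completion
        finish.append(clock)
    return finish


# A hypothetical 5,000-token generation queued ahead of a 10-token answer.
long_job, short_job = 5000, 10
print(fcfs_finish_times([long_job, short_job]))  # [5000, 5010]
print(fcfs_finish_times([short_job, long_job]))  # [10, 5010]
```

Under these assumptions the short query finishes at step 5010 instead of step 10 purely because of arrival order, while the long request finishes at step 5010 in either order.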

Core Intuition

Think of token-level scheduling like a modern operating system context-switching CPU threads. Instead of running a whole process to completion, the engine runs a request for a "time slice" (a set token budget). If the request exceeds the budget, it is preempted, its memory state is saved, and a shorter job is given execution time.
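The time-slice analogy can be sketched as a round-robin loop with a per-turn token budget. This is a toy model under stated assumptions: one step per token, no batching, and no cost for saving or restoring state. The name `run_with_token_budget` and the budget of 32 tokens are hypothetical.

```python
from collections import deque


def run_with_token_budget(requests, budget):
    """Round-robin decoding where each turn emits at most `budget` tokens.

    `requests` maps a request id to its total output length. The scheduler
    only learns that length when the request finishes.
    Returns (request_id, finish_step) pairs in completion order.
    """
    remaining = dict(requests)
    queue = deque(requests)            # request ids in arrival order
    clock, finished = 0, []
    while queue:
        rid = queue.popleft()
        emitted = min(budget, remaining[rid])
        clock += emitted               # one step per generated token
        remaining[rid] -= emitted
        if remaining[rid] == 0:
            finished.append((rid, clock))
        else:
            queue.append(rid)          # budget exhausted: preempt and requeue
    return finished


print(run_with_token_budget({"long": 5000, "short": 10}, budget=32))
# [('short', 42), ('long', 5010)]
```

The short request now finishes at step 42 rather than waiting for all 5,000 tokens of the long one, and the long request still completes at the same step as it would under FCFS in this cost-free model.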

Technical Deep Dive

Frameworks like FastServe implement a skip-join MLFQ. Because the input prompt length is known when a request arrives, the scheduler can estimate the cost of processing the prompt and place a request with a massive prompt directly into a lower-priority queue, "skipping" the highest-priority queues it would quickly exhaust anyway. During execution, a request that generates more tokens than its current queue's budget allows is demoted to a lower-priority queue. The KV cache of the preempted request is swapped from GPU HBM to CPU memory over PCIe/NVLink, and its execution is paused until the scheduler selects it again and its cache is copied back to the GPU.
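The following is a minimal sketch of the skip-join idea, not FastServe's actual implementation. It assumes all requests arrive at once, uses prompt length in tokens as a stand-in for prompt-processing cost, and only records where the KV cache would live instead of moving any memory. The `QUANTA` values, the `Request` fields, and the function names are hypothetical, and starvation handling is omitted.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical per-level token budgets, highest priority first.
QUANTA = [16, 64, 256]


@dataclass
class Request:
    rid: str
    prompt_len: int
    output_len: int            # hidden from the scheduler; simulates end of sequence
    generated: int = 0
    kv_location: str = "none"  # "none", "gpu", "cpu", or "freed"


def initial_level(prompt_len):
    """Skip-join: enter the first level whose budget covers the prompt cost."""
    for level, quantum in enumerate(QUANTA):
        if prompt_len <= quantum:
            return level
    return len(QUANTA) - 1


def schedule(requests):
    """Run all requests to completion; return a (rid, level, event) trace."""
    queues = [deque() for _ in QUANTA]
    for req in requests:
        queues[initial_level(req.prompt_len)].append(req)
    trace = []
    while any(queues):
        # Always serve the highest-priority non-empty queue.
        level = next(i for i, q in enumerate(queues) if q)
        req = queues[level].popleft()
        req.kv_location = "gpu"        # allocate, or swap the KV cache back in
        tokens = min(QUANTA[level], req.output_len - req.generated)
        req.generated += tokens
        if req.generated == req.output_len:
            req.kv_location = "freed"
            trace.append((req.rid, level, "done"))
        else:
            req.kv_location = "cpu"    # budget exhausted: swap the KV cache out
            lower = min(level + 1, len(QUANTA) - 1)
            queues[lower].append(req)  # demote (or requeue at the lowest level)
            trace.append((req.rid, level, "preempted"))
    return trace


jobs = [Request("big", 200, 300), Request("chat", 8, 10), Request("talker", 8, 100)]
for event in schedule(jobs):
    print(event)
```

In this run "big" joins the lowest-priority level directly because of its 200-token prompt, "chat" finishes within its first budget at the top level, and "talker" starts at the top level but is demoted twice as it keeps generating.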

Key Takeaways

  • LLM output lengths are unpredictable, so standard SRPT schedulers, which need each job's remaining work, cannot be applied directly.
  • Token-level preemption treats generation iterations as interruptible time slices, much like CPU scheduling.
  • Skip-join MLFQ uses the known input length to set a request's initial priority.
  • Preemption only frees HBM if the paused request's KV cache is swapped to host DRAM, so that swap must be efficient.