Continuous Batching Systems

Continuous batching (iteration-level scheduling) admits new requests into the GPU execution batch at the first iteration boundary after a sequence completes, instead of waiting for the whole batch to drain.

Source: mortalapps.com
TL;DR
  • Continuous batching (iteration-level scheduling) refills a freed batch slot with a waiting request before the next forward pass.
  • It solves the severe hardware underutilization that static batching causes by making every sequence wait for the longest one to finish.
  • It relies on PagedAttention to eliminate KV-cache memory fragmentation.
  • It maintains high arithmetic intensity by keeping the batch filled up to the max_num_seqs and max_num_batched_tokens limits.
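
The two limits named in the last bullet can be read as per-iteration budgets: one on the number of sequences and one on the number of tokens processed in a single forward pass. The sketch below is a simplified illustration of that idea, not the scheduler of any particular engine. The `Request` class and the `admit` function are hypothetical names, and the FIFO policy and the "one decode token per running sequence" accounting are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    prompt_tokens: int  # tokens that must be prefilled when the request is admitted


def admit(running_decode, waiting, max_num_seqs, max_num_batched_tokens):
    """Pick waiting requests (FIFO) that fit into this iteration's budgets.

    Hypothetical helper: assumes each running sequence uses one slot and
    contributes exactly one decode token to the next forward pass.
    """
    admitted = []
    seqs = running_decode
    tokens = running_decode
    for req in waiting:
        if seqs + 1 > max_num_seqs:
            break  # sequence budget exhausted
        if tokens + req.prompt_tokens > max_num_batched_tokens:
            break  # token budget exhausted; FIFO, so do not skip ahead
        admitted.append(req)
        seqs += 1
        tokens += req.prompt_tokens
    return admitted


queue = [Request("a", 100), Request("b", 400), Request("c", 10)]
picked = admit(running_decode=6, waiting=queue,
               max_num_seqs=8, max_num_batched_tokens=512)
print([r.req_id for r in picked])  # ['a', 'b'] (request c is blocked by max_num_seqs)
```

Stopping at the first request that does not fit keeps arrival order fair; a production scheduler might instead skip ahead or split a long prompt across iterations.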

Why This Matters

In production, request output lengths follow a long-tailed distribution. If an engine statically batches a 1,000-token output with a 10-token output, the shorter request finishes after 10 iterations, yet its batch slot and KV-cache memory stay reserved and idle for the remaining 990 iterations. Continuous batching can increase real-world cluster throughput by up to 40x compared to naive synchronous execution, which translates directly into lower GPU capital expenditure.
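
The 990-iteration figure can be checked with a few lines of arithmetic. The sketch below assumes one generated token per sequence per iteration and counts "slot-iterations" (one batch slot for one forward pass); the function name is made up for illustration.

```python
def static_batch_slot_utilization(output_lengths):
    """Fraction of slot-iterations doing useful work when the batch
    runs until its longest sequence finishes (static batching)."""
    iterations = max(output_lengths)           # batch lasts as long as the longest output
    useful = sum(output_lengths)               # slot-iterations that produce a token
    total = iterations * len(output_lengths)   # slot-iterations that are reserved
    return useful / total


lengths = [1000, 10]
idle_iterations = max(lengths) - min(lengths)
utilization = static_batch_slot_utilization(lengths)
print(idle_iterations)  # 990
print(utilization)      # 0.505
```

In this two-request batch, roughly half of the reserved slot-iterations produce no token at all.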

Core Intuition

Think of static batching as a bus that waits for all passengers to disembark at the final stop before letting anyone new on. Continuous batching is a revolving door: the moment a passenger exits, a new one steps into the empty slot. The GPU Streaming Multiprocessors (SMs) never notice that the workload changed; they simply execute the next iteration's matrix multiplication over whichever tokens are currently active.
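
The bus and the door can be compared with a toy simulation. This is a minimal sketch under strong assumptions: every forward pass produces exactly one token per active sequence, prefill cost is ignored, and output lengths are known up front. Both function names are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Iterations needed when each batch must fully drain before the next starts."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Iterations needed when freed slots are refilled before the next forward pass."""
    waiting = deque(output_lengths)
    running = []  # remaining tokens to generate for each active sequence
    steps = 0
    while waiting or running:
        # Refill empty slots at the iteration boundary.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One forward pass generates one token for every active sequence.
        running = [r - 1 for r in running]
        # Sequences that just finished leave the batch.
        running = [r for r in running if r > 0]
        steps += 1
    return steps


workload = [1000, 10, 1000, 10]
print(static_batching_steps(workload, batch_size=2))      # 2000
print(continuous_batching_steps(workload, batch_size=2))  # 1010
```

With the same two slots, the continuous variant finishes the workload in about half the iterations because the slot freed by each short request is reused immediately.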

Technical Deep Dive

Continuous batching divorces the physical memory footprint of a sequence from its logical length. It requires a virtual memory management system (PagedAttention) in which the Key-Value (KV) cache is divided into fixed-size physical blocks (e.g., 16 or 32 tokens). A central block table maps each sequence's logical token positions to non-contiguous physical blocks scattered across High Bandwidth Memory (HBM). When a sequence emits an end-of-sequence (EOS) token, the system invalidates that sequence's block table entries, returns its physical blocks to the free pool, and can map a newly arrived prompt onto those same blocks in time for the very next iteration (forward pass), not merely the next clock cycle.
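
The block-table bookkeeping described above can be sketched on the CPU side as follows. This is an illustrative toy, not the data structure of any real engine: the `BlockAllocator` class and its method names are assumptions, and it tracks only block ids, not the KV tensors themselves.

```python
class BlockAllocator:
    """Toy KV-cache block manager (hypothetical, for illustration only)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve KV space for one more token, taking a new block on a boundary."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # no block yet, or all blocks full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def physical_slot(self, seq_id, token_idx):
        """Translate a logical token index into (physical block id, offset)."""
        table = self.block_tables[seq_id]
        return table[token_idx // self.block_size], token_idx % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        self.seq_lens.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(20):            # a 20-token sequence needs two 16-token blocks
    alloc.append_token("a")
a_blocks = list(alloc.block_tables["a"])
alloc.free("a")                # sequence "a" emitted EOS
alloc.append_token("b")        # a new request reuses one of the freed blocks
```

Because blocks are fixed-size and interchangeable, freeing a sequence never leaves an unusable hole: any freed block can serve any later sequence, which is the fragmentation property the paragraph relies on.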

Key Takeaways

  • Static batching is obsolete for autoregressive generation.
  • Continuous batching operates at the granularity of a single forward pass.
  • It requires decoupled virtual-to-physical memory mapping.
  • It sustains high arithmetic intensity by keeping the batch constantly full.
  • The CPU scheduler becomes the most critical latency path in the engine, because it runs between every pair of forward passes.