LLM Inference Systems

Continuous Batching Systems

Continuous batching (iteration-level scheduling) admits new requests into the running GPU batch at the next iteration boundary after a sequence completes.

Published June 1, 2026 · By MortalApps · 4 min read · ~644 words

TL;DR

Continuous batching (iteration-level scheduling) admits new requests into the running GPU batch at the next iteration boundary after a sequence completes.
It avoids the severe hardware underutilization that static batching incurs when every slot waits for the longest sequence to finish.
In practice it relies on PagedAttention to eliminate most KV-cache memory fragmentation.
It keeps arithmetic intensity high by holding the batch close to the max_num_seqs and max_num_batched_tokens limits.


Why This Matters

In production, request lengths follow a long-tail distribution. If an engine statically batches a 1,000-token output with a 10-token output, the shorter request keeps holding its batch slot and VRAM while that slot sits idle for the remaining 990 iterations. Continuous batching can increase real-world cluster throughput by up to 40x compared to naive synchronous execution, which translates directly into lower GPU capital expenditure.
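
The arithmetic behind the 990 idle iterations can be checked with a small sketch. The helper names below are hypothetical and the model is deliberately simplified: one generated token per sequence per iteration, with prefill cost ignored.

```python
def idle_iterations(output_lengths):
    """Iterations each slot sits idle while the batch waits for its longest member."""
    longest = max(output_lengths)
    return [longest - n for n in output_lengths]


def static_batch_utilization(output_lengths):
    """Fraction of slot-iterations that produce a token under static batching."""
    iterations = max(output_lengths)            # batch runs until the longest finishes
    total_slot_iterations = iterations * len(output_lengths)
    useful = sum(output_lengths)                # one useful slot-iteration per token
    return useful / total_slot_iterations


print(idle_iterations([1000, 10]))           # [0, 990]
print(static_batch_utilization([1000, 10]))  # 0.505
```

With only two sequences, nearly half of the batch's capacity is wasted; the gap grows as more short requests share a batch with one long one.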

Core Intuition

Think of static batching as a bus that waits for all passengers to disembark at the final stop before letting anyone new on. Continuous batching is a revolving door: as soon as a passenger exits, a new one steps into the empty slot on the next turn. The GPU Streaming Multiprocessors (SMs) never see the workload change; they simply execute the next iteration's matrix multiplication over whichever tokens are currently active.
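
The bus and the revolving door can be compared with a toy simulation. This is a minimal sketch, not any engine's scheduler: the function names are hypothetical, every active sequence emits exactly one token per iteration, output lengths are assumed positive, and prefill is ignored.

```python
from collections import deque


def static_iterations(output_lengths, batch_size):
    """Iterations needed when each batch runs until its longest sequence ends."""
    total = 0
    for start in range(0, len(output_lengths), batch_size):
        total += max(output_lengths[start:start + batch_size])
    return total


def continuous_iterations(output_lengths, batch_size):
    """Iterations needed when freed slots are refilled at every iteration boundary."""
    waiting = deque(output_lengths)
    running = []  # remaining tokens for each active sequence
    iterations = 0
    while waiting or running:
        # Refill empty slots before the next forward pass.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One forward pass: every active sequence emits one token.
        running = [remaining - 1 for remaining in running]
        iterations += 1
        # Retire sequences that just finished, freeing their slots.
        running = [remaining for remaining in running if remaining > 0]
    return iterations


lengths = [1000, 10, 10, 10]
print(static_iterations(lengths, batch_size=2))      # 1010
print(continuous_iterations(lengths, batch_size=2))  # 1000
```

In the continuous case the three short requests pass through the second slot one after another while the long request is still decoding, so they add no extra iterations.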

Technical Deep Dive

Continuous batching works best when the physical memory footprint of a sequence is decoupled from its logical length. It relies on a virtual memory management system (PagedAttention) in which the Key-Value (KV) cache is divided into fixed-size physical blocks (e.g., 16 or 32 tokens). A central block table maps each sequence's logical token positions to non-contiguous physical blocks scattered across High Bandwidth Memory (HBM). When an end-of-sequence (EOS) token is emitted, the system invalidates that sequence's block table entries, returns its physical blocks to the free pool, and can map a waiting prompt onto those same blocks for the very next iteration.
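
The block-table idea can be illustrated with a small CPU-only sketch. The class BlockAllocator and its methods are hypothetical names for illustration; a real engine tracks blocks on the GPU and handles sharing, eviction, and preemption, none of which appear here.

```python
class BlockAllocator:
    """Hands out fixed-size physical KV-cache blocks from a free list."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks - 1, -1, -1))  # pop() yields block 0 first
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.num_tokens = {}    # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block only when needed."""
        table = self.block_tables.get(seq_id, [])
        count = self.num_tokens.get(seq_id, 0)
        if count == len(table) * self.block_size:  # all current blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.block_tables[seq_id] = table
        self.num_tokens[seq_id] = count + 1

    def physical_slot(self, seq_id, token_index):
        """Translate a logical token position to (physical block, offset)."""
        block = self.block_tables[seq_id][token_index // self.block_size]
        return block, token_index % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free list (e.g. on EOS)."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(20):
    alloc.append_token("a")          # 20 tokens need two blocks
for _ in range(5):
    alloc.append_token("b")          # 5 tokens need one block
print(alloc.block_tables["a"])       # [0, 1]
print(alloc.physical_slot("a", 17))  # (1, 1)
alloc.free("a")                      # sequence "a" emitted EOS
alloc.append_token("c")              # a new prompt reuses a freed block
print(alloc.block_tables["c"])       # [1]
```

Because blocks are allocated one at a time as tokens arrive, a sequence never reserves memory for output it has not produced, and a finished sequence's blocks become available to the next request without any copying or compaction.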

Key Takeaways

Static batching is a poor fit for autoregressive serving when output lengths vary.

Continuous batching operates at the granularity of a single forward pass.

It requires decoupled virtual-to-physical memory mapping.

It keeps arithmetic intensity high by keeping the batch as full as the configured limits allow.

The CPU scheduler becomes the most critical latency path in the engine.
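
The scheduler's per-iteration admission decision can be sketched as a budget check. The limit names max_num_seqs and max_num_batched_tokens come from the summary at the top of this article; the function admit and its first-come-first-served policy are hypothetical simplifications, and a production scheduler also deals with KV-cache availability, preemption, and chunked prompts.

```python
def admit(running, waiting, max_num_seqs, max_num_batched_tokens):
    """Count how many waiting prompts can join the next iteration.

    running: number of sequences already decoding (one token each this step).
    waiting: prompt lengths in arrival order.
    """
    seqs = running
    tokens = running  # each decoding sequence contributes one token
    admitted = 0
    for prompt_len in waiting:
        if seqs + 1 > max_num_seqs:
            break  # no free sequence slot
        if tokens + prompt_len > max_num_batched_tokens:
            break  # token budget exhausted; do not skip ahead of this prompt
        seqs += 1
        tokens += prompt_len
        admitted += 1
    return admitted


# Six sequences are decoding; the 300-token prompt does not fit the token budget.
print(admit(6, [100, 300, 50], max_num_seqs=8, max_num_batched_tokens=256))  # 1
# Seven sequences are decoding; only one sequence slot remains.
print(admit(7, [10, 10], max_num_seqs=8, max_num_batched_tokens=256))        # 1
```

This check runs on the CPU between forward passes, which is why scheduler overhead sits directly on the latency path of every iteration.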
