LLM Inference Systems

Dynamic Batch Size Tuning

Dynamic batching grows and shrinks the batch at every decode step to keep the GPU's streaming multiprocessors (SMs) busy.

Published June 1, 2026 · By MortalApps · 3 min read · ~583 words

TL;DR

Dynamic batching grows and shrinks the batch size from step to step to maximize SM utilization.
CPU dispatch overhead (Python logic, driver API calls) can dominate per-step latency when batch shapes change every step.
Advanced runtimes address this by building CUDA Graphs for bucketed tensor shapes asynchronously, without blocking the GPU, and replaying them in place of per-kernel launches.


Why This Matters

If the CPU has to run Python scheduling logic to choose the next batch size, pad tensors, and launch each CUDA kernel individually, that overhead can exceed the runtime of the GEMV operations themselves, which finish in microseconds during decode. Dynamic batching is needed to keep cluster throughput high; keeping latency low requires taking CPU overhead off the critical path.

Core Intuition

Imagine a race car (the GPU) that can complete a lap in 10 microseconds. If the pit crew (the CPU and driver) takes 50 microseconds to give instructions for the next lap, the car's speed is irrelevant. CUDA Graphs act as a pre-programmed autonomous driving map. The CPU hands the GPU the map once, and the GPU executes the entire sequence of operations without asking for directions.
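The analogy can be turned into rough arithmetic. The sketch below is a deliberately simple model, not a measurement: it assumes that in eager mode each kernel costs whichever is slower, the CPU launch or the GPU run, and that a graph pays one launch cost for the whole step. The function name, the kernel count of 100, and the 50-microsecond graph launch cost are illustrative assumptions; only the 10 and 50 microsecond figures come from the analogy.

```python
def step_time_us(num_kernels, gpu_us_per_kernel, cpu_us_per_launch,
                 graph_launch_us=None):
    """Rough per-step time in microseconds under a simplified model.

    Eager mode: the slower of CPU launch and GPU execution sets the pace
    for every kernel. Graph mode: one launch, then kernels run back to back.
    """
    if graph_launch_us is None:
        return num_kernels * max(gpu_us_per_kernel, cpu_us_per_launch)
    return graph_launch_us + num_kernels * gpu_us_per_kernel


# Assumed workload: 100 kernels per decode step (hypothetical).
eager = step_time_us(100, gpu_us_per_kernel=10, cpu_us_per_launch=50)
graph = step_time_us(100, gpu_us_per_kernel=10, cpu_us_per_launch=50,
                     graph_launch_us=50)
print(eager, graph)  # 5000 1050
```

Under these assumed numbers the GPU does 1000 microseconds of useful work in both cases; eager dispatch spends the remaining 4000 microseconds waiting on the CPU.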

Technical Deep Dive

Native CUDA Graphs require static tensor shapes and deterministic control flow, which conflicts with variable-length attention and dynamic batching. To bridge this gap, modern runtimes use hybrid execution. They maintain a cache of pre-captured CUDA Graphs for specific "bucketed" sequence lengths and batch sizes (e.g., bs=1, 2, 4, 8). When the dynamic scheduler decides that a batch size of 5 is optimal, the runtime rounds up to the next bucket, pads the tensors to fit the pre-captured batch-size-8 graph, and replays it. The replay replaces many per-kernel launches with a single graph launch, which removes most of the CPU driver overhead at the cost of some wasted compute on the padded rows.
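The round-up-and-pad step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not any runtime's actual implementation: `pick_bucket`, `pad_batch`, `GraphCache`, and `fake_capture` are hypothetical names, the "graph" is an ordinary callable standing in for a captured CUDA Graph, and falling back to eager execution for oversized batches is an assumed policy.

```python
from bisect import bisect_left

# Bucket sizes taken from the article's example.
BUCKETS = [1, 2, 4, 8]


def pick_bucket(batch_size, buckets=BUCKETS):
    """Return the smallest bucket >= batch_size, or None if none fits."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    i = bisect_left(buckets, batch_size)
    return buckets[i] if i < len(buckets) else None


def pad_batch(rows, bucket, pad_value=0):
    """Append padding rows so the batch has exactly `bucket` entries."""
    return rows + [pad_value] * (bucket - len(rows))


class GraphCache:
    """Stand-in for a cache of pre-captured graphs keyed by batch size."""

    def __init__(self, buckets, capture_fn):
        self._buckets = sorted(buckets)
        # capture_fn(bucket) returns a callable that "replays" that graph.
        self._graphs = {b: capture_fn(b) for b in self._buckets}

    def run(self, rows, eager_fn):
        bucket = pick_bucket(len(rows), self._buckets)
        if bucket is None:
            return eager_fn(rows)  # assumed fallback: no graph is large enough
        out = self._graphs[bucket](pad_batch(rows, bucket))
        return out[:len(rows)]  # drop outputs produced for padding rows


def fake_capture(bucket):
    """Hypothetical capture: the replay only accepts its fixed batch size."""
    def replay(rows):
        assert len(rows) == bucket, "captured graphs have static shapes"
        return [r + 1 for r in rows]
    return replay


cache = GraphCache(BUCKETS, fake_capture)
print(pick_bucket(5))                                    # 8
print(cache.run([1, 2, 3, 4, 5], eager_fn=lambda r: r))  # [2, 3, 4, 5, 6]
```

Coarser buckets mean fewer graphs to capture and store but more padded rows per replay; finer buckets trade the other way.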

Key Takeaways

Dynamic batching changes tensor shapes from step to step, which works against the static shapes that efficient GPU execution favors.

CPU dispatch time often exceeds actual GEMV compute time during decode.

Pre-captured, bucketed CUDA Graphs remove most per-kernel launch overhead by replaying static topologies.

Asynchronous JIT execution must be overlapped with graph replay to handle stochastic sampling.
