Dynamic Batch Size Tuning
Dynamic batching continuously dilates and constricts batch sizes step-by-step to maximize SM utilization.
Source: mortalapps.com
- Dynamic batching continuously dilates and constricts batch sizes step-by-step to maximize SM utilization.
- CPU dispatch overhead (Python logic, driver API calls) can dominate per-step latency and erase the gains from dynamic batching.
- Advanced runtimes address this by capturing CUDA Graphs for a fixed set of shape buckets and replaying them, so changing batch shapes do not stall the GPU on per-kernel CPU launches.
Why This Matters
If every step waits for Python code on the CPU to choose the next batch size, pad tensors, and launch CUDA kernels one at a time, CPU overhead swamps the execution time of the microsecond-scale GEMV operations. Dynamic batching is needed to sustain cluster throughput; removing CPU overhead is needed to keep latency low.
Core Intuition
Imagine a race car (the GPU) that can complete a lap in 10 microseconds. If the pit crew (the CPU driver) takes 50 microseconds to give instructions for the next lap, the car's speed is irrelevant. CUDA graphs act as a pre-programmed autonomous driving map. The CPU hands the GPU the map once, and the GPU executes the entire sequence of operations without asking for directions.
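The analogy can be turned into a rough back-of-the-envelope model. The sketch below is illustrative only: it assumes launch overhead and GPU work are fully serialized (no overlap), reuses the 10 µs and 50 µs figures from the analogy, and the function name `gpu_utilization` and the count of 100 kernels are hypothetical. Real runtimes overlap CPU launches with GPU execution, so actual numbers will differ.

```python
def gpu_utilization(kernel_us, launch_us, num_kernels, num_launches):
    """Fraction of wall time spent on GPU work.

    Assumes CPU launch overhead and GPU execution never overlap,
    which is a pessimistic simplification matching the analogy.
    """
    busy = kernel_us * num_kernels
    total = busy + launch_us * num_launches
    return busy / total


# One CPU launch per kernel: every 10 us "lap" waits on a 50 us "pit stop".
eager = gpu_utilization(kernel_us=10, launch_us=50, num_kernels=100, num_launches=100)

# One graph launch covering all 100 kernels: the 50 us cost is paid once.
graphed = gpu_utilization(kernel_us=10, launch_us=50, num_kernels=100, num_launches=1)

print(f"per-kernel launches: {eager:.1%}")
print(f"single graph launch: {graphed:.1%}")
```

Under these assumptions the per-kernel case keeps the GPU busy for 1000 of 6000 µs, while the single-launch case keeps it busy for 1000 of 1050 µs.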
Technical Deep Dive
Native CUDA Graphs require static tensor shapes and a fixed control flow, which conflicts with variable-length attention and dynamic batching. To bridge this gap, modern runtimes use hybrid execution: they keep a cache of pre-captured CUDA graphs for specific "bucketed" sequence lengths and batch sizes (e.g., bs=1, 2, 4, 8). When the dynamic scheduler decides that a batch size of 5 is optimal, the runtime rounds up to the next bucket, pads the tensors to fit the pre-captured batch-size-8 graph, and replays it. The replay is a single graph launch, so per-kernel CPU dispatch overhead is avoided at the cost of computing three padded rows.
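The round-up-and-pad step can be sketched in plain Python. This is a minimal illustration, not any particular runtime's implementation: `BUCKETS`, `pick_bucket`, `pad_batch`, and `GraphCache` are hypothetical names, and the "captured graph" is an ordinary Python callable standing in for a real CUDA graph so the example runs without a GPU.

```python
import bisect

# Bucket sizes taken from the example in the text (bs=1, 2, 4, 8).
BUCKETS = [1, 2, 4, 8]


def pick_bucket(batch_size, buckets=BUCKETS):
    """Return the smallest bucket >= batch_size, or None if none is large enough."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    i = bisect.bisect_left(buckets, batch_size)
    return buckets[i] if i < len(buckets) else None


def pad_batch(rows, bucket, pad_row):
    """Pad a list of per-request rows with dummy rows up to the bucket size."""
    return rows + [pad_row] * (bucket - len(rows))


class GraphCache:
    """Stand-in for a cache of pre-captured graphs, keyed by bucket size."""

    def __init__(self, buckets=BUCKETS):
        # A real runtime would capture one CUDA graph per bucket here.
        self.graphs = {b: self._capture(b) for b in buckets}

    @staticmethod
    def _capture(bucket):
        def replay(padded_rows):
            # The shape is fixed at capture time, so replay rejects other sizes.
            assert len(padded_rows) == bucket
            return [sum(row) for row in padded_rows]  # placeholder "model"
        return replay

    def run(self, rows, pad_row):
        bucket = pick_bucket(len(rows), sorted(self.graphs))
        if bucket is None:
            raise ValueError("batch larger than any captured bucket")
        outputs = self.graphs[bucket](pad_batch(rows, bucket, pad_row))
        return outputs[:len(rows)]  # discard results for padding rows


cache = GraphCache()
results = cache.run([[1, 2]] * 5, pad_row=[0, 0])  # batch of 5 uses the bs=8 graph
```

A batch larger than the biggest bucket has no matching graph; this sketch raises an error, whereas a real runtime would more likely fall back to ordinary per-kernel execution or split the batch.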