CUDA Graphs and Launch Overhead Elimination
Source: mortalapps.com
- Standard CUDA execution makes the CPU pay a launch overhead for every individual kernel it submits to a GPU stream.
- CUDA Graphs let developers define a whole topology of kernels and their dependencies once, then launch it with a single CPU operation.
- Graphs can be built explicitly through graph API calls or implicitly through Stream Capture.
- Production workflows often use a Hybrid approach to handle dynamic parameters without paying the cost of recapturing and re-instantiating the graph.
Why This Matters
Modern AI models, particularly those with deeply stacked layers or heavily fused operations, frequently run kernels that take only a few microseconds on the GPU. The CPU overhead to dispatch a single kernel, however, is typically 5 to 10 microseconds. When GPU execution time drops below CPU dispatch time, the workload becomes CPU-bound: the GPU finishes each kernel and then idles while it waits for the next one. CUDA Graphs replace the per-kernel dispatch with a single launch for the whole graph, which removes most of this CPU overhead and keeps the GPU busy. The gain is largest for workloads made of many short kernels, such as the decode phase of LLM inference.
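The dispatch cost quoted above varies with the CPU, driver, and toolkit, so it is worth measuring locally. The following is a minimal sketch, not a rigorous benchmark: it launches an empty kernel many times and reports the average wall-clock cost per launched kernel. The kernel name emptyKernel and the sample size kLaunches are illustrative choices, not part of any library.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Deliberately empty kernel: GPU work is close to zero, so the timed loop
// below is dominated by per-launch cost rather than by computation.
__global__ void emptyKernel() {}

int main() {
    const int kLaunches = 10000;  // arbitrary sample size (assumption)

    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) {
        std::fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }

    // Warm-up launch so one-time context and module setup is not timed.
    emptyKernel<<<1, 1, 0, stream>>>();
    cudaStreamSynchronize(stream);

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < kLaunches; ++i) {
        emptyKernel<<<1, 1, 0, stream>>>();
    }
    // Wait for the GPU to drain the stream before stopping the clock.
    cudaStreamSynchronize(stream);
    auto stop = std::chrono::steady_clock::now();

    double totalUs =
        std::chrono::duration<double, std::micro>(stop - start).count();
    std::printf("average cost per launched kernel: %.2f us\n",
                totalUs / kLaunches);

    cudaStreamDestroy(stream);
    return 0;
}
```

Because launches are asynchronous, the figure reflects sustained throughput for back-to-back tiny kernels, which is the regime where the launch path is the bottleneck. It can be built with nvcc in the usual way.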
Core Intuition
Imagine a fast-food kitchen (the GPU) operating under the direction of a cashier (the CPU). Normally, the cashier reads an order, walks to the kitchen, tells the fry cook what to do, walks back, reads the next order, and tells the grill cook. Even if the cooks are incredibly fast, they finish their task quickly and stand idle, waiting for the cashier to physically walk back and deliver the next order (Launch Overhead). CUDA Graphs act as a digital, automated display system. The cashier writes the entire day's workflow schedule exactly once (Instantiation), and the cooks execute the dependent steps autonomously based on the screen, completely eliminating the cashier's intervention for every micro-task.
Technical Deep Dive
Work submission with CUDA Graphs proceeds in three distinct stages: Definition, Instantiation, and Execution. During Definition, the developer creates a template of operations (nodes) and the dependencies between them (edges). This is done either by adding nodes explicitly (for example with cudaGraphAddKernelNode) or by bracketing ordinary stream operations with cudaStreamBeginCapture and cudaStreamEndCapture. During Instantiation, the driver takes a snapshot of the graph template, validates it, and performs as much setup and initialization as possible ahead of time so that little work remains at launch. The result is an executable graph, a cudaGraphExec_t. During Execution, the executable graph is launched into a stream with a single call to cudaGraphLaunch, and it can be relaunched any number of times without repeating instantiation.
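The three stages can be sketched with stream capture as follows. This is an illustrative example under stated assumptions, not a reference implementation: the kernel addOne, the CHECK macro, and the constants are made up for the demonstration, and cudaGraphInstantiateWithFlags is used on the assumption of CUDA 11.4 or newer.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: print the failing call and exit main with an error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Illustrative short kernel: adds 1 to every element.
__global__ void addOne(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 16;
    const int kKernelsPerGraph = 20;  // kernels recorded in one graph
    const int kReplays = 5;           // number of graph launches

    cudaStream_t stream;
    float* d = nullptr;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMalloc(&d, n * sizeof(float)));
    CHECK(cudaMemset(d, 0, n * sizeof(float)));

    // Definition: kernels issued between Begin/EndCapture are recorded as
    // graph nodes instead of being executed.
    cudaGraph_t graph;
    CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    for (int k = 0; k < kKernelsPerGraph; ++k) {
        addOne<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    }
    CHECK(cudaStreamEndCapture(stream, &graph));

    // Instantiation: validate the template and build the executable graph.
    cudaGraphExec_t exec;
    CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));

    // Execution: each cudaGraphLaunch submits all recorded kernels at once.
    for (int r = 0; r < kReplays; ++r) {
        CHECK(cudaGraphLaunch(exec, stream));
    }
    CHECK(cudaStreamSynchronize(stream));

    // Every element was incremented kKernelsPerGraph * kReplays times.
    std::vector<float> h(n);
    CHECK(cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost));
    const float expected = static_cast<float>(kKernelsPerGraph * kReplays);
    std::printf("h[0] = %.1f (expected %.1f)\n", h[0], expected);

    CHECK(cudaGraphExecDestroy(exec));
    CHECK(cudaGraphDestroy(graph));
    CHECK(cudaFree(d));
    CHECK(cudaStreamDestroy(stream));
    return h[0] == expected ? 0 : 1;
}
```

The capture and instantiation cost is paid once. Each replay then costs one CPU call for twenty kernels, which is where the savings over per-kernel launches come from.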
A major challenge is managing Dynamic Parameters. Stream capture records kernel arguments by value at capture time. If an argument changes in the next iteration (e.g., sequence length or an updated data pointer), replaying the captured graph would still use the old values, so the graph no longer matches the work that is needed. Recapturing and re-instantiating the entire graph on every change is expensive. To resolve this, a hybrid approach uses Stream Capture for the static topology and retrieves the graph under construction, together with the capture's current dependency set, using cudaStreamGetCaptureInfo_v2. It then manually adds the dynamic nodes via cudaGraphAddKernelNode and keeps a handle to each one, passes them to cudaStreamUpdateCaptureDependencies so that subsequently captured work is ordered after them, and later applies fast, lightweight parameter updates to the instantiated graph with cudaGraphExecKernelNodeSetParams.
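A minimal sketch of the hybrid approach might look like the following. It assumes a toolkit in which cudaStreamGetCaptureInfo_v2 and the four-argument cudaStreamUpdateCaptureDependencies are available (to the best of our knowledge roughly CUDA 11.4 through 12.x; later toolkits changed these signatures, so check the headers in use). The kernels fill and scale and the CHECK macro are hypothetical names for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print the failing call and exit main with an error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Static part of the workload: arguments never change between launches.
__global__ void fill(float* out, float value, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

// Dynamic part: `factor` changes between launches.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    int n = 1024;
    float factor = 2.0f;
    float* d = nullptr;
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMalloc(&d, n * sizeof(float)));

    // Parameters for the manually added node; the argument values are
    // copied when the node is added and again on each SetParams call.
    void* args[] = {&d, &factor, &n};
    cudaKernelNodeParams params = {};
    params.func = (void*)scale;
    params.gridDim = dim3((n + 255) / 256);
    params.blockDim = dim3(256);
    params.sharedMemBytes = 0;
    params.kernelParams = args;
    params.extra = nullptr;

    // Capture the static topology as usual.
    CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    fill<<<(n + 255) / 256, 256, 0, stream>>>(d, 1.0f, n);

    // Ask the capture for its graph and current dependency set
    // (version-dependent API, see the assumption stated above).
    cudaStreamCaptureStatus status;
    cudaGraph_t capturing = nullptr;
    const cudaGraphNode_t* deps = nullptr;
    size_t numDeps = 0;
    CHECK(cudaStreamGetCaptureInfo_v2(stream, &status, nullptr, &capturing,
                                      &deps, &numDeps));

    // Add the dynamic node by hand and keep its handle for later updates.
    cudaGraphNode_t scaleNode;
    CHECK(cudaGraphAddKernelNode(&scaleNode, capturing, deps, numDeps,
                                 &params));
    // Make any further captured work depend on the new node.
    CHECK(cudaStreamUpdateCaptureDependencies(
        stream, &scaleNode, 1, cudaStreamSetCaptureDependencies));

    cudaGraph_t graph;
    CHECK(cudaStreamEndCapture(stream, &graph));
    cudaGraphExec_t exec;
    CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));

    float h = 0.0f;
    CHECK(cudaGraphLaunch(exec, stream));
    CHECK(cudaStreamSynchronize(stream));
    CHECK(cudaMemcpy(&h, d, sizeof(float), cudaMemcpyDeviceToHost));
    std::printf("factor 2: d[0] = %.1f\n", h);  // fill(1.0) then *2

    // Lightweight update: new argument value, no recapture, no
    // re-instantiation.
    factor = 5.0f;
    CHECK(cudaGraphExecKernelNodeSetParams(exec, scaleNode, &params));
    CHECK(cudaGraphLaunch(exec, stream));
    CHECK(cudaStreamSynchronize(stream));
    CHECK(cudaMemcpy(&h, d, sizeof(float), cudaMemcpyDeviceToHost));
    std::printf("factor 5: d[0] = %.1f\n", h);  // fill(1.0) then *5

    CHECK(cudaGraphExecDestroy(exec));
    CHECK(cudaGraphDestroy(graph));
    CHECK(cudaFree(d));
    CHECK(cudaStreamDestroy(stream));
    return h == 5.0f ? 0 : 1;
}
```

The design point is that only nodes added by hand have handles the application controls, so only those can be retargeted cheaply. Everything else stays in ordinary stream code and is captured unchanged.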