CUDA Graphs and Launch Overhead Elimination

In standard CUDA execution, the CPU pays a measurable launch overhead for every individual kernel it submits to a GPU stream.

Source: mortalapps.com
TL;DR
  • Every kernel submitted to a GPU stream the standard way costs the CPU a separate launch, and that overhead adds up.
  • CUDA Graphs let developers define a whole graph of kernels and their dependencies once, then launch it with a single CPU call.
  • Graphs can be built explicitly through graph API calls or implicitly through stream capture.
  • Production workflows often use a hybrid approach to update dynamic parameters efficiently without paying the cost of recapturing the graph.

Why This Matters

Modern AI models, particularly those with many stacked layers or heavily fused operations, frequently execute kernels that take only a few microseconds on the GPU. The CPU-side cost of dispatching a single kernel is of the same order, typically around 5 to 10 microseconds. When GPU execution time drops below CPU dispatch time, the GPU starves: it finishes each kernel and idles until the next launch arrives. CUDA Graphs remove most of this per-kernel CPU cost by replacing many launches with one, which keeps the GPU far better utilized and can yield large throughput gains, especially in the decode phase of LLM inference.
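As a rough back-of-the-envelope model (an illustrative assumption, not a measurement): suppose each kernel runs for $t_k$ on the GPU, each launch costs the CPU $t_\ell$, and launches are issued back to back. The fraction of time the GPU is busy, $U$, is then approximately as follows, shown with a hypothetical 2 µs kernel and the lower end of the dispatch range quoted above:

```latex
U \approx \frac{t_k}{\max(t_k,\; t_\ell)},
\qquad
t_k = 2\,\mu\text{s},\quad t_\ell = 5\,\mu\text{s}
\;\Longrightarrow\;
U \approx \frac{2}{5} = 40\%
```

Under these assumed numbers the GPU sits idle for more than half of the wall-clock time, which is the gap a single graph launch is meant to close.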

Core Intuition

Imagine a fast-food kitchen (the GPU) run by a cashier (the CPU). Normally, the cashier reads an order, walks to the kitchen, tells the fry cook what to do, walks back, reads the next order, and then tells the grill cook. Because the cooks are fast, they finish each task quickly and then stand idle until the cashier walks back with the next instruction. That waiting is the launch overhead. CUDA Graphs act like an automated kitchen display. The cashier writes out the full workflow once (definition and instantiation), and the cooks then work through the dependent steps on their own from the screen, with no cashier involvement for each small task.

Technical Deep Dive

Work submission with CUDA Graphs has three distinct stages: Definition, Instantiation, and Execution. During Definition, the developer creates a template of operations (nodes) and the dependencies between them (edges). This is done either by adding nodes explicitly (for example with cudaGraphAddKernelNode) or by bracketing ordinary stream operations with cudaStreamBeginCapture and cudaStreamEndCapture, which records the enqueued work into a cudaGraph_t instead of running it. During Instantiation, the driver takes a snapshot of the graph template, validates it, and performs as much setup and initialization as it can up front so that little remains to be done at launch. The result is an executable graph, a cudaGraphExec_t. During Execution, the executable graph is launched into a stream with a single cudaGraphLaunch call, and it can be relaunched any number of times without repeating the instantiation.
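The three stages can be sketched with stream capture roughly as follows. This is a minimal illustration, not a reference implementation: the kernel add_one, the helper run_captured_graph, and the CUDA_CHECK macro are hypothetical names introduced here, and the sketch assumes CUDA 11.4 or newer so that cudaGraphInstantiateWithFlags is available.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s\n",                      \
                         cudaGetErrorString(err_));                       \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Hypothetical tiny kernel: adds 1 to every element.
__global__ void add_one(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Captures `kernels_per_graph` launches into one graph, replays the graph
// `replays` times, and returns the first element of the buffer.
float run_captured_graph(int n, int kernels_per_graph, int replays) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaStream_t stream;
    float* d_x = nullptr;
    CUDA_CHECK(cudaStreamCreate(&stream));  // capture needs a non-default stream
    CUDA_CHECK(cudaMalloc(&d_x, n * sizeof(float)));  // allocate before capture
    CUDA_CHECK(cudaMemsetAsync(d_x, 0, n * sizeof(float), stream));

    // Definition: launches are recorded into `graph`, not executed.
    cudaGraph_t graph;
    CUDA_CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    for (int k = 0; k < kernels_per_graph; ++k) {
        add_one<<<blocks, threads, 0, stream>>>(d_x, n);
    }
    CUDA_CHECK(cudaStreamEndCapture(stream, &graph));

    // Instantiation: one-time validation and setup.
    cudaGraphExec_t exec;
    CUDA_CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));

    // Execution: one CPU call per replay, regardless of kernel count.
    for (int r = 0; r < replays; ++r) {
        CUDA_CHECK(cudaGraphLaunch(exec, stream));
    }
    CUDA_CHECK(cudaStreamSynchronize(stream));

    float first = 0.0f;
    CUDA_CHECK(cudaMemcpy(&first, d_x, sizeof(float), cudaMemcpyDeviceToHost));

    CUDA_CHECK(cudaGraphExecDestroy(exec));
    CUDA_CHECK(cudaGraphDestroy(graph));
    CUDA_CHECK(cudaFree(d_x));
    CUDA_CHECK(cudaStreamDestroy(stream));
    return first;
}
```

Calling run_captured_graph(1024, 3, 5) issues five graph launches in place of fifteen individual kernel launches, and every element ends up at 15. The buffer is allocated before capture begins because allocation calls such as cudaMalloc are not permitted while a stream is capturing in global mode.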

A major challenge is handling dynamic parameters. Stream capture records kernel arguments by value at capture time. If an argument changes in a later iteration (for example, the sequence length or a data pointer), replaying the graph would still use the stale values. Recapturing and re-instantiating the whole graph every time is expensive. A hybrid approach addresses this: it uses stream capture for the static part of the topology, retrieves the graph under construction with cudaStreamGetCaptureInfo_v2, and adds the dynamic kernels explicitly with cudaGraphAddKernelNode so that their node handles are retained. During capture, cudaStreamUpdateCaptureDependencies is what makes subsequently captured work depend on the manually added nodes. The retained handles later allow fast, lightweight parameter updates on the instantiated graph through cudaGraphExecKernelNodeSetParams, with no recapture.
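The parameter-update half of this approach can be sketched in isolation. The example below is a simplified illustration under stated assumptions: it adds the kernel node to an explicitly created graph rather than extracting one from an in-progress capture, so it does not call cudaStreamGetCaptureInfo_v2 at all. The kernel scale, the helper update_kernel_param_demo, and the CUDA_CHECK macro are hypothetical names, and CUDA 11.4 or newer is assumed.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s\n",                      \
                         cudaGetErrorString(err_));                       \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Hypothetical kernel: multiplies every element by `factor`.
__global__ void scale(float* x, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

// Launches the same executable graph twice with different `factor` values
// and returns the first element of the buffer afterwards.
float update_kernel_param_demo() {
    int n = 1024;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    std::vector<float> h_x(n, 1.0f);
    float* d_x = nullptr;
    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));
    CUDA_CHECK(cudaMalloc(&d_x, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(d_x, h_x.data(), n * sizeof(float),
                          cudaMemcpyHostToDevice));

    // Kernel arguments are passed as an array of pointers to each argument.
    float factor = 2.0f;
    void* args[] = {&d_x, &factor, &n};

    cudaKernelNodeParams params = {};
    params.func = (void*)scale;
    params.gridDim = dim3(blocks);
    params.blockDim = dim3(threads);
    params.sharedMemBytes = 0;
    params.kernelParams = args;
    params.extra = nullptr;

    // Explicit definition: keep the node handle for later updates.
    cudaGraph_t graph;
    cudaGraphNode_t node;
    CUDA_CHECK(cudaGraphCreate(&graph, 0));
    CUDA_CHECK(cudaGraphAddKernelNode(&node, graph, nullptr, 0, &params));

    cudaGraphExec_t exec;
    CUDA_CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));
    CUDA_CHECK(cudaGraphLaunch(exec, stream));  // x = 1 * 2

    // Update only this node's arguments in the executable graph.
    // args[1] still points at `factor`, so changing it is enough.
    factor = 10.0f;
    CUDA_CHECK(cudaGraphExecKernelNodeSetParams(exec, node, &params));
    CUDA_CHECK(cudaGraphLaunch(exec, stream));  // x = 2 * 10

    CUDA_CHECK(cudaStreamSynchronize(stream));
    float first = 0.0f;
    CUDA_CHECK(cudaMemcpy(&first, d_x, sizeof(float), cudaMemcpyDeviceToHost));

    CUDA_CHECK(cudaGraphExecDestroy(exec));
    CUDA_CHECK(cudaGraphDestroy(graph));
    CUDA_CHECK(cudaFree(d_x));
    CUDA_CHECK(cudaStreamDestroy(stream));
    return first;
}
```

The update applies only to future launches of the executable graph, so the first launch still sees a factor of 2 even though it may not have finished when the parameters are changed. The graph is instantiated once and launched twice.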

Key Takeaways

CUDA Graphs address CPU launch bottlenecks by moving most of the cost of kernel dispatch out of the execution path and into one-time initialization.
Stream capture makes converting existing stream-based code into a graph nearly mechanical, but it freezes kernel arguments at their captured values.
To handle dynamic shapes, either update parameters explicitly with cudaGraphExecKernelNodeSetParams or copy new data into fixed buffers whose addresses the graph already references.
Once launched, a graph runs its whole dependency chain on the GPU with no per-kernel CPU involvement, which makes it well suited to repetitive workloads such as autoregressive LLM decoding.
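The fixed-buffer option for dynamic inputs, where new data is copied into addresses the graph already references, can be sketched as follows. This is a minimal illustration under assumptions: the kernel scale_from_buffer, the helper static_buffer_demo, and the CUDA_CHECK macro are hypothetical names, and CUDA 11.4 or newer is assumed. The kernel reads its changing input from device memory, so the captured arguments (the pointers) never change.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s\n",                      \
                         cudaGetErrorString(err_));                       \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Hypothetical kernel: the factor lives in device memory, so only the
// pointer (which stays fixed) is baked into the captured graph.
__global__ void scale_from_buffer(float* x, const float* factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= *factor;
}

// Replays one captured graph twice, refreshing the factor buffer before
// each replay, and returns the first element of the data buffer.
float static_buffer_demo() {
    const int n = 1024;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    std::vector<float> h_x(n, 1.0f);
    float* d_x = nullptr;
    float* d_factor = nullptr;
    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));
    CUDA_CHECK(cudaMalloc(&d_x, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&d_factor, sizeof(float)));
    CUDA_CHECK(cudaMemcpy(d_x, h_x.data(), n * sizeof(float),
                          cudaMemcpyHostToDevice));

    // Capture once; the graph records the addresses d_x and d_factor.
    cudaGraph_t graph;
    CUDA_CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    scale_from_buffer<<<blocks, threads, 0, stream>>>(d_x, d_factor, n);
    CUDA_CHECK(cudaStreamEndCapture(stream, &graph));

    cudaGraphExec_t exec;
    CUDA_CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));

    // Before each replay, overwrite the contents of the fixed buffer.
    const float factors[2] = {2.0f, 3.0f};
    for (const float& f : factors) {
        CUDA_CHECK(cudaMemcpyAsync(d_factor, &f, sizeof(float),
                                   cudaMemcpyHostToDevice, stream));
        CUDA_CHECK(cudaGraphLaunch(exec, stream));  // x *= 2, then x *= 3
    }
    CUDA_CHECK(cudaStreamSynchronize(stream));

    float first = 0.0f;
    CUDA_CHECK(cudaMemcpy(&first, d_x, sizeof(float), cudaMemcpyDeviceToHost));

    CUDA_CHECK(cudaGraphExecDestroy(exec));
    CUDA_CHECK(cudaGraphDestroy(graph));
    CUDA_CHECK(cudaFree(d_factor));
    CUDA_CHECK(cudaFree(d_x));
    CUDA_CHECK(cudaStreamDestroy(stream));
    return first;
}
```

Because the copy and the graph launch are enqueued on the same stream, each replay sees the value written just before it. No graph update call is needed; the trade-off is an extra small copy per iteration.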