CUDA Graphs and Launch Overhead Elimination

In standard CUDA execution, the CPU pays a measurable launch overhead for every kernel it submits to a GPU stream.

Published June 1, 2026 · By MortalApps · 6 min read · ~1,106 words

TL;DR

In standard CUDA execution, the CPU pays a measurable launch overhead for every kernel it submits to a GPU stream.
CUDA Graphs let developers define a whole topology of kernels and their dependencies once, then launch it with a single CPU operation.
Graphs can be built explicitly through graph API calls or implicitly through Stream Capture.
Production workflows often use a hybrid approach so that dynamic parameters can be updated without the cost of recapturing and re-instantiating the graph.

Why This Matters

Modern AI models, particularly those with many stacked layers or heavily fused operations, frequently execute kernels that take only a few microseconds on the GPU. Dispatching a single kernel from the CPU, however, typically costs on the order of 5 to 10 microseconds. When a kernel finishes faster than the CPU can dispatch the next one, the GPU starves, idling until more work arrives. CUDA Graphs replace many per-kernel launches with one launch per graph, which removes most of this CPU-bound overhead and keeps the GPU busy. The gain is largest in workloads made of many short kernels, such as the decoding phase of LLM inference.
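One rough way to see this effect on a given machine is to time a batch of empty kernels submitted one by one against the same batch replayed as a graph. The sketch below is illustrative only: the kernel, the struct, and the function name are hypothetical, the three-argument cudaGraphInstantiate call assumes a CUDA 12.x toolkit, and the numbers it produces depend entirely on the hardware and driver.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Deliberately empty (hypothetical kernel): its GPU run time is negligible,
// so the measured time is dominated by launch and scheduling overhead.
__global__ void empty_kernel() {}

struct PerKernelMicros {
    double stream_launch;  // one CPU launch per kernel
    double graph_launch;   // one CPU launch per graph of `kernels` kernels
};

static double elapsed_us(std::chrono::steady_clock::time_point start) {
    return std::chrono::duration<double, std::micro>(
               std::chrono::steady_clock::now() - start).count();
}

// Returns the average wall-clock time per kernel, in microseconds, for
// both submission styles. Assumes kernels > 0 and repeats > 0.
PerKernelMicros measure_launch_cost(int kernels, int repeats) {
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Record the kernel sequence once and turn it into an executable graph.
    cudaGraph_t graph;
    CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    for (int k = 0; k < kernels; ++k) empty_kernel<<<1, 1, 0, stream>>>();
    CHECK(cudaStreamEndCapture(stream, &graph));
    cudaGraphExec_t exec;
    CHECK(cudaGraphInstantiate(&exec, graph, 0));  // CUDA 12.x signature

    // Warm up both paths so one-time initialization is not timed.
    empty_kernel<<<1, 1, 0, stream>>>();
    CHECK(cudaGraphLaunch(exec, stream));
    CHECK(cudaStreamSynchronize(stream));

    // Path 1: every kernel is a separate CPU-side launch.
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        for (int k = 0; k < kernels; ++k) empty_kernel<<<1, 1, 0, stream>>>();
    CHECK(cudaStreamSynchronize(stream));
    const double stream_us = elapsed_us(t0);

    // Path 2: one CPU-side launch per graph replay.
    t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) CHECK(cudaGraphLaunch(exec, stream));
    CHECK(cudaStreamSynchronize(stream));
    const double graph_us = elapsed_us(t0);

    CHECK(cudaGraphExecDestroy(exec));
    CHECK(cudaGraphDestroy(graph));
    CHECK(cudaStreamDestroy(stream));

    const double total = static_cast<double>(kernels) * repeats;
    return {stream_us / total, graph_us / total};
}
```

Calling measure_launch_cost(20, 50) from a main function and printing both fields gives a per-kernel figure for each path. This is a coarse wall-clock measurement, not a substitute for a profiler timeline.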

Core Intuition

Imagine a fast-food kitchen (the GPU) operating under the direction of a cashier (the CPU). Normally, the cashier reads an order, walks to the kitchen, tells the fry cook what to do, walks back, reads the next order, and tells the grill cook. Even if the cooks are incredibly fast, they finish their task quickly and stand idle, waiting for the cashier to physically walk back and deliver the next order (Launch Overhead). CUDA Graphs act as a digital, automated display system. The cashier writes the entire day's workflow schedule exactly once (Instantiation), and the cooks execute the dependent steps autonomously based on the screen, completely eliminating the cashier's intervention for every micro-task.

Technical Deep Dive

Work submission with CUDA Graphs has three distinct stages: Definition, Instantiation, and Execution. During Definition, the developer creates a template of operations (nodes) and the dependencies between them (edges). This is done either by adding nodes explicitly (for example with cudaGraphAddKernelNode) or by bracketing existing stream-based code with cudaStreamBeginCapture and cudaStreamEndCapture. During Instantiation, the driver takes a snapshot of the graph template, validates it, and performs as much setup and initialization as it can ahead of time to minimize launch latency. The result is an executable graph, cudaGraphExec_t. During Execution, the executable graph is launched into a stream with a single call to cudaGraphLaunch, and it can be relaunched any number of times without repeating instantiation.
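The three stages can be sketched with stream capture as follows. This is a minimal illustration, not a reference implementation: the add_one kernel and the run_captured_graph function are hypothetical names, and the three-argument cudaGraphInstantiate call assumes a CUDA 12.x toolkit (earlier toolkits use a different signature).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical kernel: adds 1 to every element of x.
__global__ void add_one(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Captures `kernels_per_graph` add_one launches into a graph, replays the
// graph `launches` times, and returns x[0] (the buffer starts at zero).
float run_captured_graph(int kernels_per_graph, int launches) {
    const int n = 1 << 16;
    float* d_x = nullptr;
    CHECK(cudaMalloc(&d_x, n * sizeof(float)));
    CHECK(cudaMemset(d_x, 0, n * sizeof(float)));
    CHECK(cudaDeviceSynchronize());

    // Capture requires a non-default stream.
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Definition: launches are recorded into a template, not executed.
    cudaGraph_t graph;
    CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    for (int k = 0; k < kernels_per_graph; ++k)
        add_one<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    CHECK(cudaStreamEndCapture(stream, &graph));

    // Instantiation: validate the template and build the executable graph.
    cudaGraphExec_t exec;
    CHECK(cudaGraphInstantiate(&exec, graph, 0));  // CUDA 12.x signature

    // Execution: one CPU call per replay of the whole topology.
    for (int i = 0; i < launches; ++i) CHECK(cudaGraphLaunch(exec, stream));
    CHECK(cudaStreamSynchronize(stream));

    float first = 0.0f;
    CHECK(cudaMemcpy(&first, d_x, sizeof(float), cudaMemcpyDeviceToHost));

    CHECK(cudaGraphExecDestroy(exec));
    CHECK(cudaGraphDestroy(graph));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_x));
    return first;
}
```

With 4 kernels per graph and 3 replays, every element is incremented 12 times, so run_captured_graph(4, 3) returns 12.0f. Instantiation happens once; only the cudaGraphLaunch call sits in the loop.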

A major challenge is managing dynamic parameters. Stream capture records kernel arguments by value at capture time. If an argument changes in the next iteration (for example a sequence length or a data pointer), the instantiated graph does not see the change and keeps replaying the stale values. Recapturing and re-instantiating the whole graph for every change is expensive. A hybrid approach avoids this: it uses stream capture for the static part of the topology, obtains the graph under construction and the current capture dependencies with cudaStreamGetCaptureInfo_v2, and adds the dynamic kernels manually with cudaGraphAddKernelNode. After each manual insertion, cudaStreamUpdateCaptureDependencies makes subsequently captured work depend on the new node. Because the application keeps the handles of those nodes, it can later apply lightweight parameter updates to the instantiated graph with cudaGraphExecKernelNodeSetParams.
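A minimal sketch of the hybrid pattern is shown below. The kernels and the run_hybrid_graph function are hypothetical. The code assumes a CUDA 12.x toolkit, where cudaStreamGetCaptureInfo_v2 and the four-argument cudaStreamUpdateCaptureDependencies are available; other toolkit versions may expose these calls under different names or signatures.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical static kernel: its arguments never change.
__global__ void add_one(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Hypothetical dynamic kernel: `v` changes between graph launches.
__global__ void add_value(float* x, int n, float v) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += v;
}

// Builds add_one -> add_value -> add_one, launches it with first_value,
// updates the dynamic node to second_value, launches again, returns x[0].
float run_hybrid_graph(float first_value, float second_value) {
    int n = 1 << 12;
    float* d_x = nullptr;
    CHECK(cudaMalloc(&d_x, n * sizeof(float)));
    CHECK(cudaMemset(d_x, 0, n * sizeof(float)));
    CHECK(cudaDeviceSynchronize());
    const dim3 block(256), grid((n + 255) / 256);

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    add_one<<<grid, block, 0, stream>>>(d_x, n);  // captured, static

    // Ask the capturing stream for its graph and current dependencies.
    cudaStreamCaptureStatus status;
    cudaGraph_t capturing_graph = nullptr;
    const cudaGraphNode_t* deps = nullptr;
    size_t num_deps = 0;
    CHECK(cudaStreamGetCaptureInfo_v2(stream, &status, nullptr,
                                      &capturing_graph, &deps, &num_deps));

    // Add the dynamic kernel manually so we keep a handle to its node.
    float value = first_value;
    void* args[] = {&d_x, &n, &value};
    cudaKernelNodeParams params = {};
    params.func = (void*)add_value;
    params.gridDim = grid;
    params.blockDim = block;
    params.sharedMemBytes = 0;
    params.kernelParams = args;
    params.extra = nullptr;
    cudaGraphNode_t dynamic_node;
    CHECK(cudaGraphAddKernelNode(&dynamic_node, capturing_graph, deps,
                                 num_deps, &params));
    // Work captured after this point depends on the manual node.
    CHECK(cudaStreamUpdateCaptureDependencies(
        stream, &dynamic_node, 1, cudaStreamSetCaptureDependencies));

    add_one<<<grid, block, 0, stream>>>(d_x, n);  // captured, static
    cudaGraph_t graph;
    CHECK(cudaStreamEndCapture(stream, &graph));
    cudaGraphExec_t exec;
    CHECK(cudaGraphInstantiate(&exec, graph, 0));  // CUDA 12.x signature

    CHECK(cudaGraphLaunch(exec, stream));          // uses first_value

    // Lightweight update: no recapture, no re-instantiation.
    value = second_value;
    CHECK(cudaGraphExecKernelNodeSetParams(exec, dynamic_node, &params));
    CHECK(cudaGraphLaunch(exec, stream));          // uses second_value
    CHECK(cudaStreamSynchronize(stream));

    float first = 0.0f;
    CHECK(cudaMemcpy(&first, d_x, sizeof(float), cudaMemcpyDeviceToHost));
    CHECK(cudaGraphExecDestroy(exec));
    CHECK(cudaGraphDestroy(graph));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_x));
    return first;
}
```

Each launch adds 1, then the dynamic value, then 1 again. Starting from zero, run_hybrid_graph(10.0f, 100.0f) therefore returns 114.0f (12 after the first launch, plus 102 from the second). The kernel arguments are copied when the node is added and again when its parameters are set, which is why the update call must be repeated after changing value.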

Key Takeaways

CUDA Graphs address CPU-side launch bottlenecks by moving most of the cost of kernel dispatch from execution time to initialization time.

Stream capture makes it straightforward to convert large existing stream-based code paths into graphs, but it fixes kernel arguments by value at capture time.

To handle dynamic shapes, either update node parameters explicitly with cudaGraphExecKernelNodeSetParams or keep the pointers fixed and copy new data into the same buffers before each launch.

A launched graph runs its whole topology without per-kernel CPU involvement, which makes CUDA Graphs well suited to repetitive workloads such as autoregressive LLM decoding.
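The fixed-pointer alternative from the takeaways can be sketched as follows: the graph is captured once against two device buffers whose addresses never change, and each iteration only copies fresh input into the same buffer before replaying. The double_it kernel and the replay_with_static_buffers function are hypothetical, and the three-argument cudaGraphInstantiate call assumes a CUDA 12.x toolkit.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical kernel: out[i] = 2 * in[i].
__global__ void double_it(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Replays one captured graph for every value in `inputs`, refilling the
// same device input buffer each time. Returns the sum of out[0] over all
// iterations.
float replay_with_static_buffers(const std::vector<float>& inputs) {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    CHECK(cudaMalloc(&d_in, bytes));
    CHECK(cudaMalloc(&d_out, bytes));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Capture once. The graph stores the addresses d_in and d_out by value.
    cudaGraph_t graph;
    CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    double_it<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, n);
    CHECK(cudaStreamEndCapture(stream, &graph));
    cudaGraphExec_t exec;
    CHECK(cudaGraphInstantiate(&exec, graph, 0));  // CUDA 12.x signature

    std::vector<float> h_in(n), h_out(n);
    float sum = 0.0f;
    for (float value : inputs) {
        // New data, same pointers: no graph update is needed.
        for (int i = 0; i < n; ++i) h_in[i] = value;
        CHECK(cudaMemcpyAsync(d_in, h_in.data(), bytes,
                              cudaMemcpyHostToDevice, stream));
        CHECK(cudaGraphLaunch(exec, stream));
        CHECK(cudaMemcpyAsync(h_out.data(), d_out, bytes,
                              cudaMemcpyDeviceToHost, stream));
        CHECK(cudaStreamSynchronize(stream));
        sum += h_out[0];
    }

    CHECK(cudaGraphExecDestroy(exec));
    CHECK(cudaGraphDestroy(graph));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_out));
    return sum;
}
```

For the inputs 1, 2, and 3 the outputs are 2, 4, and 6, so the function returns 12.0f. The trade-off is that buffers must be sized for the largest shape the graph will ever see, since the captured launch configuration and pointers stay fixed.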

Internals / Execution Flow

The lifecycle of a CUDA graph spans the stream capture API and the graph execution API.

Phase	API Call	Purpose
Stream Capture	cudaStreamBeginCapture	Puts the stream into capture mode: work subsequently submitted to it is recorded into a graph template instead of being enqueued for execution.
Hybrid Modification	cudaStreamGetCaptureInfo_v2	Returns the graph under construction and the current capture dependencies so the host can add manual nodes that will be updated later.
Finalization	cudaStreamEndCapture	Ends capture and completes the definition phase, returning a finalized cudaGraph_t.
Compilation	cudaGraphInstantiate	Validates and prepares the graph topology, producing the executable cudaGraphExec_t.
Execution Loop	cudaGraphLaunch	The CPU issues a single call to execute the entire topology.
Parameter Update	cudaGraphExecKernelNodeSetParams	Updates the arguments or launch configuration of a specific kernel node without re-instantiating the graph.

Performance Implications

Deploying CUDA Graphs changes the performance profile in several ways.

Performance Metric	Impact Analysis
CPU Offloading	Replaces per-kernel CPU dispatch with a single launch per graph, closing the "gaps" in GPU execution timelines observed in profiling tools.
Instantiation Latency	The cudaGraphInstantiate call is expensive relative to a launch. Executable graphs should be cached and reused across many iterations to amortize this initialization cost.
Memory Mapping	CUDA maps physical memory aggressively during instantiation. Re-launching graphs that share exact memory spaces avoids triggering expensive remapping page faults.

Tools & Frameworks

The CUDA Graphs ecosystem builds on the low-level CUDA toolkit.

Tool / Concept	Application Context
CUDA C++ API	Manages cudaGraph_t, cudaGraphExec_t, and stream capture semantics.
PyTorch Graph API	Exposes Pythonic wrappers (torch.cuda.CUDAGraph) to abstract compilation complexity.
Nsight Systems	Visually demonstrates success by rendering a tightly packed timeline of GPU kernels completely lacking CPU launch gaps.
