AI Observability

Nsight Compute Kernel Analysis

Delivers interactive, cycle-level profiling of isolated CUDA kernels to expose hardware execution inefficiencies.

Published June 1, 2026 · By MortalApps · 5 min read · ~909 words

TL;DR

Nsight Compute delivers interactive, cycle-level profiling of isolated CUDA kernels to expose hardware execution inefficiencies.
The core purpose is to raise Streaming Multiprocessor (SM) efficiency, tune register allocation, and increase memory throughput.
The primary optimization idea is raising theoretical occupancy and aligning memory access patterns so that warps stall less often.
The most important engineering insight is the empirical Roofline model, which classifies a kernel as compute-bound or memory-bound by comparing its arithmetic intensity with the hardware's ratio of peak compute to peak memory bandwidth. That classification dictates the subsequent optimization strategy.


Why This Matters

Once Nsight Systems has confirmed that the GPU is kept continuously fed with work, Nsight Compute shows whether each kernel executes as efficiently as the hardware allows. In LLM inference serving, optimizing a kernel like FlashAttention to squeeze 10% more throughput out of the Streaming Multiprocessors raises tokens per second for the portion of the workload that kernel dominates. At global scale, that kind of micro-optimization reduces the number of physical nodes required to serve a foundation model, which improves unit economics.

Core Intuition

Kernel optimization relies on identifying the limiting hardware resource. The GPU is a massively parallel machine designed to hide memory latency through heavy thread oversubscription and near-free switching between resident warps. If warps spend most of their cycles waiting on DRAM returns (memory-bound), or if the integer and floating-point units are saturated and cannot accept more instructions (compute-bound), execution performance is capped by that resource. The mental model is to track each warp's execution state and ask whether warps are actively issuing instructions or stalled on memory and register dependencies.
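The compute-bound versus memory-bound distinction can be made concrete with the Roofline arithmetic. The following is a minimal sketch, not Nsight Compute's own implementation: the function name `classify_kernel` and all numbers in the usage lines are illustrative assumptions, and in practice the FLOP count and DRAM byte count would come from profiler metrics.

```python
from typing import NamedTuple


class RooflinePoint(NamedTuple):
    intensity: float    # FLOP per byte moved to/from DRAM
    ridge: float        # intensity at which the two roofs meet
    attainable: float   # upper bound on FLOP/s at this intensity
    bound: str          # "memory-bound" or "compute-bound"


def classify_kernel(flops: float, dram_bytes: float,
                    peak_flops_per_s: float,
                    peak_bytes_per_s: float) -> RooflinePoint:
    """Place a kernel on a simple two-roof Roofline (hypothetical helper)."""
    if dram_bytes <= 0 or peak_bytes_per_s <= 0:
        raise ValueError("byte counts and bandwidth must be positive")
    intensity = flops / dram_bytes
    ridge = peak_flops_per_s / peak_bytes_per_s
    # Attainable performance is the lower of the compute and bandwidth roofs.
    attainable = min(peak_flops_per_s, intensity * peak_bytes_per_s)
    bound = "memory-bound" if intensity < ridge else "compute-bound"
    return RooflinePoint(intensity, ridge, attainable, bound)


if __name__ == "__main__":
    # Illustrative numbers only: 2 GFLOP of work over 1 GB of DRAM traffic
    # on a device assumed to peak at 19.5 TFLOP/s and 1.555 TB/s.
    point = classify_kernel(2e9, 1e9, 19.5e12, 1.555e12)
    print(point.bound, round(point.intensity, 2), round(point.ridge, 2))
```

A kernel whose intensity sits left of the ridge point gains most from reducing or coalescing memory traffic; one to the right gains most from reducing instruction count or using faster math paths.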

Technical Deep Dive

Nsight Compute operates via an intrusive, serialized replay mechanism. Because the hardware performance monitor counters (PMCs) cannot collect the hundreds of available metrics in a single pass, the tool intercepts the kernel launch, saves the kernel's initial memory state, and replays the kernel multiple times, restoring that state before each pass and collecting a different subset of metrics each time.
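A rough cost model helps explain why replay is expensive. The sketch below is an assumption-laden simplification: the number of counters that fit in one pass is a made-up parameter (real scheduling depends on which counters can be collected together), and the function names are hypothetical.

```python
import math


def replay_passes(num_counters: int, counters_per_pass: int) -> int:
    """Passes needed if each pass can collect a fixed number of counters.

    Simplified model: assumes any counters can share a pass, which real
    hardware does not guarantee.
    """
    if counters_per_pass <= 0:
        raise ValueError("counters_per_pass must be positive")
    return math.ceil(num_counters / counters_per_pass)


def profiled_kernel_time(kernel_time_s: float, passes: int,
                         save_restore_s: float = 0.0) -> float:
    """Wall time for one profiled launch: every pass reruns the kernel
    and pays the (assumed constant) memory save/restore cost."""
    return passes * (kernel_time_s + save_restore_s)


if __name__ == "__main__":
    passes = replay_passes(num_counters=100, counters_per_pass=8)
    # A 2 ms kernel with an assumed 1 ms save/restore per pass.
    print(passes, profiled_kernel_time(0.002, passes, 0.001))
```

Even under these generous assumptions, a single profiled launch costs many times the kernel's native runtime, which is why profiling is usually restricted to a few selected launches.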

Metric Category	Source / Variable	Interpretation
Instruction Throughput	smsp__sass_thread_inst_executed_op_*	Counts thread-level instructions executed per SM sub-partition, broken down by operation type; useful for tracking math density.
Memory Throughput	dram__bytes.sum, lts__t_bytes.sum	Tracks the data volume moved through DRAM (HBM) and the L2 cache; L1 traffic is reported by separate l1tex__* metrics.
Occupancy	sm__warps_active.avg	Average number of active warps per SM; relative to the hardware maximum it gives achieved occupancy, which indicates latency-hiding potential.
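The link between register allocation and occupancy can be illustrated with a small calculation. This is a simplified sketch: the defaults of 65,536 registers and 64 resident warps per SM are assumptions that hold for some NVIDIA data-center GPUs but vary by architecture, and the model deliberately ignores shared memory, per-SM block limits, and register allocation granularity.

```python
def theoretical_occupancy(regs_per_thread: int, threads_per_block: int,
                          regs_per_sm: int = 65536,     # assumed; check your GPU
                          max_warps_per_sm: int = 64,   # assumed; check your GPU
                          warp_size: int = 32) -> float:
    """Fraction of the SM's warp slots that can be resident, limited only
    by registers and the warp-slot count (hypothetical helper)."""
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    regs_per_block = regs_per_thread * warp_size * warps_per_block
    blocks_by_regs = regs_per_sm // regs_per_block
    blocks_by_warps = max_warps_per_sm // warps_per_block
    resident_blocks = min(blocks_by_regs, blocks_by_warps)
    return resident_blocks * warps_per_block / max_warps_per_sm


if __name__ == "__main__":
    for regs in (32, 64, 128):
        print(regs, theoretical_occupancy(regs, threads_per_block=256))
```

Under these assumptions, doubling the registers per thread from 32 to 64 halves the number of resident warps. Whether that trade is worthwhile depends on how much each thread gains from keeping more values in registers.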

Key Takeaways

Kernel replay serializes execution and destroys system-level concurrency, but it is what makes detailed per-kernel hardware metrics collectable.

Roofline plots give a visual diagnosis of whether a kernel sits under the memory roof or the compute roof.

Engineers should avoid profiling whole AI applications with ncu, because replay overhead multiplies across every launch; target specific hot-path kernels instead.

SM Throughput and Memory Throughput metrics represent the maximums of their respective underlying sub-metrics.

Optimizing thread register counts frequently yields higher throughput than maximizing sheer hardware occupancy.
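The takeaway about targeting hot-path kernels can be sketched as a small command builder. The `ncu` flags used here (`--kernel-name`, `--launch-skip`, `--launch-count`, `--metrics`, `-o`) are standard Nsight Compute CLI options; the application command, kernel name, and report name are hypothetical placeholders, and the helper itself is only an illustration.

```python
import shlex
from typing import Iterable, List, Sequence


def build_ncu_command(app_argv: Sequence[str], kernel_name: str,
                      launch_skip: int = 0, launch_count: int = 1,
                      metrics: Iterable[str] = (),
                      report: str = "report") -> List[str]:
    """Assemble an ncu invocation restricted to a few launches of one kernel."""
    cmd = [
        "ncu",
        "--kernel-name", kernel_name,         # profile only this kernel
        "--launch-skip", str(launch_skip),    # skip warm-up launches
        "--launch-count", str(launch_count),  # then profile this many
        "-o", report,                         # write a report file
    ]
    metric_list = list(metrics)
    if metric_list:
        cmd += ["--metrics", ",".join(metric_list)]
    return cmd + list(app_argv)


if __name__ == "__main__":
    cmd = build_ncu_command(
        ["python", "serve.py"],               # hypothetical application
        kernel_name="flash_fwd_kernel",       # hypothetical kernel name
        launch_skip=10,
        metrics=["dram__bytes.sum", "lts__t_bytes.sum"],
    )
    print(shlex.join(cmd))
```

Skipping the first launches avoids measuring warm-up behavior, and limiting the launch count keeps the replay overhead bounded.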
