Nsight Compute Kernel Analysis
Delivers interactive, cycle-level profiling of isolated CUDA kernels to expose hardware execution inefficiencies.
Source: mortalapps.com
- The core purpose is to maximize Streaming Multiprocessor (SM) efficiency, optimize register allocation, and maximize memory throughput.
- The primary optimization idea is to raise occupancy and coalesce memory accesses so that fewer warps stall.
- The most important engineering insight is the empirical Roofline model: it compares a kernel's arithmetic intensity (FLOPs per byte of memory traffic) with the hardware's ratio of peak compute to peak memory bandwidth, classifies the kernel as compute-bound or memory-bound, and so dictates the subsequent optimization strategy.
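The Roofline classification described above can be sketched in a few lines. This is a minimal illustration, not Nsight Compute's own implementation: the function name `classify_roofline`, the peak figures, and the kernel's FLOP and byte counts are all hypothetical values chosen for the example.

```python
def classify_roofline(flops, dram_bytes, peak_flops_per_s, peak_bytes_per_s):
    """Place a kernel on the Roofline and report which roof limits it."""
    arithmetic_intensity = flops / dram_bytes           # FLOP per byte moved
    ridge_point = peak_flops_per_s / peak_bytes_per_s   # intensity where the two roofs meet
    # Attainable performance is capped by the lower of the two roofs.
    attainable = min(peak_flops_per_s, arithmetic_intensity * peak_bytes_per_s)
    bound = "compute-bound" if arithmetic_intensity >= ridge_point else "memory-bound"
    return arithmetic_intensity, ridge_point, attainable, bound


# Hypothetical device and kernel numbers, chosen only for illustration.
PEAK_FLOPS = 20e12   # 20 TFLOP/s (assumed)
PEAK_BW = 2e12       # 2 TB/s (assumed)

ai, ridge, attainable, bound = classify_roofline(
    flops=2e9, dram_bytes=1e9,
    peak_flops_per_s=PEAK_FLOPS, peak_bytes_per_s=PEAK_BW,
)
print(f"intensity={ai:.1f} FLOP/B, ridge={ridge:.1f} FLOP/B, "
      f"attainable={attainable / 1e12:.1f} TFLOP/s -> {bound}")
```

With these assumed numbers the kernel performs 2 FLOPs per byte, well below the 10 FLOP/byte ridge point, so it lands under the bandwidth roof and is classified as memory-bound.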
Why This Matters
Once Nsight Systems has confirmed that the GPU is kept continuously supplied with work, Nsight Compute shows whether each kernel executes as efficiently as the hardware allows. In LLM inference serving, tuning a kernel such as FlashAttention to extract 10% more throughput from the Streaming Multiprocessors raises tokens-per-second for every request that runs through it. At global scale, that gain reduces the number of physical nodes required to serve a foundation model and improves unit economics.
Core Intuition
Kernel optimization starts with identifying the limiting hardware resource. The GPU is a massively parallel machine designed to hide memory latency by oversubscribing each SM with threads: whenever one warp stalls, the warp scheduler switches to another resident warp at essentially no cost. Performance degrades when too many warps are stalled waiting on DRAM returns (memory-bound) or when the integer and floating-point units are saturated (compute-bound). The mental model is to track each warp's execution state and ask whether warps are actively issuing instructions or stalled, for example on memory requests or register dependencies.
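How much oversubscription an SM can sustain is bounded by its per-SM resources. The sketch below estimates theoretical occupancy from a kernel's launch configuration. It is a simplified model: the hardware limits used as defaults (64 warps, 65536 registers, 96 KiB of shared memory, and 32 blocks per SM) are assumptions that vary by architecture, and the calculation ignores register and shared-memory allocation granularity.

```python
def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          max_warps_per_sm=64, regs_per_sm=65536,
                          smem_per_sm=98304, max_blocks_per_sm=32, warp_size=32):
    """Estimate the fraction of an SM's warp slots a kernel can fill.

    All hardware limits are assumed defaults, not values queried from a device.
    """
    warps_per_block = -(-threads_per_block // warp_size)   # ceiling division

    # How many blocks fit under each per-SM resource limit.
    limit_warps = max_warps_per_sm // warps_per_block
    limit_regs = regs_per_sm // (regs_per_thread * warps_per_block * warp_size)
    limit_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks_per_sm

    # The scarcest resource decides how many blocks are resident at once.
    resident_blocks = min(limit_warps, limit_regs, limit_smem, max_blocks_per_sm)
    return resident_blocks * warps_per_block / max_warps_per_sm


# Halving register use doubles the number of resident warps in this model.
print(theoretical_occupancy(256, regs_per_thread=64, smem_per_block=0))
print(theoretical_occupancy(256, regs_per_thread=32, smem_per_block=0))
```

In this model a 256-thread block using 64 registers per thread is limited by the register file, which is why reducing register pressure is a common first lever for giving the scheduler more warps to switch between.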
Technical Deep Dive
Nsight Compute operates via a highly intrusive, serialized replay mechanism. Because the hardware performance counters can collect only a limited subset of the hundreds of available metrics in a single pass, the tool intercepts the kernel launch, saves the device memory state, and replays the kernel multiple times, restoring that state before each pass and collecting a different group of metrics each time.
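The save, replay, and restore cycle can be illustrated with a toy model in plain Python. This is a conceptual sketch only, not how Nsight Compute is implemented: `toy_kernel`, `profile_with_replay`, the counter names, and the `counters_per_pass` limit are all hypothetical.

```python
import copy


def toy_kernel(memory):
    """Stand-in for a GPU kernel: doubles a buffer in place and reports counters."""
    input_sum = sum(memory["x"])
    memory["x"] = [2 * v for v in memory["x"]]
    return {
        "inst_executed": len(memory["x"]),
        "bytes_written": 4 * len(memory["x"]),
        "input_sum": input_sum,
    }


def profile_with_replay(kernel, memory, metric_names, counters_per_pass):
    """Collect all metrics by replaying the kernel, a few counters per pass."""
    snapshot = copy.deepcopy(memory)              # save state before the first pass
    collected = {}
    passes = 0
    for start in range(0, len(metric_names), counters_per_pass):
        group = metric_names[start:start + counters_per_pass]
        memory.clear()
        memory.update(copy.deepcopy(snapshot))    # restore so every pass sees identical inputs
        observed = kernel(memory)
        for name in group:                        # only this pass's group is "visible"
            collected[name] = observed[name]
        passes += 1
    return collected, passes


memory = {"x": [1, 2, 3, 4]}
metrics, passes = profile_with_replay(
    toy_kernel, memory,
    ["inst_executed", "bytes_written", "input_sum"],
    counters_per_pass=1,
)
print(metrics, passes, memory)
```

Because the kernel modifies its input in place, skipping the restore step would make `input_sum` differ between passes. Restoring the snapshot is what lets metrics gathered in separate passes describe the same execution, and it is also why profiling takes several times longer than a normal run.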
| Metric Category | Source / Variable | Interpretation |
|---|---|---|
| Instruction Throughput | smsp__sass_thread_inst_executed_op_* | Represents total opcodes dispatched per SM sub-partition, useful for tracking math density. |
| Memory Throughput | dram__bytes.sum, lts__t_bytes.sum | Tracks the precise data volume crossing the HBM, L2, and L1 physical interfaces. |
| Occupancy | sm__warps_active.avg | Defines the ratio of resident warps to the theoretical maximum, indicating latency hiding potential. |
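
Raw byte counters such as dram__bytes.sum and lts__t_bytes.sum become easier to interpret once they are divided by kernel duration. The sketch below shows that arithmetic, assuming the duration comes from a metric such as gpu__time_duration.sum (reported in nanoseconds, and not part of the table above). The input values and the peak bandwidth are hypothetical, and `summarize_memory_metrics` is an illustrative helper, not part of any NVIDIA tool.

```python
def summarize_memory_metrics(dram_bytes, l2_bytes, duration_ns, peak_dram_gbps):
    """Turn raw byte counters into achieved bandwidth figures."""
    dram_gbps = dram_bytes / duration_ns   # bytes per nanosecond equals GB/s
    l2_gbps = l2_bytes / duration_ns
    return {
        "dram_gbps": dram_gbps,
        "l2_gbps": l2_gbps,
        "dram_pct_of_peak": 100.0 * dram_gbps / peak_dram_gbps,
    }


# Hypothetical values standing in for an exported profiling report.
report = summarize_memory_metrics(
    dram_bytes=4.0e9,        # dram__bytes.sum
    l2_bytes=1.0e10,         # lts__t_bytes.sum
    duration_ns=2.5e6,       # kernel duration, assumed 2.5 ms
    peak_dram_gbps=2000.0,   # assumed peak DRAM bandwidth
)
print(report)
```

A kernel that already reaches a large share of peak DRAM bandwidth, as in this made-up report, has little headroom left on the memory side, so further gains would have to come from moving fewer bytes rather than moving them faster.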