GPU Utilization and Occupancy Tracking
Tracks hardware saturation physically at the Streaming Multiprocessor (SM) level.
Source: mortalapps.com
- The core purpose is determining if GPU compute resources are fundamentally underutilized despite appearing busy to the OS.
- The primary optimization idea relies on increasing concurrent warp execution to hide instruction and memory latency effectively.
- The most important engineering insight is that the standard nvidia-smi "GPU-Util" metric is misleading: it measures only the fraction of time during which any kernel was executing (temporal activity) and says nothing about how many SMs were busy (spatial saturation).
Why This Matters
Underutilized GPUs represent massive capital burn. If an expensive cluster of H100s operates at only 20% spatial utilization, 80% of the hardware investment is generating heat rather than reducing the training loss. Accurate occupancy tracking also determines whether scaling out distributed training is financially viable: if individual nodes are not spatially saturated, adding nodes mostly adds network overhead and yields negative returns on the infrastructure investment.
Core Intuition
A "GPU-Util" reading of 100% simply means that at least one kernel, however small, was executing for the whole sample window. It says nothing about how much of the chip that kernel used. It is akin to declaring a 10-lane highway fully utilized because a single car is always driving on it. True utilization, known as SM Efficiency, measures how many lanes are actually occupied simultaneously. Occupancy refers specifically to the ratio of active warps resident on an SM to the maximum number of warps the SM can physically support. Maintaining high occupancy is the primary mechanism GPUs use to hide latency: when one warp stalls on memory or an instruction dependency, the scheduler switches to another resident warp at almost no cost.
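The gap between the two readings can be illustrated with a toy calculation. This is a minimal sketch, not how nvidia-smi or DCGM compute their metrics: the trace of "active SMs per sample" and the function names `temporal_utilization` and `spatial_utilization` are hypothetical, chosen only to mirror the highway analogy above.

```python
from typing import Sequence


def temporal_utilization(active_sms_per_sample: Sequence[int]) -> float:
    """Fraction of samples in which at least one SM was busy (GPU-Util style)."""
    if not active_sms_per_sample:
        return 0.0
    busy_samples = sum(1 for n in active_sms_per_sample if n > 0)
    return busy_samples / len(active_sms_per_sample)


def spatial_utilization(active_sms_per_sample: Sequence[int], total_sms: int) -> float:
    """Average fraction of SMs that were busy across all samples."""
    if not active_sms_per_sample:
        return 0.0
    return sum(active_sms_per_sample) / (len(active_sms_per_sample) * total_sms)


# Hypothetical trace: a 10-SM device where exactly one SM is busy in every sample,
# i.e. one car on a 10-lane highway.
trace = [1] * 8
print(temporal_utilization(trace))     # 1.0 -> looks "100% busy"
print(spatial_utilization(trace, 10))  # 0.1 -> only 10% of the lanes are in use
```

The same trace reports full temporal utilization and 10% spatial utilization, which is the discrepancy the section describes.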
Technical Deep Dive
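NVIDIA GPU architecture relies on the GigaThread Engine to distribute thread blocks across SMs. Each SM has a strictly finite number of registers, a fixed amount of shared memory, and a fixed number of thread slots. Whichever of these resources a kernel exhausts first caps the number of blocks that can be resident on the SM at once, and therefore caps occupancy.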
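How these per-SM limits translate into an occupancy ceiling can be sketched with a simplified calculator. The limits in `SMLimits` are illustrative assumptions, not values for any specific GPU, and the sketch ignores register and shared-memory allocation granularity, so treat the result as a rough upper bound rather than what a profiler would report.

```python
from dataclasses import dataclass

WARP_SIZE = 32


@dataclass(frozen=True)
class SMLimits:
    # Illustrative per-SM limits only; real values depend on the architecture.
    max_warps: int = 64
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 65536


def theoretical_occupancy(
    threads_per_block: int,
    regs_per_thread: int,
    smem_per_block: int,
    limits: SMLimits = SMLimits(),
) -> float:
    """Resident warps / max warps, limited by the scarcest SM resource."""
    if threads_per_block <= 0:
        raise ValueError("threads_per_block must be positive")
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division

    # How many blocks fit under each individual limit.
    by_warps = limits.max_warps // warps_per_block
    regs_per_block = regs_per_thread * warps_per_block * WARP_SIZE
    by_regs = limits.registers // regs_per_block if regs_per_block else limits.max_blocks
    by_smem = limits.shared_mem_bytes // smem_per_block if smem_per_block else limits.max_blocks

    resident_blocks = min(limits.max_blocks, by_warps, by_regs, by_smem)
    return resident_blocks * warps_per_block / limits.max_warps


# 256 threads per block at 64 registers per thread: registers allow only 4 blocks.
print(theoretical_occupancy(256, 64, 0))  # 0.5
```

Under these assumed limits, halving register use to 32 per thread lifts the ceiling to 1.0, while requesting 32 KiB of shared memory per block drops it to 0.25.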
| Metric | Source | Definition / High-Signal Interpretation |
|---|---|---|
| GPU-Util | nvidia-smi | Percent of time over a sample period where >0 kernels were executing. Highly deceptive. |
| SM Efficiency | DCGM / ncu | Percentage of SMs actively processing warps. Low efficiency dictates poor parallelization. |
| Achieved Occupancy | Nsight Compute | The real ratio of active warps. Determines the hardware's latency hiding capability. |
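For reference, the first row of the table can be read programmatically. The sketch below shells out to `nvidia-smi` with its `--query-gpu=utilization.gpu` CSV query; the helper names `parse_gpu_util` and `read_gpu_util` are hypothetical, and the parser assumes one value per GPU per line. It returns only the temporal signal, so it should be paired with DCGM or Nsight Compute data before drawing conclusions about saturation.

```python
import subprocess
from typing import List, Optional

# Query the temporal "GPU-Util" percentage for every visible GPU.
QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]


def parse_gpu_util(csv_text: str) -> List[Optional[int]]:
    """Parse one percentage per line; non-numeric entries become None."""
    values: List[Optional[int]] = []
    for line in csv_text.splitlines():
        field = line.strip()
        if not field:
            continue
        values.append(int(field) if field.isdigit() else None)
    return values


def read_gpu_util() -> List[Optional[int]]:
    """Run nvidia-smi and return GPU-Util per GPU (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_util(result.stdout)
```

Calling `read_gpu_util()` on a machine with the NVIDIA driver installed returns one entry per GPU; a value of 100 here is still compatible with very low SM Efficiency.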