
GPU Utilization and Occupancy Tracking

Tracks how fully the hardware is saturated at the Streaming Multiprocessor (SM) level.

Source: mortalapps.com
TL;DR
  • Tracks hardware saturation at the Streaming Multiprocessor (SM) level.
  • The goal is to determine whether GPU compute resources are underutilized even though the GPU appears busy to the driver and OS.
  • The main optimization lever is keeping more warps resident and runnable at once, so the SM can hide instruction and memory latency.
  • The key engineering insight is that the nvidia-smi "GPU-Util" metric is misleading: it measures only temporal activity (whether any kernel was running) and says nothing about spatial saturation (how much of the chip was in use).

Why This Matters

Underutilized GPUs burn capital. If a cluster of H100s runs at only 20% spatial utilization, roughly 80% of the hardware investment sits idle rather than reducing training loss. Accurate occupancy tracking determines whether scaling out distributed training is financially viable: if individual nodes are not spatially saturated, adding nodes mostly adds network overhead and yields diminishing or negative returns on the infrastructure spend.

Core Intuition

A "GPU-Util" reading of 100% means only that at least one kernel, however small, was executing throughout the driver's sample window. It is like declaring a 10-lane highway fully utilized because a single car is always driving on it. Spatial utilization, reported as SM Efficiency, measures how many lanes are actually occupied at once. Occupancy is the ratio of active warps resident on an SM to the maximum number of warps that SM can support. High occupancy is the primary mechanism GPUs use to hide latency: when one warp stalls, the warp scheduler switches to another resident warp that is ready to issue.
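The gap between the temporal and spatial readings can be made concrete with a toy model. The sketch below is not a real profiler API: `Sample`, `temporal_util`, and `spatial_util` are hypothetical names, the SM count of 132 is an assumed figure for illustration, and the timeline is made-up data.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One hypothetical profiler tick: how many SMs had at least one active warp."""
    active_sms: int


def temporal_util(samples):
    """Fraction of ticks in which any kernel was running (a GPU-Util-style view)."""
    busy = sum(1 for s in samples if s.active_sms > 0)
    return busy / len(samples)


def spatial_util(samples, total_sms):
    """Average fraction of SMs that were busy per tick (an SM-Efficiency-style view)."""
    return sum(s.active_sms for s in samples) / (len(samples) * total_sms)


TOTAL_SMS = 132  # assumed SM count, for illustration only

# A small kernel that keeps only 4 SMs busy on every tick (made-up data).
timeline = [Sample(active_sms=4) for _ in range(10)]

print(f"temporal: {temporal_util(timeline):.0%}")
print(f"spatial:  {spatial_util(timeline, TOTAL_SMS):.1%}")
```

With this timeline the temporal reading is 100% while the spatial reading is about 3%: the single car on the highway.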

Technical Deep Dive

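NVIDIA GPU architecture relies on the GigaThread Engine to distribute thread blocks to SMs. Each SM has a finite number of registers, a finite amount of shared memory, and a finite number of thread (warp) slots, so the number of blocks that can be resident at once is set by whichever resource a kernel exhausts first.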
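How these per-SM limits cap occupancy can be sketched with a simplified calculator. The function name `theoretical_occupancy` is hypothetical, and the default limits (64 warps, 32 blocks, 65,536 registers, and 100 KB of shared memory per SM) are illustrative assumptions rather than the specification of any particular GPU. The sketch also ignores register and shared memory allocation granularity, so real tools may report slightly different numbers.

```python
def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          max_warps_per_sm=64, max_blocks_per_sm=32,
                          regs_per_sm=65536, smem_per_sm=100 * 1024,
                          warp_size=32):
    """Return (occupancy, limiting_resource) for one launch configuration.

    All per-SM limits are assumed defaults; substitute the real device limits.
    """
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division

    # How many blocks each resource would allow to be resident on one SM.
    limits = {
        "warp slots": max_warps_per_sm // warps_per_block,
        "block slots": max_blocks_per_sm,
        "registers": regs_per_sm // (regs_per_thread * warps_per_block * warp_size),
    }
    if smem_per_block > 0:
        limits["shared memory"] = smem_per_sm // smem_per_block

    # The scarcest resource decides how many blocks actually fit.
    limiter = min(limits, key=limits.get)
    resident_blocks = limits[limiter]
    occupancy = resident_blocks * warps_per_block / max_warps_per_sm
    return occupancy, limiter


# 256 threads per block, 64 registers per thread, no shared memory.
print(theoretical_occupancy(256, 64, 0))
```

Under these assumed limits, a kernel using 64 registers per thread fits only four 256-thread blocks per SM, so registers cap theoretical occupancy at 50% even though warp slots are still free.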

| Metric | Source | Definition / High-Signal Interpretation |
|---|---|---|
| GPU-Util | nvidia-smi | Percent of time over a sample period where >0 kernels were executing. Highly deceptive. |
| SM Efficiency | DCGM / ncu | Percentage of SMs actively processing warps. Low efficiency indicates poor parallelization. |
| Achieved Occupancy | Nsight Compute | The real ratio of active warps. Determines the hardware's latency hiding capability. |

Key Takeaways

  • 100% GPU-Util is a necessary but not sufficient condition for optimized performance.
  • SM Efficiency measures spatial utilization (parallelism) across the hardware.
  • Occupancy determines the hardware's ability to hide latency by switching between resident warps.
  • Register spilling can negate the theoretical benefits of forcing high occupancy, because spilled values are served from slower local memory.
  • Total power draw is a useful sanity check for confirming true compute saturation.
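
The power-draw sanity check can be sketched as a small parser over one line of `nvidia-smi --query-gpu=utilization.gpu,power.draw,power.limit --format=csv,noheader,nounits` output. The helper names and both thresholds are hypothetical choices for illustration, not vendor guidance, and the sample line is made up rather than measured. Some devices report `[N/A]` for power fields, which this minimal version does not handle.

```python
def parse_gpu_line(line):
    """Parse one CSV line: utilization.gpu (%), power.draw (W), power.limit (W)."""
    util, draw, limit = (float(field) for field in line.split(","))
    return util, draw, limit


def looks_spatially_idle(util, draw, limit, util_floor=90.0, power_floor=0.5):
    """Flag a GPU that reports high GPU-Util while drawing little of its power limit.

    util_floor and power_floor are hypothetical thresholds; tune them per fleet.
    """
    return util >= util_floor and (draw / limit) < power_floor


sample = "100, 182.45, 700.00"  # made-up reading, not a measurement
util, draw, limit = parse_gpu_line(sample)
print(looks_spatially_idle(util, draw, limit))
```

A GPU that reports 100% GPU-Util while drawing about a quarter of its power limit is flagged here as a candidate for closer inspection with SM Efficiency and Achieved Occupancy.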