AI Observability

GPU Utilization and Occupancy Tracking

Tracks hardware saturation physically at the Streaming Multiprocessor (SM) level.

Published June 1, 2026 · By MortalApps · 5 min read · ~826 words

TL;DR

Occupancy tracking measures hardware saturation at the Streaming Multiprocessor (SM) level.
The core purpose is to determine whether GPU compute resources are underutilized despite appearing busy to the OS.
The primary optimization idea is to increase the number of concurrently resident warps so the scheduler can hide instruction and memory latency.
The most important engineering insight is that the standard nvidia-smi "GPU-Util" metric is deceptive: it measures only temporal activity (whether any kernel was running) and ignores spatial saturation (how much of the chip was doing work).


Why This Matters

Underutilized GPUs are a major capital cost. If a cluster of H100s runs at only 20% spatial utilization, roughly 80% of the hardware investment is generating heat rather than reducing the training loss. Accurate occupancy tracking also informs whether scaling out distributed training is financially viable: if individual nodes are not spatially saturated, adding nodes mostly adds network overhead, and the return on the extra infrastructure can be negative.

Core Intuition

A "GPU-Util" reading of 100% means only that at least one kernel, however small, was executing throughout the sample window. It is akin to declaring a 10-lane highway fully utilized because a single car is always driving on it. Spatial utilization, reported as SM Efficiency, measures how many lanes are occupied at the same time. Occupancy refers specifically to the ratio of active warps resident on an SM to the maximum number of warps that SM can support. High occupancy is the primary mechanism GPUs use to hide latency: when one warp stalls, the scheduler switches to another resident warp that is ready to issue.
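The gap between the temporal and spatial views can be made concrete with a toy model. The sketch below is illustrative only: `KernelInterval`, `temporal_util`, and `sm_activity` are hypothetical names, the 10-SM device mirrors the highway analogy rather than any real GPU, and kernels are assumed not to overlap on the same SMs.

```python
from dataclasses import dataclass


@dataclass
class KernelInterval:
    start_ms: float
    end_ms: float
    busy_sms: int  # SMs with at least one active warp while this kernel runs


def temporal_util(intervals, window_ms):
    """GPU-Util style: fraction of the window in which any kernel was running."""
    covered = 0.0
    last_end = 0.0
    for k in sorted(intervals, key=lambda k: k.start_ms):
        start = max(k.start_ms, last_end)  # skip time already counted
        end = min(k.end_ms, window_ms)     # clamp to the sample window
        if end > start:
            covered += end - start
            last_end = end
    return covered / window_ms


def sm_activity(intervals, window_ms, total_sms):
    """Spatial view: time-weighted fraction of SMs doing work in the window.

    Assumes kernels do not share SMs, so busy SM-time simply adds up.
    """
    busy_sm_time = 0.0
    for k in intervals:
        start = max(k.start_ms, 0.0)
        end = min(k.end_ms, window_ms)
        if end > start:
            busy_sm_time += (end - start) * k.busy_sms
    return busy_sm_time / (window_ms * total_sms)


def occupancy(active_warps, max_warps):
    """Ratio of resident active warps to the SM's maximum."""
    return active_warps / max_warps


# One small kernel occupies a single SM for the whole 100 ms window.
single_car = [KernelInterval(0.0, 100.0, busy_sms=1)]
print(temporal_util(single_car, 100.0))             # temporal view
print(sm_activity(single_car, 100.0, total_sms=10)) # spatial view
```

In this model the single long-running kernel saturates the temporal metric while leaving nine of the ten SMs idle, which is the "one car on the highway" situation.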

Technical Deep Dive

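NVIDIA GPUs rely on the GigaThread Engine to distribute thread blocks across SMs. Each SM has a strictly finite number of registers, a finite amount of shared memory, and a finite number of thread slots. A block becomes resident only if all of its resources fit, so whichever resource runs out first caps the number of resident warps and therefore the achievable occupancy.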
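How these per-SM limits bound occupancy can be sketched with a small calculator. This is a simplified model, not NVIDIA's occupancy calculator: the `SMLimits` defaults are illustrative assumptions rather than the specification of a particular GPU, and hardware allocation granularity for registers and shared memory is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    # Illustrative per-SM limits (assumptions, not a specific architecture).
    max_warps: int = 64
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 98304
    warp_size: int = 32


def theoretical_occupancy(threads_per_block, regs_per_thread,
                          smem_per_block, sm=SMLimits()):
    """Resident warps / max warps, limited by the scarcest SM resource."""
    # Round the block up to whole warps.
    warps_per_block = -(-threads_per_block // sm.warp_size)

    # How many blocks fit under each individual limit.
    by_warps = sm.max_warps // warps_per_block
    regs_per_block = regs_per_thread * warps_per_block * sm.warp_size
    by_regs = sm.registers // regs_per_block if regs_per_block else sm.max_blocks
    by_smem = sm.shared_mem_bytes // smem_per_block if smem_per_block else sm.max_blocks

    # The tightest limit decides how many blocks are resident.
    blocks = min(sm.max_blocks, by_warps, by_regs, by_smem)
    return blocks * warps_per_block / sm.max_warps


# Light kernel: 256 threads, 32 registers per thread, no shared memory.
print(theoretical_occupancy(256, 32, 0))
# Register-heavy kernel: same block size, 128 registers per thread.
print(theoretical_occupancy(256, 128, 0))
```

Under these assumed limits, quadrupling register use per thread cuts the number of resident blocks from eight to two, so occupancy falls to a quarter even though the launch configuration is unchanged.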

Metric	Source	Definition	High-Signal Interpretation
GPU-Util	nvidia-smi	Percent of time over a sample period where >0 kernels were executing.	Highly deceptive.
SM Efficiency	DCGM / ncu	Percentage of SMs actively processing warps.	Low efficiency dictates poor parallelization.
Achieved Occupancy	Nsight Compute	The real ratio of active warps.	Determines the hardware's latency hiding capability.

Key Takeaways

100% GPU-Util is a necessary but entirely insufficient condition for optimized performance.

SM Efficiency measures the critical spatial utilization (parallelism) across the hardware.

Occupancy dictates the hardware's inherent ability to hide latency via warp context switching.

Total power draw acts as the ultimate sanity check for confirming true compute saturation.
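The power-draw sanity check can be sketched as a simple comparison between reported utilization and the fraction of the power limit actually drawn. The thresholds (`util_floor`, `power_floor`) and the helper names are hypothetical and would need tuning per workload and GPU model. `read_gpu` assumes the third-party `pynvml` bindings and an NVIDIA driver are available; it is imported lazily so the pure check works without them.

```python
def power_fraction(power_w, limit_w):
    """Fraction of the enforced power limit currently being drawn."""
    if limit_w <= 0:
        raise ValueError("power limit must be positive")
    return power_w / limit_w


def suspicious_util(gpu_util_pct, power_w, limit_w,
                    util_floor=90.0, power_floor=0.5):
    """True when GPU-Util looks high but power draw is far below the limit.

    Both floors are illustrative assumptions, not vendor guidance.
    """
    return (gpu_util_pct >= util_floor
            and power_fraction(power_w, limit_w) < power_floor)


def read_gpu(index=0):
    """Read (GPU-Util %, power in W, power limit in W) through NVML."""
    import pynvml  # third-party bindings; requires an NVIDIA driver

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        # NVML reports power in milliwatts.
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
        limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0
        return util, power_w, limit_w
    finally:
        pynvml.nvmlShutdown()


# Example with made-up readings: 100% GPU-Util at 150 W of a 700 W limit.
print(suspicious_util(100, 150.0, 700.0))
```

A flagged reading is only a prompt to profile further with SM-level metrics; low power draw can have other causes, such as a memory-bound workload.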
