GPU Occupancy and Register Allocation
Occupancy is the ratio of active warps to the theoretical maximum warps per SM. Hard physical limits: 64K registers per SM (256 KB), max 255 registers per
Source: mortalapps.com- Occupancy is the ratio of active warps to the theoretical maximum warps per SM.
- Hard physical limits: 64K registers per SM (256 KB), max 255 registers per thread.
- Max 32 thread blocks per SM (Compute 10.0/12.0).
- Blackwell's TMEM acts as a relief valve for register pressure.
Why This Matters
High occupancy is the only way a GPU hides memory latency. If you compile a kernel that demands 128 registers per thread, the math physically dictates that fewer threads can reside on the SM. If occupancy drops too low, the scheduler runs out of warps to swap to during memory fetches, leaving the $30,000 GPU effectively idle.
Core Intuition
If you have a hotel (the SM) with,000 beds (registers) and you limit each guest (thread) to a maximum of 255 beds. If every guest demands 200 beds, you can only host 320 guests, even if the hotel lobby (scheduler capacity) allows for 2048 people. To get more guests in, you must convince them to need fewer beds.
Technical Deep Dive
The B200 architecture (Compute 10.0) provisions a maximum of 32 concurrent thread blocks per SM. The Register File (RF) is 64K 32-bit registers (256 KB). In architectures like Ampere and Hopper, WGMMA instructions accumulated the large matrix multiplication results directly into the thread registers. This created immense register pressure. Blackwell introduces a paradigm shift: TMEM (256 KB per SM). By routing the accumulation directly into the dedicated Tensor Memory via tcgen05.mma, the thread registers are unburdened. This allows compilers to reallocate those 255 registers toward software pipelining and complex fused epilogues without hitting the occupancy ceiling.