CUDA Warp Scheduling and Divergence
Source: mortalapps.com
- Warp schedulers hide latency by rapidly swapping active warps.
- Dependencies between instructions cause Scoreboard Stalls (Long vs. Short).
- Blackwell's scheduler breaks the rigid warp-synchronous model, optimizing for high Instruction-Level Parallelism (ILP).
- DEPBAR (Dependency Barrier) and B-registers manage fine-grained SASS dependencies.
Why This Matters
A GPU is a throughput machine, not a latency machine: it tolerates slow operations by overlapping them with other work rather than by making each one fast. A single memory fetch from HBM takes on the order of ~1000 cycles on Hopper. If the warp scheduler cannot find another eligible warp to issue during those cycles, the SM's execution units sit idle and utilization drops, wasting expensive compute capacity.
Core Intuition
Imagine a chess master playing 64 simultaneous games. The master (SM execution unit) makes a move (computes), then walks to the next board (context switches to another warp) while the opponent thinks (memory latency). If opponents think too long and no boards are ready, the master stands idle. The Warp Scheduler's job is to ensure there is always a board ready for the master.
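The analogy can be turned into a toy simulation. The sketch below is not a model of any real scheduler: it assumes a single issue slot per cycle, a fixed-priority pick of the first ready warp, and warps that each issue `compute_cycles` instructions and then wait `memory_latency` cycles on a load. The function name and its parameters are hypothetical, chosen only for illustration.

```python
def issue_utilization(num_warps, compute_cycles, memory_latency, total_cycles=20_000):
    """Fraction of cycles in which the (toy) scheduler issued an instruction."""
    ready_at = [0] * num_warps                 # cycle at which each warp may issue again
    remaining = [compute_cycles] * num_warps   # instructions left before the next load
    busy = 0
    for cycle in range(total_cycles):
        # The "master" walks the boards and plays the first one that is ready.
        for w in range(num_warps):
            if ready_at[w] <= cycle:
                busy += 1
                remaining[w] -= 1
                if remaining[w] == 0:
                    # The warp issued its load; it is blocked until the data returns.
                    ready_at[w] = cycle + 1 + memory_latency
                    remaining[w] = compute_cycles
                break
        # If no warp was ready, this cycle is idle: the master stands still.
    return busy / total_cycles


if __name__ == "__main__":
    for n in (1, 64, 101):
        print(n, issue_utilization(n, compute_cycles=10, memory_latency=1000))
```

Under these assumptions a single warp keeps the issue slot busy about 1% of the time, 64 warps about 64%, and roughly 101 warps are needed before the idle cycles disappear. The exact figures depend entirely on the assumed ratio of compute to latency; the point is only that utilization scales with the number of ready "boards" until the latency is fully covered.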
Technical Deep Dive
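Each warp has 6 hardware scoreboards (dependency counters) and at least 16 B-registers for managing control flow. Dependencies on variable-latency instructions are encoded at compile time rather than discovered at fetch: the compiler explicitly tags each such instruction with a scoreboard identifier in the SASS binary, and the instructions that consume its result are marked to wait on that scoreboard before they can issue.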
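The counter mechanism can be illustrated with a small software model. This is a sketch under stated assumptions, not NVIDIA's implementation: the class `WarpScoreboard` and its methods are hypothetical names, and the increment-on-issue, decrement-on-completion behaviour and the threshold form of the wait (loosely modelled on `DEPBAR`) are assumptions about semantics that are not documented in detail.

```python
NUM_SCOREBOARDS = 6  # per-warp dependency counters, as stated in the text


class WarpScoreboard:
    """Toy model of one warp's scoreboards (hypothetical, for illustration)."""

    def __init__(self):
        # pending[i] = number of in-flight instructions tagged with scoreboard i
        self.pending = [0] * NUM_SCOREBOARDS

    def issue(self, set_sb=None):
        """Issue an instruction; if the compiler tagged it, bump that counter."""
        if set_sb is not None:
            self.pending[set_sb] += 1

    def complete(self, sb):
        """A tagged instruction's result arrived; release one dependency."""
        if self.pending[sb] == 0:
            raise ValueError(f"scoreboard {sb} has nothing in flight")
        self.pending[sb] -= 1

    def can_issue(self, wait_on=()):
        """A consumer may issue only when every scoreboard it waits on is clear."""
        return all(self.pending[sb] == 0 for sb in wait_on)

    def barrier_passes(self, sb, threshold=0):
        """Assumed DEPBAR-like check: pass once at most `threshold` are in flight."""
        return self.pending[sb] <= threshold


warp = WarpScoreboard()
warp.issue(set_sb=0)   # e.g. a global load tagged with scoreboard 0
warp.issue(set_sb=0)   # a second load sharing the same scoreboard
warp.issue()           # fixed-latency instruction: no scoreboard needed
stalled = not warp.can_issue(wait_on=[0])      # a consumer of the loads must wait
warp.complete(0)
partial = warp.barrier_passes(0, threshold=1)  # one load back, one still in flight
warp.complete(0)
ready = warp.can_issue(wait_on=[0])            # both loads back: consumer may issue
```

With only six counters per warp, several in-flight instructions often have to share one scoreboard, which is why a threshold-style wait is modelled here alongside the plain "counter is zero" check.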
Short Scoreboard Stall: Occurs when a warp is waiting on a variable-latency instruction that completes inside the SM (e.g., special math MUFU or shared memory LDS/STS).
Long Scoreboard Stall: Occurs when a warp is waiting on data that must be fetched from outside the SM (e.g., a global memory load such as LDG.E.SYS).
Blackwell's scheduler is specifically optimized for low-precision, high-ILP workloads with clean control flow, abandoning Hopper's reliance on deeper bulk concurrency.
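Why high ILP reduces the need for deep concurrency can be shown with back-of-the-envelope arithmetic. The sketch below assumes a simplified steady state in which each warp issues `ilp` independent loads, each accompanied by `issue_cycles_per_load` cycles of independent work, and then blocks until the data returns. `warps_needed` and its parameters are hypothetical names, and the numbers are illustrative rather than measured on any GPU.

```python
def warps_needed(latency_cycles, issue_cycles_per_load, ilp=1):
    """Warps required to keep one issue slot busy in the toy steady-state model.

    latency_cycles:        cycles a long-scoreboard load takes to return
    issue_cycles_per_load: cycles of independent work available per load
    ilp:                   independent loads one warp keeps in flight before blocking
    """
    work = issue_cycles_per_load * ilp   # cycles a warp can issue before it stalls
    period = work + latency_cycles       # issue phase plus wait for the data
    return -(-period // work)            # integer ceiling of period / work


if __name__ == "__main__":
    for ilp in (1, 4, 8):
        print(ilp, warps_needed(latency_cycles=1000, issue_cycles_per_load=10, ilp=ilp))
```

With a 1000-cycle load and 10 cycles of work per load, this model needs 101 warps at an ILP of 1, 26 at an ILP of 4, and 14 at an ILP of 8. Exposing more independent instructions per warp therefore covers the same latency with far fewer resident warps, which is the trade-off the high-ILP design point relies on.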