CUDA Warp Scheduling and Divergence

Warp schedulers hide latency by rapidly swapping active warps. Dependencies between instructions cause Scoreboard Stalls (Long vs. Short).

Source: mortalapps.com
TL;DR
  • Warp schedulers hide latency by rapidly swapping active warps.
  • Dependencies between instructions cause Scoreboard Stalls (Long vs. Short).
  • Blackwell's scheduler breaks the rigid warp-synchronous model, optimizing for high Instruction-Level Parallelism (ILP).
  • DEPBAR (Dependency Barrier) instructions and per-warp scoreboards manage fine-grained SASS dependencies; B-registers handle control flow.

Why This Matters

A GPU is a throughput machine, not a latency machine. A single memory fetch from HBM can take on the order of 1000 cycles on Hopper. If the warp scheduler cannot find another eligible warp to issue during those cycles, its issue slots go empty, the SM's execution units sit idle, and utilization of expensive hardware drops.
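A rough way to size the problem is to ask how many warps a scheduler needs so that one is always eligible. The sketch below is a back-of-the-envelope model, not a hardware specification: `warps_needed` and its parameters are hypothetical names, it assumes every warp alternates between a fixed burst of issue cycles and one blocking wait, and it reuses the ~1000-cycle figure quoted above.

```python
import math

def warps_needed(latency_cycles: int, issue_cycles_per_warp: int) -> int:
    """Warps required so the scheduler always has one eligible to issue.

    Assumes each warp issues for `issue_cycles_per_warp` cycles, then blocks
    for `latency_cycles` cycles, and repeats.
    """
    if latency_cycles < 0 or issue_cycles_per_warp <= 0:
        raise ValueError("latency must be >= 0 and issue cycles must be > 0")
    # One warp issues while the others wait: 1 + ceil(latency / work).
    return 1 + math.ceil(latency_cycles / issue_cycles_per_warp)

# ~1000 cycles of latency, 20 cycles of work between blocking loads:
print(warps_needed(1000, 20))  # 51
```

Under these assumptions the answer grows quickly as the work between loads shrinks, which is why kernels that do little arithmetic per load struggle to hide memory latency through warp count alone.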

Core Intuition

Imagine a chess master playing 64 simultaneous games. The master (SM execution unit) makes a move (computes), then walks to the next board (context switches to another warp) while the opponent thinks (memory latency). If opponents think too long and no boards are ready, the master stands idle. The Warp Scheduler's job is to ensure there is always a board ready for the master.
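The analogy can be turned into a toy cycle-by-cycle simulation. This is a minimal sketch under stated assumptions: one scheduler issues at most one instruction per cycle, every warp (board) issues `work` instructions and then waits `latency` cycles, and the scheduler simply picks the lowest-numbered ready warp. The function name and the pick policy are hypothetical; real scheduling policies are more involved.

```python
def simulate_utilization(num_warps: int, work: int, latency: int, cycles: int) -> float:
    """Fraction of cycles in which a single scheduler issues an instruction."""
    ready_at = [0] * num_warps       # cycle at which each warp may issue again
    remaining = [work] * num_warps   # instructions left before the next stall
    issued = 0
    for now in range(cycles):
        for w in range(num_warps):
            if ready_at[w] <= now:
                issued += 1
                remaining[w] -= 1
                if remaining[w] == 0:
                    # The warp hits a long-latency dependency and stalls.
                    ready_at[w] = now + 1 + latency
                    remaining[w] = work
                break  # at most one issue per cycle
    return issued / cycles

for n in (1, 2, 4):
    print(n, simulate_utilization(n, work=4, latency=12, cycles=1600))
# 1 0.25
# 2 0.5
# 4 1.0
```

With one board the master is idle three quarters of the time; with four boards there is always a move to make.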

Technical Deep Dive

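Each warp has 6 hardware scoreboards (dependency counters) and at least 16 B-registers for managing control flow. The scoreboard assignment is made at compile time, not at fetch time: the compiler explicitly tags each variable-latency instruction with a scoreboard identifier in the SASS binary, and the hardware reads that tag when it issues the instruction. A dependent instruction then waits until the scoreboards it names have cleared.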
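The counting behaviour of the scoreboards can be sketched as a small model. Everything here is illustrative: the class, its method names, and the bit-mask form of the wait list are assumptions made for the example, and the real SASS encoding is not described in this article. Only the count of 6 scoreboards per warp comes from the text.

```python
from dataclasses import dataclass, field

NUM_SCOREBOARDS = 6  # per-warp count stated in the text

@dataclass
class WarpScoreboards:
    """Toy model of one warp's dependency counters (hypothetical names)."""
    counters: list = field(default_factory=lambda: [0] * NUM_SCOREBOARDS)

    def issue_producer(self, sb: int) -> None:
        # A variable-latency instruction tagged with scoreboard `sb` is issued.
        self.counters[sb] += 1

    def complete(self, sb: int) -> None:
        # The result arrived; release one pending dependency.
        if self.counters[sb] == 0:
            raise RuntimeError(f"scoreboard {sb} has nothing pending")
        self.counters[sb] -= 1

    def can_issue(self, wait_mask: int) -> bool:
        # A consumer may issue only if every scoreboard in its mask is clear.
        return all(
            self.counters[sb] == 0
            for sb in range(NUM_SCOREBOARDS)
            if wait_mask & (1 << sb)
        )

warp = WarpScoreboards()
warp.issue_producer(0)            # e.g. a global load tagged with scoreboard 0
warp.issue_producer(1)            # e.g. a shared-memory load tagged with scoreboard 1
print(warp.can_issue(0b000011))   # False: both results still pending
warp.complete(1)
print(warp.can_issue(0b000010))   # True: scoreboard 1 is clear
print(warp.can_issue(0b000001))   # False: scoreboard 0 still pending
```

Because only 6 counters exist, the compiler has to share them among all outstanding variable-latency instructions, so a consumer may end up waiting on more than the one result it actually needs.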

Short Scoreboard Stall: Occurs when waiting on variable-latency instructions that complete inside the SM (e.g., special-function math, MUFU, or shared memory accesses, LDS/STS).

Long Scoreboard Stall: Occurs when waiting on data that must be fetched from outside the SM (e.g., a global memory load, LDG.E.SYS).

Blackwell's scheduler is specifically optimized for low-precision, high-ILP workloads with clean control flow, rather than relying on deep bulk concurrency as Hopper does.
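The trade-off between ILP and bulk concurrency can be illustrated with Little's law: the number of loads that must be in flight is roughly the latency divided by the interval between issued loads, and that number can be supplied either by many warps or by several independent loads per warp. The sketch below is a simplified model with hypothetical names and made-up parameters, not a description of any specific scheduler.

```python
import math

def warps_for_latency(latency_cycles: int, issue_interval_cycles: int, ilp: int) -> int:
    """Warps needed to cover `latency_cycles` when each warp keeps `ilp`
    independent loads in flight and one load is issued every
    `issue_interval_cycles` cycles."""
    if issue_interval_cycles <= 0 or ilp <= 0:
        raise ValueError("issue interval and ilp must be positive")
    # Little's law: loads in flight = latency / interval between loads.
    loads_in_flight = math.ceil(latency_cycles / issue_interval_cycles)
    return math.ceil(loads_in_flight / ilp)

for ilp in (1, 2, 4, 8):
    print(ilp, warps_for_latency(1000, 20, ilp))
# 1 50
# 2 25
# 4 13
# 8 7
```

In this model, a kernel that exposes 8 independent loads per warp needs about a seventh of the warps that a fully serialized kernel needs to cover the same latency.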

Key Takeaways

  • Schedulers hide latency via zero-overhead switching between resident warps.
  • 6 scoreboards per warp track variable-latency dependencies.
  • Short stalls = shared memory (SMEM) / special-function math latency.
  • Long stalls = global memory (HBM) latency.
  • Blackwell favors instruction-level parallelism (ILP) within a warp over sheer warp concurrency.