SIMD vs SIMT Execution Models

CPUs use SIMD (Single Instruction, Multiple Data); GPUs use SIMT (Single Instruction, Multiple Threads).

Source: mortalapps.com

TL;DR
  • CPUs use SIMD (Single Instruction, Multiple Data); GPUs use SIMT (Single Instruction, Multiple Threads).
  • SIMD applies one instruction to every lane of a vector register; SIMT exposes a scalar, per-thread programming model that the hardware maps onto vector units.
  • On NVIDIA GPUs from Volta onward, SIMT includes Independent Thread Scheduling (ITS), which gives every thread its own program counter.
  • This flexibility is crucial for highly divergent AI workloads such as Mixture of Experts (MoE).

Why This Matters

Engineers migrating from CPU (C++/AVX-512) to GPU (CUDA) must remap their mental model. Treating a GPU warp purely like a CPU vector unit can lead to deadlocks or data races in modern AI kernels, especially in complex dynamic routing tasks or tree-search inference algorithms, where control flow diverges aggressively and threads in the same warp synchronize with one another.

Core Intuition

In SIMD, you drive a bus; all passengers (data elements) must go to the exact same destination at the same time. If one passenger needs a detour (an if statement), the bus must drive the detour with everyone on board, masking out the ones who don't care. In SIMT, you give 32 passengers their own bicycles (threads) and a map. They try to stay in a pack (a warp) for efficiency, but if they need to split up at a fork in the road (divergence), they can, and they will regroup later (reconvergence).
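The difference in programming model can be sketched in portable C++. This is an illustration only, not real AVX-512 or CUDA code: `kLanes`, `clamp_or_double_simd`, `clamp_or_double_simt`, and `launch_simt` are hypothetical names, and plain loops stand in for the vector hardware. In the SIMD version the programmer writes the mask and blend by hand (the bus drives the detour); in the SIMT version the programmer writes an ordinary `if` for one thread (one cyclist reading the map).

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kLanes = 8;  // hypothetical vector width for this sketch
using Vec = std::array<float, kLanes>;

// SIMD style: one call handles a whole vector. The branch is expressed as an
// explicit per-lane mask, both sides are computed, and a blend picks the result.
Vec clamp_or_double_simd(const Vec& x) {
    Vec out{};
    for (std::size_t lane = 0; lane < kLanes; ++lane) {  // stands in for vector instructions
        const bool mask = x[lane] < 0.0f;         // per-lane predicate
        const float if_side = 0.0f;               // computed for every lane
        const float else_side = 2.0f * x[lane];   // computed for every lane
        out[lane] = mask ? if_side : else_side;   // blend under the mask
    }
    return out;
}

// SIMT style: scalar code for ONE thread, identified by its index.
void clamp_or_double_simt(std::size_t tid, const Vec& x, Vec& out) {
    if (x[tid] < 0.0f) {
        out[tid] = 0.0f;
    } else {
        out[tid] = 2.0f * x[tid];
    }
}

// Stand-in for a kernel launch: on a GPU the hardware, not a loop,
// runs the scalar function once per thread.
Vec launch_simt(const Vec& x) {
    Vec out{};
    for (std::size_t tid = 0; tid < kLanes; ++tid) {
        clamp_or_double_simt(tid, x, out);
    }
    return out;
}
```

Both versions produce the same values; what changes is who manages the mask: the programmer in the SIMD version, the hardware in the SIMT version.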

Technical Deep Dive

NVIDIA's SIMT architecture groups 32 threads into a warp. Before the Volta architecture (Pascal and earlier), the threads in a warp shared a single program counter (PC) and an active mask. When threads in a warp took different sides of a branch (divergence), the warp executed the two paths one after the other: first the if path with the else threads masked off, then the else path with the if threads masked off.
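A minimal C++ model of that serialization may help; it is a sketch of the idea, not a description of the actual hardware. The names `run_divergent_branch` and `lane_utilization` are hypothetical, the branch body is an arbitrary example, and the utilization figure assumes both paths cost the same number of instructions.

```cpp
#include <array>
#include <cstdint>

constexpr int kWarpSize = 32;
using WarpData = std::array<int, kWarpSize>;

struct WarpResult {
    WarpData out{};
    int passes = 0;  // how many branch paths the warp had to issue
};

// Models one warp executing:
//   if (in[i] % 2 == 0) out[i] = in[i] / 2; else out[i] = 3 * in[i] + 1;
// with a single shared program counter and a 32-bit active mask.
WarpResult run_divergent_branch(const WarpData& in) {
    WarpResult r;

    // Evaluate the predicate for all lanes; bit i set means lane i takes the if path.
    std::uint32_t if_mask = 0;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (in[lane] % 2 == 0) if_mask |= (std::uint32_t{1} << lane);
    }
    const std::uint32_t else_mask = ~if_mask;

    // Pass 1: the if path runs with the else lanes masked off.
    if (if_mask != 0) {
        ++r.passes;
        for (int lane = 0; lane < kWarpSize; ++lane) {
            if ((if_mask >> lane) & 1u) r.out[lane] = in[lane] / 2;
        }
    }
    // Pass 2: the else path runs with the if lanes masked off.
    if (else_mask != 0) {
        ++r.passes;
        for (int lane = 0; lane < kWarpSize; ++lane) {
            if ((else_mask >> lane) & 1u) r.out[lane] = 3 * in[lane] + 1;
        }
    }
    return r;
}

// Fraction of issued lane slots doing useful work, assuming equal-cost paths.
double lane_utilization(const WarpResult& r) {
    return r.passes == 0 ? 0.0 : 1.0 / r.passes;
}
```

In this model a warp whose lanes all agree on the predicate issues one pass, while a warp with mixed lanes issues two passes and does useful work in only half of the lane slots.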

Starting with Volta, and continuing through Hopper and Blackwell, NVIDIA GPUs implement Independent Thread Scheduling (ITS). Every thread maintains its own program counter and call stack. The hardware still executes threads that share a program counter together, as a warp-sized group, for instruction-fetch efficiency, but the scheduler can interleave the divergent groups of a warp instead of running one path to completion first. This guarantees forward progress when threads within a warp acquire fine-grained locks or synchronize independently, which prevents the intra-warp deadlocks that were possible before Volta.
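The forward-progress argument can be illustrated with a toy scheduler written in C++. Everything here is an assumption made for illustration: the four-thread warp, the three-step program (take lock, critical section, release), the name `simulate_warp_lock`, and both scheduling policies. In particular, the lockstep policy always runs the threads at the lowest program counter, which reproduces the classic failure where the spinning threads starve the lock holder; real pre-Volta hardware used a reconvergence stack rather than this rule.

```cpp
#include <algorithm>
#include <array>

constexpr int kThreads = 4;  // toy warp width; a real NVIDIA warp has 32 threads
constexpr int kDone = 3;     // pc value meaning "thread has exited"

struct SimResult {
    int finished;  // threads that reached kDone
    int counter;   // increments performed inside the critical section
};

// Each thread runs: pc 0 = try to take the lock (spin on failure),
// pc 1 = critical section, pc 2 = release the lock, pc 3 = done.
SimResult simulate_warp_lock(bool independent_scheduling, int max_cycles) {
    std::array<int, kThreads> pc{};  // every thread starts at pc 0
    int lock_owner = -1;             // -1 means the lock is free
    int counter = 0;
    int next_pc = 0;                 // round-robin cursor for the ITS-like policy

    for (int cycle = 0; cycle < max_cycles; ++cycle) {
        int target = kDone;
        if (independent_scheduling) {
            // ITS-like policy (assumption): rotate over the pcs that still
            // have threads, so every divergent group eventually runs.
            for (int k = 0; k < kDone; ++k) {
                const int candidate = (next_pc + k) % kDone;
                if (std::count(pc.begin(), pc.end(), candidate) > 0) {
                    target = candidate;
                    next_pc = candidate + 1;
                    break;
                }
            }
        } else {
            // Lockstep policy (assumption): always run the group at the
            // lowest pc, so the spinning threads run before the lock holder.
            target = *std::min_element(pc.begin(), pc.end());
        }
        if (target == kDone) break;  // every thread has exited

        for (int t = 0; t < kThreads; ++t) {
            if (pc[t] != target) continue;  // thread is masked off this cycle
            if (target == 0) {
                if (lock_owner == -1) { lock_owner = t; pc[t] = 1; }
            } else if (target == 1) {
                ++counter;
                pc[t] = 2;
            } else {
                lock_owner = -1;
                pc[t] = 3;
            }
        }
    }
    const int finished = static_cast<int>(std::count(pc.begin(), pc.end(), kDone));
    return {finished, counter};
}
```

Under the lockstep policy the first thread takes the lock and is then never scheduled again, so no thread finishes no matter how large `max_cycles` is. Under the ITS-like policy every divergent group gets a turn, the lock is released, and all four threads pass through the critical section.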

Key Takeaways

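  • SIMT abstracts vector hardware behind a scalar programming model.
  • Independent Thread Scheduling guarantees forward progress for intra-warp synchronization, which prevents the deadlocks that were possible before Volta.
  • Physical execution is still lockstep for threads that share a program counter; divergence cuts efficiency in proportion to the number of serialized paths, so a balanced two-way branch roughly halves it.
  • On Volta and later, reconvergence is handled by convergence barriers that the hardware tracks in barrier (B) registers; code that needs threads reconverged at a specific point should still call __syncwarp().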