Tensor Computing

Streaming Multiprocessor (SM) Architecture

The SM is the atomic unit of GPU compute capability. Blackwell GB100 die (datacenter) contains 132 SMs.

Published June 1, 2026 · By MortalApps · 3 min read · ~569 words

TL;DR

The SM is the atomic unit of GPU compute capability.
Blackwell GB100 die (datacenter) contains 132 SMs.
Each Blackwell SM features 128 CUDA cores and 4 Fifth-Generation Tensor Cores.
Key architectural shift: Addition of 256 KB of Tensor Memory (TMEM) per SM specifically for MMA staging.

Why This Matters Intuition Deep Dive Takeaways Related

Why This Matters

To fully utilize a $30,000+ GPU, infrastructure engineers must balance their workload exactly against the SM's physical boundaries. Understanding the exact limits of Register Files, TMEM, SMEM, and execution units prevents catastrophic performance cliffs caused by register spilling or execution starvation.

Core Intuition

Think of the GPU as a massive factory floor, and the SM as an individual assembly line. You can only assign as many workers (threads) to the line as there are lockers (registers) and staging tables (SMEM/TMEM). If you give the workers tools that are too complex (high register usage), you must fire half the workers, leaving the actual assembly machines (Tensor Cores) sitting idle.

Technical Deep Dive

A full Blackwell GB100 die (datacenter) features 132 SMs. (The gaming GB202 die, used in RTX 5090, features 192 SMs but is a distinct product.)16 At the microarchitectural level, a single B200 SM contains:

Compute: 128 CUDA Cores (FP32/INT32), 4 Fifth-Gen Tensor Cores, and 4 Special Function Units (SFUs).

Registers: 64K 32-bit registers, totaling 256 KB.

L1 / SMEM: 228 KB of configurable Shared Memory (Compute 10.0 datacenter).

TMEM: 256 KB of dedicated Tensor Memory.

Blackwell radically rearchitects the memory hierarchy. The 256 KB TMEM is a dedicated SRAM block that is entirely separate from standard SMEM. It acts as a warp-synchronous vault.

Key Takeaways

GB100 (datacenter B200) contains 132 SMs.

B200 features 228 KB SMEM and 256 KB TMEM per SM.

4 Fifth-gen Tensor Cores per SM process data from TMEM.

Physical storage limits (Registers, SMEM) dictate concurrent thread occupancy.

When a block is scheduled on an SM:	Thread states are mapped into the 256 KB Register File.
L1/SMEM is dynamically partitioned based on kernel configuration (up to 227 KB usable).	The warp scheduler dispatches instructions. Regular math hits CUDA cores; complex AI math (tcgen05.mma) hits the 4 Tensor Cores.

Intermediate accumulator values are held in TMEM, preventing the 256 KB register file from overflowing.

Why This Matters

Core Intuition

Technical Deep Dive

Key Takeaways

Internals / Execution Flow

Performance Implications

Real-World Usage

Common Bottlenecks

Optimization Strategies

Tools & Frameworks

Interview Guidance

Related Concepts