Tensor Computing

GPU Hardware Generations (H100 → B200 → Future Architectures)

Blackwell (GB200/B200) is a system-level leap over Hopper (H100), built on a dual-die architecture with 208 billion transistors.

Published June 1, 2026 · By MortalApps · 5 min read · ~919 words

TL;DR

Blackwell (GB200/B200) represents a systemic leap over Hopper (H100).
Dual-die architecture pushing 208 Billion transistors.
Massive shift from register-bound math (WGMMA) to TMEM-bound math (UMMA, tcgen05.mma).
Future architectures are expected to move toward on-package HBM, optical interconnects, and <4-bit precision.


Why This Matters

AI infrastructure is hardware-dictated. Kernels tuned for Hopper's WGMMA path will run sub-optimally on Blackwell unless they are reworked to use Tensor Memory (TMEM). Engineers must anticipate hardware trajectories to design software stacks that carry over cleanly to future data centers.

Core Intuition

Hopper was about making the engine (Tensor Cores) as fast as possible and adding a fuel pump (TMA). Blackwell is about realizing the engine is so fast that the internal plumbing (Registers and SMEM) is bursting. Blackwell adds a massive dedicated fuel tank (TMEM) right next to the engine and a decompression system (DE) to squeeze more fuel through the pipes.
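To put a number on the "bursting plumbing" picture, the sketch below estimates how much storage one FP32 accumulator tile needs and compares it with the per-SM register file on Hopper and with TMEM on Blackwell. The 128 x 256 tile shape is a hypothetical choice for illustration, and the 256 KB register file size (64K 32-bit registers per SM) is a hardware figure that does not come from this article.

```python
# Back-of-the-envelope footprint of a GEMM accumulator tile.
# The tile shape is an illustrative assumption, not a value from the article.

BYTES_PER_FP32 = 4

REGISTER_FILE_BYTES = 256 * 1024  # per-SM register file on H100 (64K 32-bit registers)
TMEM_BYTES = 256 * 1024           # per-SM Tensor Memory on Blackwell (256 KB, as quoted in the article)


def accumulator_bytes(tile_m: int, tile_n: int) -> int:
    """Bytes needed to hold an FP32 accumulator tile of shape tile_m x tile_n."""
    return tile_m * tile_n * BYTES_PER_FP32


tile_m, tile_n = 128, 256  # hypothetical output tile per thread block
acc = accumulator_bytes(tile_m, tile_n)

# On Hopper the accumulator competes with every other live value for registers.
hopper_register_share = acc / REGISTER_FILE_BYTES
# On Blackwell it lives in TMEM, leaving registers for addressing and the epilogue.
blackwell_tmem_share = acc / TMEM_BYTES

print(f"accumulator tile: {acc // 1024} KB")
print(f"share of Hopper register file: {hopper_register_share:.0%}")
print(f"share of Blackwell TMEM: {blackwell_tmem_share:.0%}")
```

Under these assumptions a single tile takes half of the Hopper register file before any operand addresses, loop counters, or epilogue values are counted. That is the pressure the dedicated accumulator storage is meant to relieve.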

Technical Deep Dive

Hopper (H100):
Introduced TMA (asynchronous copy).
Relied on wgmma accumulating into the register file.
1st Gen Transformer Engine (FP8).

Blackwell (B200 / GB100):
Dual-die architecture, 104 billion transistors per die (208B total).
192 GB of HBM3e yielding 8.0 TB/s bandwidth.
Replaces wgmma with tcgen05.mma, offloading accumulation to the 256 KB TMEM.
Hardware Decompression Engine (DE) to multiply effective memory bandwidth.
2nd Gen Transformer Engine (FP4, NVFP4).
NVLink scalable to 130 TB/s in the NVL72 rack domain.
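The Blackwell figures above can be turned into a few quick sanity checks. The sketch below uses only the numbers quoted in this section and assumes decimal units (1 GB = 1e9 bytes). The parameter-count bound deliberately ignores activations, KV cache, and scale-factor overhead, so treat it as a ceiling rather than a sizing guide.

```python
# Arithmetic on the Blackwell figures quoted in this section (decimal units assumed).

TRANSISTORS_PER_DIE = 104e9
DIES = 2
HBM_CAPACITY_BYTES = 192e9
HBM_BANDWIDTH_BYTES_PER_S = 8.0e12

total_transistors = TRANSISTORS_PER_DIE * DIES

# Time to read all of HBM once at peak bandwidth: a floor for any
# memory-bound pass that touches every resident byte.
full_sweep_seconds = HBM_CAPACITY_BYTES / HBM_BANDWIDTH_BYTES_PER_S


def max_params_resident(bits_per_param: int) -> float:
    """Upper bound on parameters that fit in HBM at the given weight precision."""
    return HBM_CAPACITY_BYTES * 8 / bits_per_param


print(f"total transistors: {total_transistors / 1e9:.0f}B")
print(f"full HBM sweep at peak bandwidth: {full_sweep_seconds * 1e3:.1f} ms")
for bits in (8, 4):
    print(f"{bits}-bit weights: at most {max_params_resident(bits) / 1e9:.0f}B parameters")
```

The last loop shows why the move from FP8 to FP4 matters for capacity as well as throughput: halving the bits per weight doubles the ceiling on resident parameters for the same 192 GB.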

Key Takeaways

Blackwell is a 208B-transistor dual-die GPU that presents to software as a single device.

TMEM completely changes GEMM kernel design.

FP4 micro-scaling unlocks up to 20 PFLOPS of peak FP4 throughput.

Future systems will push towards multi-die, >500B transistors, optical interconnects, and <4-bit precision formats.
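To make "micro-scaling" concrete, here is a minimal sketch of block-scaled 4-bit quantization: each small block of values shares one scale factor, and every element is rounded to the nearest value on the FP4 (E2M1) grid. It is a simplified model, not the hardware's algorithm. The block length of 16 is illustrative, the scale is kept as an ordinary Python float rather than an 8-bit encoding, and the helper names are hypothetical.

```python
# Toy block-scaled 4-bit quantization. Simplifications (assumptions): the
# per-block scale is a Python float rather than an 8-bit encoding, and ties
# round to the smaller magnitude.

E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable |x| in FP4 E2M1
BLOCK_SIZE = 16  # illustrative block length


def quantize_block(block):
    """Return (scale, codes) where codes are signed E2M1 values and x ~= scale * code."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest E2M1 value.
    scale = amax / E2M1_MAGNITUDES[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a scale and its E2M1 codes."""
    return [scale * c for c in codes]


def quantize(values, block_size=BLOCK_SIZE):
    """Quantize a flat list block by block, each block with its own scale."""
    return [quantize_block(values[i:i + block_size])
            for i in range(0, len(values), block_size)]


scale, codes = quantize_block([0.0, 0.3, -1.2, 6.0])
print(scale, codes)
print(dequantize_block(scale, codes))
```

Because each block gets its own scale, a block of small values and a block of large values can both use the full eight-level grid, which is what keeps such a narrow format usable.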

When migrating code from H100 to B200:
Strip out register accumulation logic.
Initialize TMEM via tcgen05.alloc.
Route TMA loads to SMEM, then issue tcgen05.mma to compute into TMEM.
Use tcgen05.ld to move the final results to registers for epilogue activation functions.
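The ordering of the migration steps can be sketched as a host-side toy model. Everything here is a hypothetical Python stand-in that mirrors only the data flow (allocate the accumulator, stage operand slices, accumulate over K, read back once, run the epilogue). None of these names are CUDA or PTX APIs, and the model says nothing about synchronization or performance.

```python
# Host-side toy model of the Blackwell GEMM data flow.
# All classes and functions are hypothetical stand-ins, not CUDA or PTX APIs.

class TensorMemory:
    """Stand-in for TMEM: an accumulator buffer kept apart from 'registers'."""

    def __init__(self, rows, cols):
        # Models tcgen05.alloc: reserve accumulator storage up front.
        self.acc = [[0.0] * cols for _ in range(rows)]

    def mma(self, a_tile, b_tile):
        # Models tcgen05.mma: acc += a_tile @ b_tile, reading operands from "SMEM".
        for i, a_row in enumerate(a_tile):
            for kk, a_val in enumerate(a_row):
                for j, b_val in enumerate(b_tile[kk]):
                    self.acc[i][j] += a_val * b_val

    def load(self):
        # Models tcgen05.ld: copy the finished accumulator into "registers".
        return [row[:] for row in self.acc]


def gemm_relu(a, b, k_tile):
    """Compute relu(a @ b) on nested lists, tiling the K dimension by k_tile."""
    m, k, n = len(a), len(a[0]), len(b[0])
    tmem = TensorMemory(m, n)
    for k0 in range(0, k, k_tile):
        # Models TMA loads staging one K-slice of each operand into SMEM.
        a_smem = [row[k0:k0 + k_tile] for row in a]
        b_smem = b[k0:k0 + k_tile]
        tmem.mma(a_smem, b_smem)
    regs = tmem.load()  # single readback after the K loop
    return [[max(x, 0.0) for x in row] for row in regs]  # epilogue activation (ReLU)


print(gemm_relu([[1, 2], [3, 4]], [[1, -1], [-1, 1]], k_tile=1))
```

The point of the model is structural: the accumulator is never touched by the epilogue code until the K loop has finished, so the "register" side only ever holds the final tile.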
