
GPU Memory Hierarchy (Registers, Shared, Global)


Source: mortalapps.com
TL;DR
  • The GPU memory hierarchy runs from fast, thread-private registers, through programmer-managed shared memory (which shares on-chip SRAM with the L1 cache) and the L2 cache, down to high-latency global memory (HBM).
  • The core purpose of this hierarchy is bridging the multi-order-of-magnitude latency gap between arithmetic execution units and main memory.
  • The primary optimization idea is aggressive data reuse: moving data up the hierarchy once and computing on it multiple times before eviction.
  • The most important engineering insight is that thread occupancy is strictly bounded by the physical capacity of the highest levels of the hierarchy, specifically registers and shared memory.

Why This Matters

Understanding this hierarchy is the foundation of writing custom, high-performance CUDA or Triton kernels for AI infrastructure. A kernel that uses the hierarchy poorly leaves the GPU's compute units idle for hundreds of clock cycles while global memory fetches resolve. How well a kernel stages and reuses data as it moves through the hierarchy largely determines whether a matrix multiplication reaches 20% or 90% of theoretical peak FLOPS.

Core Intuition

Think of the GPU memory hierarchy as a pyramid that trades speed for capacity. At the top, the register file offers enormous aggregate bandwidth (exceeding 100 TB/s across the chip) but only a few hundred kilobytes of capacity per Streaming Multiprocessor (SM). At the bottom, global memory offers gigabytes of capacity but is slow, typically requiring 300 to 500 cycles to return data. Shared memory sits in between as a programmer-managed staging area: developers explicitly copy data into it so that repeated accesses avoid the latency of global memory.
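To make the staging idea concrete, here is a minimal Python sketch that counts global memory reads for a square matrix multiplication with and without shared-memory tiling. It is a counting model only, not a kernel. The function names are hypothetical, and the model assumes square matrices, a tile size that divides the matrix size evenly, and no help from the L1 or L2 caches.

```python
def global_loads_naive(n: int) -> int:
    """Elements read from global memory when every multiply-add of an
    n x n by n x n matmul fetches both operands from global memory."""
    return 2 * n ** 3


def global_loads_tiled(n: int, tile: int) -> int:
    """Elements read from global memory when tile x tile blocks of A and B
    are staged in shared memory and reused for a whole output tile."""
    if n % tile != 0:
        raise ValueError("this sketch assumes tile divides n evenly")
    tiles_per_dim = n // tile
    output_tiles = tiles_per_dim ** 2
    # Each output tile takes tiles_per_dim steps along the K dimension,
    # loading one tile of A and one tile of B per step.
    return output_tiles * tiles_per_dim * 2 * tile * tile


def reuse_factor(n: int, tile: int) -> float:
    """How many times fewer global reads the tiled version performs."""
    return global_loads_naive(n) / global_loads_tiled(n, tile)


def shared_bytes_for_tiles(tile: int, bytes_per_element: int = 4) -> int:
    """Shared memory needed to hold one tile of A and one tile of B."""
    return 2 * tile * tile * bytes_per_element


if __name__ == "__main__":
    n, tile = 1024, 32
    print(global_loads_naive(n), global_loads_tiled(n, tile))
    print(reuse_factor(n, tile), shared_bytes_for_tiles(tile))
```

Under these assumptions global traffic falls by a factor equal to the tile edge length, while the shared memory footprint grows with the square of the tile size. That is why shared memory capacity caps how much reuse a block can extract.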

Technical Deep Dive

The physical capacity of each level sets upper bounds on what a kernel can keep resident. The NVIDIA Hopper architecture provides 64K 32-bit registers per SM (256 KB), with a hard architectural limit of 255 registers per thread. Shared memory is on-chip SRAM located inside each SM. The Hopper H100 provides 228 KB of shared memory per SM, of which up to 227 KB can be allocated to a single thread block. The Blackwell B200 (GB100, compute capability 10.0 / sm_100) retains 228 KB of shared memory per SM, identical to H100. Gaming Blackwell chips (GB202, CC 12.0) use a different configuration of 128 KB per SM.
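The following Python sketch shows how these capacities bound the number of resident thread blocks on one SM. The register and shared memory figures come from the text above. The 2048-thread-per-SM limit is an added assumption for Hopper, and the function names are hypothetical. The model ignores register and shared memory allocation granularity, any per-block reserved shared memory, and the cap on resident blocks per SM, so treat its results as optimistic estimates rather than what an occupancy calculator would report.

```python
REGISTERS_PER_SM = 64 * 1024              # from the text: 64K 32-bit registers
MAX_REGISTERS_PER_THREAD = 255            # from the text
SHARED_BYTES_PER_SM = 228 * 1024          # from the text (H100 / B200)
MAX_SHARED_BYTES_PER_BLOCK = 227 * 1024   # from the text
MAX_THREADS_PER_SM = 2048                 # assumption: Hopper resident-thread limit


def resident_blocks(threads_per_block: int,
                    regs_per_thread: int,
                    shared_bytes_per_block: int) -> int:
    """Blocks that fit on one SM under the three simplified limits."""
    if not 1 <= regs_per_thread <= MAX_REGISTERS_PER_THREAD:
        raise ValueError("registers per thread must be in 1..255")
    if shared_bytes_per_block > MAX_SHARED_BYTES_PER_BLOCK:
        raise ValueError("block exceeds the per-block shared memory limit")
    by_threads = MAX_THREADS_PER_SM // threads_per_block
    by_registers = REGISTERS_PER_SM // (regs_per_thread * threads_per_block)
    if shared_bytes_per_block > 0:
        by_shared = SHARED_BYTES_PER_SM // shared_bytes_per_block
    else:
        by_shared = by_threads  # shared memory is not a constraint
    return min(by_threads, by_registers, by_shared)


def occupancy(threads_per_block: int,
              regs_per_thread: int,
              shared_bytes_per_block: int) -> float:
    """Resident threads as a fraction of the per-SM thread limit."""
    blocks = resident_blocks(threads_per_block, regs_per_thread,
                             shared_bytes_per_block)
    return blocks * threads_per_block / MAX_THREADS_PER_SM


if __name__ == "__main__":
    print(occupancy(256, 32, 0))           # registers allow full occupancy
    print(occupancy(256, 128, 0))          # register-bound
    print(occupancy(256, 32, 100 * 1024))  # shared-memory-bound
```

In this model, raising register use from 32 to 128 per thread at 256 threads per block cuts resident blocks from 8 to 2. Requesting 100 KB of shared memory per block has the same effect.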

A notable addition in the Hopper architecture is Distributed Shared Memory (DSM). Thread blocks within a thread block cluster (up to 16 blocks, with 8 as the portable maximum) can directly read from and write to the shared memory of other blocks in the same cluster, bypassing the L2 cache and global memory.
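As a rough illustration of what DSM offers, the Python sketch below estimates the shared memory pool a cluster can address and the cycles saved when blocks exchange data through DSM instead of global memory. The latency constants are the approximate figures quoted in the Key Takeaways (about 181 cycles for DSM, about 300 cycles as a lower bound for HBM). The function names are hypothetical, and the model assumes fully serialized, dependent accesses with no latency hiding, which real kernels would improve on.

```python
MAX_BLOCKS_PER_CLUSTER = 16   # from the text
SHARED_KB_PER_BLOCK = 227     # from the text: usable shared memory per block
DSM_LATENCY_CYCLES = 181      # approximate figure quoted in this article
HBM_LATENCY_CYCLES = 300      # lower bound quoted in this article


def cluster_shared_kb(blocks: int) -> int:
    """Upper bound on shared memory addressable across a cluster, in KB."""
    if not 1 <= blocks <= MAX_BLOCKS_PER_CLUSTER:
        raise ValueError("cluster size must be in 1..16")
    return blocks * SHARED_KB_PER_BLOCK


def cycles_saved(dependent_accesses: int) -> int:
    """Cycles saved by serving serialized remote reads from DSM, not HBM."""
    return dependent_accesses * (HBM_LATENCY_CYCLES - DSM_LATENCY_CYCLES)


if __name__ == "__main__":
    print(cluster_shared_kb(16))
    print(cycles_saved(1000))
```

The capacity figure is an upper bound that assumes every block in the cluster requests the full per-block allocation.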

Key Takeaways

  • Registers set a hard cap on thread concurrency: Hopper provides 64K registers per SM, with a maximum of 255 per thread.
  • Shared memory capacity is 228 KB per SM in both H100 and B200 (GB100, CC 10.0). Gaming Blackwell (GB202, CC 12.0), a distinct product line, uses 128 KB per SM.
  • Distributed Shared Memory (DSM) provides inter-SM memory access at ~181 cycles, avoiding the ~300+ cycle penalty of HBM.
  • Maximizing arithmetic intensity requires keeping data in the highest possible memory tier for the duration of the compute phase.
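The arithmetic intensity point can be illustrated with a roofline-style estimate, a standard model that is not described in this article. In the Python sketch below the peak throughput and memory bandwidth are hypothetical placeholder values chosen for round numbers, not the specifications of any particular GPU, and the function names are illustrative.

```python
PEAK_TFLOPS = 1000.0        # hypothetical placeholder, not a real spec
BANDWIDTH_TB_PER_S = 3.0    # hypothetical placeholder, not a real spec


def attainable_tflops(flops_per_byte: float,
                      peak_tflops: float = PEAK_TFLOPS,
                      bandwidth_tb_per_s: float = BANDWIDTH_TB_PER_S) -> float:
    """Roofline bound: limited by either compute or memory traffic.
    (FLOP/byte) * (TB/s) yields TFLOP/s."""
    return min(peak_tflops, flops_per_byte * bandwidth_tb_per_s)


def ridge_point(peak_tflops: float = PEAK_TFLOPS,
                bandwidth_tb_per_s: float = BANDWIDTH_TB_PER_S) -> float:
    """Arithmetic intensity (FLOP/byte) above which compute is the limit."""
    return peak_tflops / bandwidth_tb_per_s


if __name__ == "__main__":
    for intensity in (10, 100, 500):
        bound = attainable_tflops(intensity)
        print(intensity, bound, bound / PEAK_TFLOPS)
    print(ridge_point())
```

With these placeholder numbers, a kernel performing 10 FLOPs per byte of global traffic is capped at 3% of peak. Only kernels above roughly 333 FLOPs per byte stop being limited by memory bandwidth.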