GPU Memory Systems

Cache Hierarchies and Hit Rate Optimization

GPU cache hierarchies consist of highly localized L1 caches per SM and a massive shared global L2 cache across all SMs on the die.

Published June 1, 2026 · By MortalApps · 5 min read · ~815 words

TL;DR

GPU cache hierarchies pair a small, private L1 cache in each SM with a large L2 cache shared by all SMs on the die.
The hierarchy buffers expensive global memory accesses, and the shared L2 also serves inter-SM communication and atomic operations.
The primary optimization is explicit control of cache eviction policy, through CUDA runtime access properties or PTX-level hints, so that high-value data blocks stay resident.
The most important engineering insight is that part of the L2 cache can be programmatically set aside for critical tensors, so that reuse is served from L2 instead of paying VRAM latency.

Why This Matters

With modern VRAM (HBM) latencies of roughly 300 to 500 cycles and L2 cache latencies of roughly 150 to 200 cycles, raising the cache hit rate yields a large performance gain, and the gain compounds because every avoided miss also frees HBM bandwidth for other requests. For workloads such as deep learning, where the same weight blocks or attention keys are reused across many thread blocks, cache behavior largely determines the execution time and power consumption of the kernel.
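
To make the stakes concrete, a deliberately simple latency model can help. The symbols h (L2 hit rate), t_L2, and t_HBM are introduced here for illustration only, and the cycle counts are the upper ends of the ranges quoted above. The model ignores latency hiding by warp scheduling and any bandwidth limits, so it is a rough sketch rather than a prediction.

```latex
\begin{aligned}
t_{\text{avg}}(h)   &= h \, t_{\text{L2}} + (1 - h) \, t_{\text{HBM}} \\
t_{\text{avg}}(0.5) &= 0.5 \cdot 200 + 0.5 \cdot 500 = 350 \text{ cycles} \\
t_{\text{avg}}(0.9) &= 0.9 \cdot 200 + 0.1 \cdot 500 = 230 \text{ cycles}
\end{aligned}
```

Under these assumptions, raising the hit rate from 50% to 90% cuts the average access latency by about a third, while the share of requests that reach HBM falls from 50% to 10%, a fivefold reduction in memory traffic.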

Core Intuition

Think of the L2 cache as the town square of the GPU. All SMs (neighborhoods) connect to it. It is large (approximately 50 MB on Hopper) but fiercely contested by competing workloads. Default caching behaves much like an LRU (Least Recently Used) policy, so valuable data is frequently evicted by large, one-off streaming operations. Hit rate optimization is the deliberate practice of telling the GPU hardware which data is temporary and which data should stay resident.

Technical Deep Dive

Starting with CUDA 11.0 on GPUs of compute capability 8.0 and later, and refined in the Hopper and Blackwell architectures, NVIDIA exposes programmatic L2 cache control through the cudaAccessPropertyPersisting and cudaAccessPropertyStreaming access properties. The L2 cache can be dynamically partitioned into a "set-aside" area reserved for persisting data. A developer reserves a portion of the total L2 capacity, specified in bytes, with the cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size) API. When a memory range is annotated with the persisting property through an access policy window, the hardware preferentially places its cache lines in the set-aside region, shielding them from eviction by ordinary streaming loads.
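
The flow described above can be sketched in host code roughly as follows. This is a minimal illustration, not a tuned implementation: the kernel apply_weights, the helpers clamp_set_aside and hit_ratio_for, the buffer sizes, and the choice to reserve three quarters of L2 are all assumptions made for the example. It also assumes a device of compute capability 8.0 or later and exits early otherwise.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>
#include <cstdlib>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            std::fprintf(stderr, "%s: %s\n", #call,               \
                         cudaGetErrorString(err_));               \
            std::exit(EXIT_FAILURE);                              \
        }                                                         \
    } while (0)

// Hypothetical kernel: every thread block re-reads the same small `weights` buffer.
__global__ void apply_weights(const float* weights, size_t n_weights,
                              float* out, size_t n_out) {
    size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < n_out) out[i] = 2.0f * weights[i % n_weights];
}

// Clamp the requested set-aside size to the maximum the device allows.
size_t clamp_set_aside(size_t requested, size_t device_max) {
    return std::min(requested, device_max);
}

// Fraction of the window that can persist without oversubscribing the set-aside.
float hit_ratio_for(size_t window_bytes, size_t set_aside_bytes) {
    if (window_bytes == 0) return 0.0f;
    return std::min(1.0f, static_cast<float>(set_aside_bytes) /
                              static_cast<float>(window_bytes));
}

int main() {
    cudaDeviceProp prop{};
    CUDA_CHECK(cudaGetDeviceProperties(&prop, 0));
    if (prop.persistingL2CacheMaxSize == 0) {
        std::printf("This device has no L2 set-aside support.\n");
        return 0;
    }

    const size_t weight_bytes = 4u << 20;   // 4 MiB, reused by every block (illustrative)
    const size_t n_weights = weight_bytes / sizeof(float);
    const size_t n_out = 16u << 20;         // 16 Mi floats of streamed output (illustrative)
    float *weights = nullptr, *out = nullptr;
    CUDA_CHECK(cudaMalloc(&weights, weight_bytes));
    CUDA_CHECK(cudaMalloc(&out, n_out * sizeof(float)));
    CUDA_CHECK(cudaMemset(weights, 0, weight_bytes));

    // 1. Reserve part of L2 for persisting lines (three quarters is an arbitrary choice).
    const size_t set_aside = clamp_set_aside(
        static_cast<size_t>(prop.l2CacheSize) / 4 * 3,
        static_cast<size_t>(prop.persistingL2CacheMaxSize));
    CUDA_CHECK(cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, set_aside));

    // 2. Attach an access policy window covering `weights` to the stream.
    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));
    const size_t window = std::min(
        weight_bytes, static_cast<size_t>(prop.accessPolicyMaxWindowSize));
    cudaStreamAttrValue attr{};
    attr.accessPolicyWindow.base_ptr = weights;
    attr.accessPolicyWindow.num_bytes = window;
    attr.accessPolicyWindow.hitRatio = hit_ratio_for(window, set_aside);
    attr.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting;  // accesses chosen to persist
    attr.accessPolicyWindow.missProp = cudaAccessPropertyStreaming;  // the rest of the window
    CUDA_CHECK(cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr));

    // 3. Kernels launched into this stream now see the window.
    const int threads = 256;
    const unsigned blocks = static_cast<unsigned>((n_out + threads - 1) / threads);
    apply_weights<<<blocks, threads, 0, stream>>>(weights, n_weights, out, n_out);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaStreamSynchronize(stream));

    CUDA_CHECK(cudaStreamDestroy(stream));
    CUDA_CHECK(cudaFree(weights));
    CUDA_CHECK(cudaFree(out));
    return 0;
}
```

The hitRatio field matters when the window is larger than the set-aside region: lowering it tells the hardware to treat only that fraction of accesses in the window as persisting, which avoids persisting lines evicting each other.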

Key Takeaways

L2 cache operations and eviction policies can be programmatically controlled via Access Property flags.

cudaLimitPersistingL2CacheSize sets aside a dedicated portion of the L2 cache for persisting accesses, which protects reused data from thrashing.

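Once the persisting data is no longer needed, reset the access policy window (for example with cudaAccessPropertyNormal) or call cudaCtxResetPersistingL2Cache(); otherwise stale persisting lines can keep occupying the set-aside region and starve subsequent workloads.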

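Accesses served through L1 work at 128-byte cache-line granularity, while accesses served only by L2 work in 32-byte sectors, a critical distinction for scattered access patterns: a warp whose 32 four-byte loads each touch a different line moves 4,096 bytes at 128-byte granularity but only 1,024 bytes at 32-byte granularity.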
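
The cleanup step from the takeaways might look like the sketch below. The helper names disabled_window and release_persisting_l2 are hypothetical, and the code assumes a device of compute capability 8.0 or later; on other devices the calls may simply return an error.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Build a zero-length window with normal properties; applying it disables the policy.
cudaAccessPolicyWindow disabled_window() {
    cudaAccessPolicyWindow w{};
    w.base_ptr = nullptr;
    w.num_bytes = 0;
    w.hitRatio = 0.0f;
    w.hitProp = cudaAccessPropertyNormal;
    w.missProp = cudaAccessPropertyNormal;
    return w;
}

// Hypothetical helper: drop the stream's window, then reset persisting lines to normal.
cudaError_t release_persisting_l2(cudaStream_t stream) {
    cudaStreamAttrValue attr{};
    attr.accessPolicyWindow = disabled_window();
    cudaError_t err = cudaStreamSetAttribute(
        stream, cudaStreamAttributeAccessPolicyWindow, &attr);
    if (err != cudaSuccess) return err;
    return cudaCtxResetPersistingL2Cache();
}

int main() {
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return 1;
    // ... kernels that used a persisting window on `stream` would run here ...
    cudaError_t err = release_persisting_l2(stream);
    std::printf("release_persisting_l2: %s\n", cudaGetErrorString(err));
    cudaStreamDestroy(stream);
    return err == cudaSuccess ? 0 : 1;
}
```

Calling a helper like this after the last kernel that benefits from the persisting data returns the set-aside lines to normal status, so later kernels are not competing with stale entries.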
