Cache Hierarchies and Hit Rate Optimization
GPU cache hierarchies consist of highly localized L1 caches per SM and a massive shared global L2 cache across all SMs on the die.
Source: mortalapps.com
- The core purpose is buffering expensive global memory accesses and handling high-speed inter-SM communication and atomic operations.
- The primary optimization idea is explicitly controlling cache eviction policies at the PTX level to persist high-value data blocks.
- The most important engineering insight is that part of the L2 cache can be programmatically set aside so that critical tensors stay resident, avoiding VRAM latency on reuse.
Why This Matters
With modern VRAM (HBM) latencies of roughly 300 to 500 cycles and L2 cache latencies of roughly 150 to 200 cycles, a higher L2 hit rate pays off disproportionately: DRAM traffic scales with the miss rate, so raising the hit rate from 90% to 95% halves the number of accesses that go out to VRAM. For workloads like deep learning, where the same weight blocks or attention keys are reused across many thread blocks, cache behavior largely determines the kernel's execution time and power consumption.
Core Intuition
Think of the L2 cache as the town square of the GPU. All SMs (neighborhoods) connect to it. It is large (approximately 50 MB on Hopper) but heavily contested by competing workloads. By default, lines are evicted under an LRU-style (least recently used) policy, so valuable, frequently reused data is often pushed out by large one-off streaming operations. Hit rate optimization is the deliberate practice of telling the hardware which data is temporary and which data should be kept resident.
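As a rough illustration of marking data as temporary, the sketch below uses CUDA's cache-hint intrinsics __ldcs and __stcs, which map to the PTX cache-streaming (.cs) operator and ask the hardware to treat a line as likely to be touched only once. The kernel name scale_by_weights, the helper run_streaming_demo, and the array sizes are hypothetical, chosen only for this example. The hints are advisory, so any benefit depends on the hardware and access pattern and should be measured.

```cuda
#include <vector>
#include <cuda_runtime.h>

// out[i] = stream_in[i] * weights[i % n_weights]
// stream_in is read exactly once, so it is loaded with the cache-streaming
// hint. weights is reused by every block and uses an ordinary load so that
// it remains a candidate for staying in cache.
__global__ void scale_by_weights(const float* stream_in, const float* weights,
                                 float* out, int n, int n_weights) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = __ldcs(&stream_in[i]);   // one-off data: streaming load hint
        float w = weights[i % n_weights];  // hot data: default caching
        __stcs(&out[i], x * w);            // write-once output: streaming store hint
    }
}

// Hypothetical host-side check. Returns the number of mismatching elements,
// or -1 if any CUDA call fails (for example, when no GPU is present).
int run_streaming_demo() {
    const int n = 1 << 20;
    const int n_weights = 1024;
    std::vector<float> h_in(n), h_w(n_weights), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i % 7);
    for (int i = 0; i < n_weights; ++i) h_w[i] = 0.5f * static_cast<float>(i % 4);

    float *d_in = nullptr, *d_w = nullptr, *d_out = nullptr;
    bool ok = cudaMalloc(&d_in, n * sizeof(float)) == cudaSuccess
           && cudaMalloc(&d_w, n_weights * sizeof(float)) == cudaSuccess
           && cudaMalloc(&d_out, n * sizeof(float)) == cudaSuccess
           && cudaMemcpy(d_in, h_in.data(), n * sizeof(float),
                         cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(d_w, h_w.data(), n_weights * sizeof(float),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        scale_by_weights<<<blocks, threads>>>(d_in, d_w, d_out, n, n_weights);
        // The blocking copy below also waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_w);
    cudaFree(d_out);
    if (!ok) return -1;

    // Inputs are small multiples of 0.5, so the products compare exactly.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i] * h_w[i % n_weights]) ++mismatches;
    return mismatches;
}
```

Calling run_streaming_demo() from a main function only checks that the kernel computes the right values. Seeing the caching effect itself requires a profiler that reports L2 hit rates with and without the hints.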
Technical Deep Dive
Starting with CUDA 11.0 on devices of compute capability 8.0 (Ampere), and available on later architectures including Hopper and Blackwell, NVIDIA exposes programmatic L2 cache control through access properties such as cudaAccessPropertyPersisting and cudaAccessPropertyStreaming. A portion of the L2 cache can be set aside exclusively for persisting data. A developer reserves this portion with cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size), where size is capped by the device's persistingL2CacheMaxSize property. The properties are not attached to a raw pointer directly. They are applied to an address range through an access policy window (base pointer, byte count, hit ratio) set on a stream or on a CUDA graph kernel node. Accesses that fall inside a persisting window are preferentially kept in the set-aside region, shielding them from eviction by ordinary streaming loads.
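The set-aside and window mechanism described above can be sketched with the CUDA runtime API as follows. This is a minimal sketch, not a tuned implementation: the helper names enable_l2_persistence and disable_l2_persistence are hypothetical, and the hit-ratio heuristic (set-aside size divided by window size when the window is larger) is one reasonable choice rather than a required one. It also assumes that devices without L2 persistence support report a persistingL2CacheMaxSize of 0.

```cuda
#include <algorithm>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: marks [ptr, ptr + num_bytes) as persisting for work
// launched into `stream`. Returns the first CUDA error, or cudaSuccess.
cudaError_t enable_l2_persistence(cudaStream_t stream, void* ptr,
                                  size_t num_bytes, size_t requested_set_aside) {
    int device = 0;
    cudaError_t err = cudaGetDevice(&device);
    if (err != cudaSuccess) return err;

    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) return err;

    // Assumption: a value of 0 means the device has no L2 set-aside support.
    if (prop.persistingL2CacheMaxSize <= 0) return cudaErrorNotSupported;

    // Reserve part of L2 for persisting lines, clamped to the device maximum.
    size_t set_aside = std::min(
        requested_set_aside, static_cast<size_t>(prop.persistingL2CacheMaxSize));
    err = cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, set_aside);
    if (err != cudaSuccess) return err;

    // The window is also limited in size by the device.
    size_t window = std::min(
        num_bytes, static_cast<size_t>(prop.accessPolicyMaxWindowSize));

    // If the window is larger than the set-aside region, mark only a fraction
    // of accesses as persisting so persisting lines do not evict each other.
    float hit_ratio = 1.0f;
    if (window > set_aside)
        hit_ratio = static_cast<float>(set_aside) / static_cast<float>(window);

    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = ptr;
    attr.accessPolicyWindow.num_bytes = window;
    attr.accessPolicyWindow.hitRatio  = hit_ratio;
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    return cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow,
                                  &attr);
}

// Hypothetical helper: removes the window from the stream and returns any
// persisting lines to normal caching behavior.
cudaError_t disable_l2_persistence(cudaStream_t stream) {
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.num_bytes = 0;  // a zero-sized window disables the policy
    cudaError_t err = cudaStreamSetAttribute(
        stream, cudaStreamAttributeAccessPolicyWindow, &attr);
    if (err != cudaSuccess) return err;
    return cudaCtxResetPersistingL2Cache();
}
```

A typical pattern would be to call enable_l2_persistence on the stream before launching the kernels that reuse the buffer, then call disable_l2_persistence once the reuse phase ends so that later work can use the full L2 capacity again.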