Register Spilling and Resource Exhaustion
Register spilling occurs when a thread requires more variables than the physical register file can hold, forcing the compiler to offload data.
Source: mortalapps.com
- The core purpose of managing spilling is to keep operands in registers, whose near-single-cycle access latency keeps the arithmetic units saturated.
- The primary optimization idea is to shorten variable lifetimes and to use shared memory as an intermediate spill target before falling back to local memory.
- The most important engineering insight is that "local memory" physically resides in slow off-chip device memory (the same HBM that backs Global Memory), which makes heavy spilling a catastrophic performance cliff.
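To make "reducing variable lifespans" concrete, here is a minimal sketch that counts how many values are live at the same time, which is roughly the quantity a register allocator has to fit into the register file. The function name `peak_live` and both interval lists are hypothetical illustrations, not output from ptxas or any real kernel.

```python
def peak_live(intervals):
    """Return the maximum number of simultaneously live values.

    Each interval is (first_def, last_use) in instruction indices, inclusive.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))      # value becomes live at its definition
        events.append((end + 1, -1))   # value dies right after its last use
    live = peak = 0
    # Tuple sorting handles deaths (-1) before births (+1) at the same index.
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak


# Hypothetical schedule A: four values loaded up front, each used much later.
long_lived = [(0, 8), (1, 9), (2, 10), (3, 11)]
# Hypothetical schedule B: same work, but each value is loaded just before use.
short_lived = [(0, 1), (2, 3), (4, 5), (6, 7)]

print(peak_live(long_lived))   # 4 values live at once
print(peak_live(short_lived))  # 1 value live at once
```

Both schedules touch four values, but the second needs only one register at its peak. Moving definitions closer to their uses lowers register pressure in the same way, although a real compiler also reorders instructions on its own.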
Why This Matters
Registers are the fastest and highest-bandwidth storage on the GPU (aggregate >100 TB/s). However, they are strictly limited: at most 255 per thread, drawn from a file of 64K 32-bit registers per SM. When an intricate AI kernel (such as an unrolled GEMM loop) hits this limit, the ptxas compiler generates spill loads and stores without failing the build; they are reported only when verbose output is requested (for example with -Xptxas -v). Accesses that should take roughly one cycle are routed instead to the L1/L2 caches or to VRAM (hundreds of cycles), which sharply reduces compute throughput and kernel efficiency.
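The two limits above interact: the more registers each thread uses, the fewer warps fit on an SM. The following back-of-envelope sketch works through that arithmetic. The 64-warp ceiling is an assumption for a Hopper-class SM (2,048 resident threads), and the model ignores register allocation granularity and every other occupancy limiter, so treat the results as upper bounds only.

```python
REGS_PER_SM = 64 * 1024      # 65,536 32-bit registers per SM (from the text)
MAX_REGS_PER_THREAD = 255    # per-thread ceiling (from the text)
WARP_SIZE = 32
MAX_WARPS_PER_SM = 64        # assumption: 2,048 resident threads per SM


def resident_warps(regs_per_thread):
    """Upper bound on warps per SM imposed by the register file alone."""
    if not 1 <= regs_per_thread <= MAX_REGS_PER_THREAD:
        raise ValueError("regs_per_thread must be between 1 and 255")
    regs_per_warp = regs_per_thread * WARP_SIZE
    return min(MAX_WARPS_PER_SM, REGS_PER_SM // regs_per_warp)


for regs in (32, 64, 128, 255):
    print(regs, resident_warps(regs))
```

At 32 registers per thread the register file can feed all 64 warps; at the 255-register ceiling only 8 warps fit. This is why kernels near the limit have both little latency hiding and no headroom before spilling starts.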
Core Intuition
Imagine a master mechanic with a toolbelt holding exactly 255 tools. If the job requires 300 tools, the mechanic must put 45 tools in a toolbox at the back of the garage (Global Memory). Every time they need one of those 45 tools, they must walk across the garage to swap it with a tool currently in their belt. This walking destroys their efficiency, regardless of how fast they actually use the tools.
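The analogy can be turned into a rough cost model. This sketch assumes every tool (value) is used equally often and that a spilled access costs a fixed number of cycles. The 400-cycle figure is an illustrative guess in the "hundreds of cycles" range, not a measured latency, and `average_access_cycles` is a hypothetical helper.

```python
def average_access_cycles(total_values, register_slots, spill_cycles,
                          register_cycles=1):
    """Mean access cost when every value is touched equally often."""
    in_registers = min(total_values, register_slots)
    spilled = total_values - in_registers
    total_cost = in_registers * register_cycles + spilled * spill_cycles
    return total_cost / total_values


# 300 tools, 255 belt slots, 45 tools in the toolbox across the garage.
print(average_access_cycles(300, 255, spill_cycles=400))
# A job that fits entirely in the belt pays only the register cost.
print(average_access_cycles(200, 255, spill_cycles=400))
```

Under these assumptions, spilling only 15% of the values raises the average access cost from 1 cycle to about 61 cycles. Real kernels fare better when spills hit in L1, but the shape of the cliff is the same.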
Technical Deep Dive
Each Hopper SM contains 64K 32-bit registers. When ptxas cannot fit a kernel's live values into the per-thread register budget, it allocates a stack frame in Local Memory (a thread-private address space that is physically backed by Global Memory) and spills the excess there. Historically, spills always targeted this high-latency local memory. CUDA 13.0 introduced an opt-in optimization, enabled by emitting the PTX directive .pragma enable_smem_spilling through inline assembly. It allows the compiler to use high-bandwidth, low-latency Shared Memory as the primary backing store for spilled registers, falling back to local memory only when Shared Memory is exhausted.
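The "shared memory first, local memory as fallback" ordering can be sketched as simple budget arithmetic. This is a toy model of the ordering described above, not the compiler's actual placement policy, which the text does not specify. The function `spill_placement`, the per-thread spill sizes, and the 48 KiB of free shared memory are all hypothetical.

```python
def spill_placement(spill_bytes_per_thread, threads_per_block, free_smem_bytes):
    """Split one block's spill footprint between shared and local memory.

    Toy model: shared memory is filled first, the remainder goes to local.
    """
    needed = spill_bytes_per_thread * threads_per_block
    to_shared = min(needed, free_smem_bytes)
    return {"shared": to_shared, "local": needed - to_shared}


FREE_SMEM = 48 * 1024  # hypothetical shared memory left over per block

# 64 B of spill per thread x 256 threads = 16 KiB, which fits in shared memory.
print(spill_placement(64, 256, FREE_SMEM))
# 256 B per thread x 256 threads = 64 KiB, so 16 KiB overflows to local memory.
print(spill_placement(256, 256, FREE_SMEM))
```

The spill footprint scales with the block size, so a kernel that already uses most of its shared memory for tiles leaves little room for this optimization to help.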