Cache-Aware Tiling Strategies
Cache-aware tiling decomposes massive multi-dimensional matrices into smaller geometric tiles that fit perfectly within L1/Shared Memory.
Source: mortalapps.com- Cache-aware tiling decomposes massive multi-dimensional matrices into smaller geometric tiles that fit perfectly within L1/Shared Memory.
- The core purpose is to maximize memory reuse, fetching data from slow HBM once and executing
math operations on it before eviction.
- The primary optimization idea is balancing the dimensions of the tile (
) against register file limits and shared memory capacity.
- The most important engineering insight is that the ratio of Compute (FLOPs) to Memory Access (Bytes) scales with tile size, making larger tiles definitively better—until they spill out of SRAM.
Why This Matters
Standard matrix multiplication () without tiling performs
operations using
memory accesses, making it fundamentally and unavoidably memory-bound. Cache-aware tiling reduces the memory access complexity to
, dramatically elevating the operation's Arithmetic Intensity. Revolutionary algorithms like FlashAttention are entirely predicated on exactly this concept.
Core Intuition
Imagine painting a massive billboard. If you dip your brush in a bucket on the ground, climb the ladder, paint a stroke, and climb down, you waste all your time climbing. Tiling is carrying a small palette up the ladder. You fill the palette (the tile), climb up, and paint hundreds of strokes (the math) until the palette is dry. The size of the palette is strictly limited by how much you can physically hold (Shared Memory/Registers).
Technical Deep Dive
A dense GEMM kernel iteratively loads a tile of matrix (shape
) and a tile of matrix
(shape
) into Shared Memory. The physical capacity of the hardware dictates the maximum limits: Hopper H100 provides 228 KB of shared memory per SM (max 227 KB usable per thread block). Blackwell B200 (GB100) retains 228 KB per SM, matching H100. Tile sizes are governed by the same 228 KB physical limit. To compute the output tile
(
), each thread claims a sub-tile, moving values from Shared Memory into its Registers for calculation.