GPU Memory Systems

Cache-Aware Tiling Strategies

Cache-aware tiling decomposes massive multi-dimensional matrices into smaller geometric tiles that fit perfectly within L1/Shared Memory.

Published June 1, 2026 · By MortalApps · 5 min read · ~844 words

TL;DR

Cache-aware tiling decomposes massive multi-dimensional matrices into smaller geometric tiles that fit perfectly within L1/Shared Memory.
The core purpose is to maximize memory reuse, fetching data from slow HBM once and executing math operations on it before eviction.
The primary optimization idea is balancing the dimensions of the tile () against register file limits and shared memory capacity.
The most important engineering insight is that the ratio of Compute (FLOPs) to Memory Access (Bytes) scales with tile size, making larger tiles definitively better—until they spill out of SRAM.

Why This Matters Intuition Deep Dive Takeaways Related

Why This Matters

Standard matrix multiplication () without tiling performs operations using memory accesses, making it fundamentally and unavoidably memory-bound. Cache-aware tiling reduces the memory access complexity to , dramatically elevating the operation's Arithmetic Intensity. Revolutionary algorithms like FlashAttention are entirely predicated on exactly this concept.

Core Intuition

Imagine painting a massive billboard. If you dip your brush in a bucket on the ground, climb the ladder, paint a stroke, and climb down, you waste all your time climbing. Tiling is carrying a small palette up the ladder. You fill the palette (the tile), climb up, and paint hundreds of strokes (the math) until the palette is dry. The size of the palette is strictly limited by how much you can physically hold (Shared Memory/Registers).

Technical Deep Dive

A dense GEMM kernel iteratively loads a tile of matrix (shape ) and a tile of matrix (shape ) into Shared Memory. The physical capacity of the hardware dictates the maximum limits: Hopper H100 provides 228 KB of shared memory per SM (max 227 KB usable per thread block). Blackwell B200 (GB100) retains 228 KB per SM, matching H100. Tile sizes are governed by the same 228 KB physical limit. To compute the output tile (), each thread claims a sub-tile, moving values from Shared Memory into its Registers for calculation.

Key Takeaways

Tiling increases Arithmetic Intensity, transforming memory-bound kernels into compute-bound kernels.

Tile sizes are strictly bounded by physical hardware limits: 228 KB SMEM in both H100 and B200 (GB100, CC 10.0).

Tiling necessitates rigorously managing the staging flow from Global Memory → Shared Memory → Registers.

If a tile exceeds local bounds, Hopper's Distributed Shared Memory can cluster up to 16 SMs to handle the payload.

Why This Matters

Core Intuition

Technical Deep Dive

Key Takeaways

Internals / Execution Flow

Performance Implications

Real-World Usage

Common Bottlenecks

Optimization Strategies

Tools & Frameworks

Interview Guidance

Related Concepts