Attention Block Tiling Strategies
Tiling decomposes the massive attention matrix into smaller blocks that fit precisely within the SM's ultra-fast SRAM.
Source: mortalapps.com- Tiling decomposes the massive
attention matrix into smaller blocks (
) that fit precisely within the SM's ultra-fast SRAM.
- The fundamental enabler of tiling is the "Online Softmax" algorithm, which mathematically reformulates softmax to execute dynamically in blocks.
- Block sizes are dynamically calculated based on sequence length, head dimension, and available hardware SRAM capacity.
- Effective tiling reduces memory accesses from an intractable
to a highly efficient
, where is SRAM size.
Why This Matters
Global memory (HBM) is agonizingly slow compared to the compute capacity of Tensor Cores. Without tiling, attention requires transferring the full attention matrix multiple times across this slow bus. By sizing blocks perfectly to the GPU's Shared Memory (SRAM)—which delivers tens of terabytes per second in bandwidth—infrastructure engineers can bypass the notorious "Memory Wall," allowing attention to scale linearly in memory complexity and unlocking large context windows.
Core Intuition
Imagine trying to calculate the average height of a stadium of 100,000 people, but your notebook can only hold 100 names at a time. Instead of pulling all 100,000 names, computing an intermediate number, walking out of the stadium to write it on a giant whiteboard, and repeating, you keep a running tally (a local maximum and a denominator) in your small notebook and update it block by block. Tiling does exactly this for GPUs. It slices the Q, K, and V matrices into chunks that fit the SRAM "notebook," performing local attention and mathematically adjusting it into global attention on the fly.
Technical Deep Dive
The core mechanism involves setting row block size and column block size
such that the required tiles from
, and the intermediate output
fit strictly into the available SRAM size
. The formulas derive to
and
. The mathematical breakthrough enabling this is Online Softmax. Standard softmax requires two complete passes over the data to find the global maximum and the sum of exponentials
. Online Softmax computes these variables dynamically block by block 4:
and
. By applying the scaling factor
to the previously accumulated denominator and the previous output block, the algorithm correctly patches the local softmax into a globally accurate softmax without ever materializing the full row.
, then immediately updates
. The accumulator writes back to SRAM, eventually flushing the final globally accurate