← Infrastructure Transformer Systems
Infrastructure

Attention Block Tiling Strategies

Tiling decomposes the massive attention matrix into smaller blocks that fit precisely within the SM's ultra-fast SRAM.

Source: mortalapps.com
TL;DR
  • Tiling decomposes the massive attention matrix into smaller blocks () that fit precisely within the SM's ultra-fast SRAM.
  • The fundamental enabler of tiling is the "Online Softmax" algorithm, which mathematically reformulates softmax to execute dynamically in blocks.
  • Block sizes are dynamically calculated based on sequence length, head dimension, and available hardware SRAM capacity.
  • Effective tiling reduces memory accesses from an intractable to a highly efficient , where is SRAM size.

Why This Matters

Global memory (HBM) is agonizingly slow compared to the compute capacity of Tensor Cores. Without tiling, attention requires transferring the full attention matrix multiple times across this slow bus. By sizing blocks perfectly to the GPU's Shared Memory (SRAM)—which delivers tens of terabytes per second in bandwidth—infrastructure engineers can bypass the notorious "Memory Wall," allowing attention to scale linearly in memory complexity and unlocking large context windows.

Core Intuition

Imagine trying to calculate the average height of a stadium of 100,000 people, but your notebook can only hold 100 names at a time. Instead of pulling all 100,000 names, computing an intermediate number, walking out of the stadium to write it on a giant whiteboard, and repeating, you keep a running tally (a local maximum and a denominator) in your small notebook and update it block by block. Tiling does exactly this for GPUs. It slices the Q, K, and V matrices into chunks that fit the SRAM "notebook," performing local attention and mathematically adjusting it into global attention on the fly.

Technical Deep Dive

The core mechanism involves setting row block size and column block size such that the required tiles from , and the intermediate output fit strictly into the available SRAM size . The formulas derive to and . The mathematical breakthrough enabling this is Online Softmax. Standard softmax requires two complete passes over the data to find the global maximum and the sum of exponentials . Online Softmax computes these variables dynamically block by block 4: and . By applying the scaling factor to the previously accumulated denominator and the previous output block, the algorithm correctly patches the local softmax into a globally accurate softmax without ever materializing the full row.

Key Takeaways

Tiling decomposes attention into SRAM-sized chunks to bypass HBM bandwidth limits.
Online Softmax is the crucial mathematical trick that makes tiling mathematically exact and stateful.
Performance scales directly with SRAM capacity; larger SRAM allows larger tiles and fewer memory passes.
High head dimensions aggressively constrain block sizes, acting as a natural physical ceiling on model architecture choices.