Tensor Layouts and Memory Ordering
Source: mortalapps.com
- Tensor memory is logically multi-dimensional but physically linear.
- HBM accesses are transacted in 128-byte cache lines.
- Shared memory (SMEM) is divided into 32 banks; unoptimized columnar access causes severe bank conflicts.
- Core optimization: "Swizzling" (bitwise XOR) reorganizes data to ensure conflict-free SMEM accesses.
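To make the linear-memory and cache-line points concrete, the following Python sketch counts how many 128-byte lines a row read and a column read touch. It assumes float32 elements, a base address aligned to 128 bytes, and a hypothetical 1024-column row-major tensor; the helper names are illustrative only.

```python
LINE_BYTES = 128   # cache-line size stated in the list above
ELEM_BYTES = 4     # assumption: float32 elements
NUM_COLS = 1024    # assumption: hypothetical tensor width


def row_major_offset(row, col, num_cols):
    """Linear element index of (row, col) in a row-major tensor."""
    return row * num_cols + col


def cache_lines_touched(coords, num_cols):
    """Distinct 128-byte lines covered by the given (row, col) coordinates."""
    return {
        row_major_offset(r, c, num_cols) * ELEM_BYTES // LINE_BYTES
        for r, c in coords
    }


row_access = [(0, c) for c in range(32)]  # 32 neighbours within one row
col_access = [(r, 0) for r in range(32)]  # 32 neighbours within one column

print(len(cache_lines_touched(row_access, NUM_COLS)))  # 1
print(len(cache_lines_touched(col_access, NUM_COLS)))  # 32
```

Under these assumptions the row read is served by a single line, while the column read pulls in 32 lines and uses only 4 bytes from each.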
Why This Matters
If a warp reads a column of 4-byte elements from a row-major tensor in SMEM, and the row stride is a multiple of 128 bytes, every access lands in the same bank. The hardware then serializes the request, degrading shared-memory bandwidth by up to 32x. AI infrastructure engineers must match the SMEM layout to the access pattern to keep Tensor Cores fed; a mismatched layout caps achieved teraFLOPS regardless of peak compute.
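The serialization cost can be estimated with a small host-side model. This sketch assumes 32 banks that are each 4 bytes wide and a hypothetical 32x32 float32 tile stored row-major with no padding; `bank_of` and `conflict_degree` are illustrative names, not part of any CUDA API.

```python
from collections import Counter

NUM_BANKS = 32
BANK_WIDTH_BYTES = 4
TILE_COLS = 32  # assumption: hypothetical 32x32 float32 tile


def bank_of(byte_addr):
    """Bank that serves the 4-byte word containing byte_addr."""
    return (byte_addr // BANK_WIDTH_BYTES) % NUM_BANKS


def conflict_degree(byte_addrs):
    """Largest number of distinct words that share one bank (1 = no conflict).

    Threads reading the same word are counted once, since identical
    addresses can be served together.
    """
    words = {addr // BANK_WIDTH_BYTES for addr in byte_addrs}
    per_bank = Counter(word % NUM_BANKS for word in words)
    return max(per_bank.values())


# Each of the 32 lanes reads one float32 element.
row_read = [(0 * TILE_COLS + lane) * 4 for lane in range(32)]  # row 0
col_read = [(lane * TILE_COLS + 0) * 4 for lane in range(32)]  # column 0

print(conflict_degree(row_read))  # 1
print(conflict_degree(col_read))  # 32
```

A degree of 32 means the request is replayed 32 times in this model, which is where the worst-case 32x figure comes from.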
Core Intuition
Imagine 32 tellers at a bank (the 32 SMEM banks). If 32 customers (the threads of a warp) all need to speak to teller #1, 31 customers must wait in line (a bank conflict). If each customer speaks to a different teller, all transactions happen simultaneously. Swizzling is the scheme that distributes the customers evenly across tellers, whether they arrive in a row or a column.
Technical Deep Dive
Modern NVIDIA GPUs organize shared memory into 32 banks, each 4 bytes wide. Successive 32-bit words are assigned to successive banks. When threads within a warp access different addresses that map to the same bank, a conflict occurs and the accesses are serialized. Loading a matrix into SMEM and then reading it transposed (e.g., for a TN GEMM layout) therefore conflicts under standard row-major indexing. Swizzling XORs bits of the row index into the column index bits, so that consecutive rows place the same logical column in different banks. For a 128-byte swizzle configuration, the pattern repeats every 1024 bytes (8 rows of 128 bytes). Libraries like CuTe let developers specify patterns such as Swizzle<3, 3, 3> to permute the layout at compile time, so that a pattern matched to the tile shape and access width yields conflict-free SMEM reads.
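The XOR permutation can be modelled on the host in a few lines. The sketch below is a Python approximation of a three-parameter swizzle in the style of Swizzle<3, 3, 3>, based on the assumption that the parameters mean (number of bits, base bit, shift); the function `make_swizzle` is hypothetical and should be checked against the CuTe source before relying on it. The tile shapes and element sizes are also assumptions chosen for illustration.

```python
NUM_BANKS = 32


def make_swizzle(bits, base, shift):
    """Build f(offset) = offset ^ ((offset & mask) >> shift).

    mask selects `bits` bits starting at bit position base + shift.
    Assumption of this sketch: shift is positive.
    """
    mask = ((1 << bits) - 1) << (base + shift)

    def swizzle(offset):
        return offset ^ ((offset & mask) >> shift)

    return swizzle


# Case 1: hypothetical 32x32 float32 tile, one element per 4-byte bank word.
# XOR the 5 row bits into the 5 column bits of the element offset.
swz_f32 = make_swizzle(5, 0, 5)
row_banks = {swz_f32(5 * 32 + c) % NUM_BANKS for c in range(32)}  # row 5
col_banks = {swz_f32(r * 32 + 7) % NUM_BANKS for r in range(32)}  # column 7
print(len(row_banks), len(col_banks))  # 32 32

# The swizzle only permutes offsets inside the tile; nothing collides.
assert sorted(swz_f32(i) for i in range(1024)) == list(range(1024))

# Case 2: (3, 3, 3) on 2-byte elements with 64 elements (128 bytes) per row.
# Bits [3, 6) pick the 16-byte chunk in a row; bits [6, 9) are the row mod 8.
swz_333 = make_swizzle(3, 3, 3)
chunk_slots = [(swz_333(r * 64) % 64) // 8 for r in range(8)]
print(chunk_slots)  # [0, 1, 2, 3, 4, 5, 6, 7]

# The mapping repeats every 512 two-byte elements, i.e. every 1024 bytes.
assert all(swz_333(i + 512) == swz_333(i) + 512 for i in range(512))
```

In the first case both the row read and the column read cover all 32 banks. In the second case, logical chunk 0 of eight consecutive rows lands in eight different 16-byte slots, and the period of 512 two-byte elements matches the 1024-byte repeat described above.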