Tensor Layouts and Memory Ordering

Tensor memory is logically multi-dimensional but physically linear. HBM accesses are transacted in 128-byte cache lines.

Source: mortalapps.com
TL;DR
  • Tensor memory is logically multi-dimensional but physically linear.
  • HBM accesses are transacted in 128-byte cache lines.
  • Shared memory (SMEM) is divided into 32 banks; unoptimized columnar access causes severe bank conflicts.
  • Core optimization: "Swizzling" (bitwise XOR) reorganizes data to ensure conflict-free SMEM accesses.

Why This Matters

If a warp reads a column of a row-major tensor in SMEM and the row stride is a multiple of 128 bytes, every address falls in the same bank. The hardware then serializes the request, cutting effective SMEM bandwidth by up to 32x. AI infrastructure engineers must match data layouts to access patterns to keep Tensor Cores fed; a mismatched layout caps delivered teraFLOPS regardless of peak compute.
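The 32x figure can be checked with a small host-side model. The sketch below is not GPU code: it assumes the bank mapping described in this article (32 banks, 4 bytes wide, successive words in successive banks) and a hypothetical 32x32 float32 tile, and the helper names `conflict_degree` and `element_address` are made up for illustration.

```python
from collections import defaultdict

NUM_BANKS = 32   # shared-memory banks
BANK_WIDTH = 4   # bytes per bank word
WARP_SIZE = 32   # threads issuing one request together


def conflict_degree(byte_addresses):
    """Largest number of distinct 4-byte words that map to a single bank.

    1 means conflict-free; N means the request is serialized into N passes.
    Threads that read the same word are collapsed by the set, which models
    the hardware broadcast.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // BANK_WIDTH
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


def element_address(row, col, cols, elem_size=4):
    """Byte offset of element (row, col) in a row-major tile."""
    return (row * cols + col) * elem_size


COLS = 32  # hypothetical tile width: 32 float32 values = 128 bytes per row

# Thread t reads element (0, t): one row, 32 consecutive words.
row_access = [element_address(0, t, COLS) for t in range(WARP_SIZE)]
# Thread t reads element (t, 0): one column, stride of 128 bytes.
col_access = [element_address(t, 0, COLS) for t in range(WARP_SIZE)]

print("row access conflict degree:", conflict_degree(row_access))
print("column access conflict degree:", conflict_degree(col_access))
```

Under these assumptions the row read touches each bank once, while the column read lands all 32 threads on bank 0.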

Core Intuition

Imagine 32 tellers at a bank (the 32 SMEM banks). If 32 customers (the threads of a warp) all need to speak to teller #1, 31 customers must wait in line (a bank conflict). If each customer speaks to a different teller, all transactions happen simultaneously. Swizzling is the scheme that distributes the customers evenly across tellers, regardless of whether they arrive in a row or a column.

Technical Deep Dive

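Modern NVIDIA GPUs organize shared memory into 32 banks, each 4 bytes wide. Successive 32-bit words are assigned to successive banks. When threads within a warp access different addresses that map to the same bank, a conflict occurs and the accesses are serialized; threads that read the same address are served by a broadcast. To load a matrix into SMEM and then read it transposed (e.g., a TN GEMM layout), standard row-major indexing fails, because either the write or the read walks down a column. Swizzling XORs selected row-index bits into the column-index bits of the address, so consecutive rows place the same logical column in different banks. For a 128-byte swizzle, the pattern repeats every 8 rows of 128 bytes, that is, every 1024 bytes. Libraries like CuTe let developers specify patterns such as Swizzle<3, 3, 3> (the number of bits to XOR, the number of low-order bits left untouched, and the shift between the two bit fields) to permute the layout so that accesses are conflict-free by construction, with the index math resolved at compile time.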
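The XOR step can be sketched on the host to see why it removes the conflict. The function below is modeled on the Swizzle<3, 3, 3> pattern named above and handles only a non-negative shift. The rest is assumed for illustration: 16-bit elements, 128-byte rows, and eight threads that each fetch one 16-byte vector from the same logical column of eight consecutive rows. It is a simplified model, not CuTe's implementation.

```python
def swizzle(offset, bits=3, base=3, shift=3):
    """XOR `bits` bits starting at position base+shift into the bits at `base`.

    `offset` is an element index; the low `base` bits are left untouched.
    """
    mask = ((1 << bits) - 1) << (base + shift)
    return offset ^ ((offset & mask) >> shift)


ELEM_SIZE = 2     # bytes per element (assumes a 16-bit type such as half)
ROW_ELEMS = 64    # 64 elements * 2 bytes = 128-byte rows
CHUNK_ELEMS = 8   # 8 elements * 2 bytes = one 16-byte vector
NUM_BANKS = 32
BANK_WIDTH = 4


def banks_of_chunk(elem_offset):
    """Banks covered by the 16-byte vector starting at elem_offset."""
    start = elem_offset * ELEM_SIZE
    return {((start + b) // BANK_WIDTH) % NUM_BANKS for b in range(0, 16, 4)}


def column_chunk_offsets(chunk, use_swizzle):
    """Element offsets of logical chunk `chunk` in rows 0..7."""
    offsets = []
    for row in range(8):
        off = row * ROW_ELEMS + chunk * CHUNK_ELEMS
        offsets.append(swizzle(off) if use_swizzle else off)
    return offsets


def distinct_banks(offsets):
    """Number of different banks touched by all the vectors together."""
    banks = set()
    for off in offsets:
        banks |= banks_of_chunk(off)
    return len(banks)


plain = distinct_banks(column_chunk_offsets(0, use_swizzle=False))
swizzled = distinct_banks(column_chunk_offsets(0, use_swizzle=True))
print("banks touched without swizzle:", plain)
print("banks touched with swizzle:", swizzled)
```

In this model the eight vectors total 128 bytes, so touching all 32 banks means no bank is hit twice. Without the swizzle every row maps the chunk to the same four banks; with it, row r reads physical chunk (chunk XOR r), which is different for each of the eight rows. The mapping only permutes offsets within each block of 512 elements (1024 bytes), which matches the repeat period stated above.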

Key Takeaways

  • HBM (global memory) favors contiguous, aligned 128-byte reads (coalescing).
  • SMEM favors accesses spread across distinct banks (no conflicts).
  • Swizzling bridges the gap by XORing row bits into the column bits of the SMEM address.
  • Layout math should be resolved at compile time via libraries like CuTe.