Shared Memory Allocation and Bank Conflicts
Shared memory is divided into independent hardware banks to allow highly parallel, simultaneous access across threads within a warp.
Source: mortalapps.com- Shared memory is divided into independent hardware banks to allow highly parallel, simultaneous access across threads within a warp.
- The core purpose is to enable high-bandwidth,
latency data sharing and communication between threads in a cooperative block.
- The primary optimization idea is orchestrating thread memory access patterns to ensure uniform distribution across all banks, avoiding serialization.
- The most important engineering insight is that an
-way bank conflict reduces the effective shared memory bandwidth by a factor of
, starving compute units.
Why This Matters
In heavy AI workloads like matrix multiplication (GEMM) and attention kernels, shared memory operates as the primary high-speed staging buffer feeding the Tensor Cores. If bank conflicts exist during the critical load from shared memory to registers (via the ldmatrix instruction), the warp execution pipeline stalls. This starvation leaves the multi-petaFLOP Tensor Cores idle, completely tanking overall system performance.
Core Intuition
Imagine 32 bank tellers (representing the memory banks) serving a line of 32 customers (representing a warp of threads). If every customer approaches a different teller, all 32 transactions complete simultaneously in a single cycle. If, however, all 32 customers require a document from the exact same teller, the transactions must be processed sequentially, taking 32 cycles. The hardware maps memory addresses to these banks interleaved at the word level to encourage natural distribution.
Technical Deep Dive
Shared memory is physically organized into 32 distinct banks, each of which is 4 bytes (32 bits) wide. The specific bank ID for a given byte address is determined by the hardware using the formula:
.
Because consecutive 4-byte words are assigned to successive banks, a stride-1 access pattern of 32-bit float types by a warp perfectly distributes one memory access per bank. However, if threads access memory with a stride that is a multiple of 2 (for instance, reading a column-major matrix in a row-wise fashion), a 2-way bank conflict occurs. A stride of 32 bytes causes all threads to request data from the exact same bank, resulting in a massive 32-way conflict.