GPU Memory Systems

Shared Memory Allocation and Bank Conflicts

Shared memory is divided into independent hardware banks to allow highly parallel, simultaneous access across threads within a warp.

Published June 1, 2026 · By MortalApps · 5 min read · ~874 words

TL;DR

Shared memory is divided into independent hardware banks to allow highly parallel, simultaneous access across threads within a warp.
The core purpose is to enable high-bandwidth, latency data sharing and communication between threads in a cooperative block.
The primary optimization idea is orchestrating thread memory access patterns to ensure uniform distribution across all banks, avoiding serialization.
The most important engineering insight is that an -way bank conflict reduces the effective shared memory bandwidth by a factor of , starving compute units.

Why This Matters Intuition Deep Dive Takeaways Related

Why This Matters

In heavy AI workloads like matrix multiplication (GEMM) and attention kernels, shared memory operates as the primary high-speed staging buffer feeding the Tensor Cores. If bank conflicts exist during the critical load from shared memory to registers (via the ldmatrix instruction), the warp execution pipeline stalls. This starvation leaves the multi-petaFLOP Tensor Cores idle, completely tanking overall system performance.

Core Intuition

Imagine 32 bank tellers (representing the memory banks) serving a line of 32 customers (representing a warp of threads). If every customer approaches a different teller, all 32 transactions complete simultaneously in a single cycle. If, however, all 32 customers require a document from the exact same teller, the transactions must be processed sequentially, taking 32 cycles. The hardware maps memory addresses to these banks interleaved at the word level to encourage natural distribution.

Technical Deep Dive

Shared memory is physically organized into 32 distinct banks, each of which is 4 bytes (32 bits) wide. The specific bank ID for a given byte address is determined by the hardware using the formula: .

Because consecutive 4-byte words are assigned to successive banks, a stride-1 access pattern of 32-bit float types by a warp perfectly distributes one memory access per bank. However, if threads access memory with a stride that is a multiple of 2 (for instance, reading a column-major matrix in a row-wise fashion), a 2-way bank conflict occurs. A stride of 32 bytes causes all threads to request data from the exact same bank, resulting in a massive 32-way conflict.

Key Takeaways

Shared memory comprises 32 distinct hardware banks, each 4 bytes wide, creating a repeating pattern every 128 bytes.

Bank conflicts force the hardware to serialize access, scaling latency linearly with the degree of the conflict.

Accessing memory with multiples-of-32 strides guarantees worst-case conflicts unless the exact same address is requested, which triggers a broadcast.

XOR swizzling and layout padding are the primary software defense mechanisms against throughput-destroying bank serialization.

A 32-way bank conflict effectively divides the theoretical shared memory bandwidth by 32. In modern deep learning workloads, shared memory throughput must vastly exceed 1 TB/s aggregate per SM just to keep the Tensor Cores fed. Serialization directly stalls the execution pipeline, creating latency spikes that cannot be hidden by asynchronous compute.

Access Stride

Bank Conflict Degree	Bandwidth Efficiency	Stride 1 (Linear)	No Conflict
100%	Stride 2	2-way Conflict	50%
Stride 8	8-way Conflict	12.5%	Stride 32

32-way Conflict

~3.1%

Why This Matters

Core Intuition

Technical Deep Dive

Key Takeaways

Internals / Execution Flow

Performance Implications

Real-World Usage

Common Bottlenecks

Optimization Strategies

Tools & Frameworks

Interview Guidance

Related Concepts