Tensor Computing

Cooperative Groups and Distributed Shared Memory

Thread Block Clusters guarantee co-scheduling of multiple blocks on the same GPU Processing Cluster (GPC).

Published June 1, 2026 · By MortalApps · 3 min read · ~503 words

TL;DR

Thread Block Clusters guarantee co-scheduling of multiple blocks on the same GPU Processing Cluster (GPC).
Clusters unlock Distributed Shared Memory (DSMEM).
Allows Block A to directly read/write Block B's Shared Memory over a high-speed intra-cluster network, bypassing HBM.
Max portable size is 8 blocks; Blackwell allows an opt-in of 16 blocks.

Why This Matters Intuition Deep Dive Takeaways Related

Why This Matters

In standard CUDA, thread blocks are strictly isolated. To share data, they must write to slow Global Memory (HBM). By grouping blocks into Clusters, AI engineers can route intermediate data (like partial KV-cache reductions or softmax denominators) directly between SMs via SRAM, saving terabytes of global memory bandwidth.

Core Intuition

Standard CUDA blocks are like separate islands; to trade goods, they must load them onto slow cargo ships (Global Memory). A Thread Block Cluster builds high-speed bridges (DSMEM network) between adjacent islands, allowing them to instantly trade goods (Shared Memory) without ever using the ships.

Technical Deep Dive

Introduced in Hopper and expanded in Blackwell, a Thread Block Cluster is an optional hierarchy layer. When launched, the hardware scheduler guarantees these blocks are co-scheduled simultaneously on the same GPC. This co-scheduling allows the hardware to map all cluster members' shared memory into a unified Distributed Shared Memory (DSMEM) address space. A thread in Block 0 can calculate a pointer into the SMEM of Block 1 using the Cooperative Groups API: cluster.map_shared_rank(SMEM, DST_BLOCK_RANK).

Key Takeaways

Clusters group up to 8 (portable) or 16 (Blackwell opt-in) thread blocks.

Clusters guarantee simultaneous co-scheduling.

DSMEM enables direct SM-to-SM memory access over a dedicated network.

Solves SMEM capacity limits by pooling SRAM across the cluster.

Why This Matters

Core Intuition

Technical Deep Dive

Key Takeaways

Internals / Execution Flow

Performance Implications

Real-World Usage

Common Bottlenecks

Optimization Strategies

Tools & Frameworks

Interview Guidance

Related Concepts