GPU Memory Systems

NVLink Memory Communication

NVLink is NVIDIA's proprietary, high-speed, wire-based serial interconnect that bypasses PCIe bottlenecks for GPU-to-GPU communication.

Published June 1, 2026 · By MortalApps · 5 min read · ~802 words

TL;DR

NVLink is NVIDIA's proprietary, high-speed, wire-based serial interconnect that bypasses PCIe bottlenecks for GPU-to-GPU communication.
The core purpose is to enable memory pooling across multiple GPUs, allowing a node or rack to act as a massive, unified GPU.
The primary optimization idea is parallel multi-lane topologies utilizing NVSwitch to provide non-blocking all-to-all connectivity.
The most important engineering insight is the scale gap: NVLink 5.0 delivers 1.8 TB/s of bidirectional bandwidth per GPU, roughly 14x the 128 GB/s of a PCIe Gen 5 x16 link.


Why This Matters

The largest modern LLMs, such as GPT-4 or Llama 3.1 405B, cannot fit within the memory capacity of a single GPU. Tensor Parallelism and Pipeline Parallelism split weights and activations across multiple GPUs, which forces the GPUs to exchange data at every layer (Tensor Parallelism) or at every stage boundary (Pipeline Parallelism). If this continuous communication happens over PCIe, the time spent moving data can dominate the time spent computing. NVLink collapses this bottleneck, enabling near-linear scaling within a node or rack.
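To make the bottleneck concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a ring all-reduce (a common collective for Tensor Parallelism, not something this article specifies), ignores latency, protocol overhead, and compute overlap, and takes per-direction bandwidth as half of the bidirectional figures quoted in this article (128 GB/s for PCIe Gen 5, 1.8 TB/s for NVLink 5.0). The function name, the 1 GB message size, and the 8-GPU count are illustrative choices.

```python
def ring_allreduce_seconds(message_bytes: float, num_gpus: int, bw_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of one ring all-reduce.

    In a ring all-reduce each GPU sends 2 * (N - 1) / N times the message size.
    Latency, protocol overhead, and overlap with compute are ignored.
    """
    if num_gpus < 2:
        return 0.0  # a single GPU has nothing to exchange
    traffic_per_gpu = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return traffic_per_gpu / bw_bytes_per_s


GB = 1e9
PCIE_GEN5_PER_DIRECTION = 64 * GB   # assumption: half of 128 GB/s bidirectional
NVLINK5_PER_DIRECTION = 900 * GB    # assumption: half of 1.8 TB/s bidirectional

message = 1 * GB   # hypothetical payload per all-reduce
gpus = 8           # hypothetical tensor-parallel group size

t_pcie = ring_allreduce_seconds(message, gpus, PCIE_GEN5_PER_DIRECTION)
t_nvlink = ring_allreduce_seconds(message, gpus, NVLINK5_PER_DIRECTION)

print(f"PCIe Gen 5: {t_pcie * 1e3:.2f} ms per all-reduce")
print(f"NVLink 5.0: {t_nvlink * 1e3:.2f} ms per all-reduce")
print(f"Ratio:      {t_pcie / t_nvlink:.1f}x")
```

Because the model is bandwidth-only, the ratio it prints is simply the ratio of the two link speeds. Real workloads will differ, but the estimate shows why a collective that runs at every layer is so sensitive to interconnect bandwidth.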

Core Intuition

If HBM is the GPU's immediate desk, and PCIe is the postal service, NVLink is a dedicated highway system connecting the desks of 8 to 72 different workers. Because the highway is so wide, a worker on GPU 0 can reach directly into the desk of GPU 7 far faster than the postal service allows, though still more slowly than reaching into their own desk. This shifts the programming model from "distributed compute" toward "unified compute."

Technical Deep Dive

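NVLink operates transparently within the existing CUDA memory model: once peer access between two GPUs is enabled, a kernel on one GPU can read and write the other GPU's memory with ordinary loads and stores, and no NVLink-specific API calls are required. The same peer-to-peer mechanism also works over PCIe; NVLink simply makes those accesses much faster.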

NVLink 4.0 (Hopper): Delivers 900 GB/s bidirectional bandwidth per GPU via 18 links.

NVLink 5.0 (Blackwell): Delivers 1.8 TB/s bidirectional bandwidth per GPU via 18 links running at approximately 50 GB/s per link per direction.
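The per-GPU figures above are just link count times per-link rate times two directions. The short Python sketch below checks that arithmetic and, under the assumption that all 18 links run at the same rate, works backwards from the 900 GB/s Hopper figure to an implied per-link rate. The function names are illustrative.

```python
def per_gpu_bandwidth_gbs(links: int, gbs_per_link_per_direction: float) -> float:
    """Bidirectional per-GPU bandwidth in GB/s: links x per-direction rate x 2."""
    return links * gbs_per_link_per_direction * 2


def per_link_per_direction_gbs(total_bidirectional_gbs: float, links: int) -> float:
    """Per-link, per-direction rate implied by a per-GPU bidirectional total."""
    return total_bidirectional_gbs / (links * 2)


# NVLink 5.0: 18 links at ~50 GB/s per direction (figures from the text above)
nvlink5_total = per_gpu_bandwidth_gbs(links=18, gbs_per_link_per_direction=50.0)

# NVLink 4.0: 900 GB/s over 18 links, assuming every link runs at the same rate
nvlink4_link_rate = per_link_per_direction_gbs(total_bidirectional_gbs=900.0, links=18)

print(f"NVLink 5.0 per GPU: {nvlink5_total:.0f} GB/s bidirectional")
print(f"NVLink 4.0 per link: {nvlink4_link_rate:.0f} GB/s per direction")
```

With the same link count in both generations, the doubling from 900 GB/s to 1.8 TB/s comes entirely from a faster per-link rate.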

At the node level, individual NVLinks route into an NVSwitch. The NVSwitch paired with 4th generation NVLink (for Hopper) supports an all-to-all topology with ~1 µs latency. For Blackwell, the NVLink Switch built for 5th generation NVLink scales this capability out to 72 GPUs (the GB200 NVL72 architecture). Each switch provides 14.4 TB/s of aggregate non-blocking switching capacity, and the 72-GPU domain as a whole reaches 130 TB/s of aggregate bandwidth (72 GPUs at 1.8 TB/s each).
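The domain-level aggregate can be cross-checked by multiplying the GPU count by the per-GPU bandwidth. The sketch below does this for the 72-GPU Blackwell domain and, assuming the 7.2 TB/s Hopper figure in this article refers to an 8-GPU node, for Hopper as well. The function name is illustrative.

```python
def domain_aggregate_tbs(num_gpus: int, per_gpu_gbs: int) -> float:
    """Aggregate NVLink bandwidth of a domain in TB/s (GPU count x per-GPU GB/s)."""
    return num_gpus * per_gpu_gbs / 1000


# Blackwell GB200 NVL72: 72 GPUs at 1.8 TB/s (1800 GB/s) each
nvl72 = domain_aggregate_tbs(num_gpus=72, per_gpu_gbs=1800)

# Hopper: assumed 8-GPU node at 900 GB/s each
hopper_node = domain_aggregate_tbs(num_gpus=8, per_gpu_gbs=900)

print(f"NVL72 aggregate:       {nvl72:.1f} TB/s")
print(f"8-GPU Hopper aggregate: {hopper_node:.1f} TB/s")
```

The NVL72 product comes out slightly under the 130 TB/s headline number, which is a rounded figure.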

NVLink Generation | GPU Architecture | BW Per GPU | Aggregate Switch BW
3rd Generation    | Ampere (A100)    | 600 GB/s   | -
4th Generation    | Hopper (H100)    | 900 GB/s   | 7.2 TB/s
5th Generation    | Blackwell (B200) | 1.8 TB/s   | 130 TB/s (NVL72)

Key Takeaways

NVLink 5.0 (Blackwell) provides 1.8 TB/s of bidirectional bandwidth per GPU, about 14x the 128 GB/s of a PCIe Gen 5 x16 link.

NVSwitch establishes a non-blocking all-to-all topology equipped with SHARP in-network reduction capabilities.

The Blackwell NVL72 architecture radically extends the NVLink domain from 8 GPUs to 72 GPUs, yielding 130 TB/s aggregate bandwidth.
