NVLink Memory Communication
NVLink is NVIDIA's proprietary, high-speed, wire-based serial interconnect that bypasses PCIe bottlenecks for GPU-to-GPU communication.
Source: mortalapps.com
- The core purpose is to enable memory pooling across multiple GPUs, allowing a node or rack to act as a massive, unified GPU.
- The primary optimization idea is to run many links per GPU in parallel and route them through NVSwitch, which provides non-blocking all-to-all connectivity.
- The most important engineering insight is the scale gap: NVLink 5.0 delivers 1.8 TB/s of bidirectional bandwidth per GPU, roughly 14x the ~128 GB/s of a PCIe Gen 5 x16 link.
Why This Matters
Modern LLMs such as GPT-4 or Llama 3.1 405B cannot fit within the memory capacity of a single GPU. Tensor parallelism and pipeline parallelism split weights and activations across multiple GPUs, so the GPUs must exchange data on every forward and backward pass. If that traffic travels over PCIe, communication time can exceed compute time. NVLink relieves this bottleneck, enabling near-linear scaling within a node or rack.
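A rough back-of-envelope sketch of both claims follows. The workload numbers (a roughly 400-billion-parameter model in 16-bit precision, 80 GB of HBM per GPU, an 8192 x 16384 activation tensor, 8 GPUs) are illustrative assumptions, as are the per-direction bandwidths of about 64 GB/s for PCIe Gen 5 x16 and 900 GB/s for NVLink 5.0. The all-reduce estimate is a bandwidth-only ring model that ignores latency and protocol overhead, so treat the results as orders of magnitude rather than measurements.

```python
import math

GB = 1e9  # decimal gigabyte, as used in bandwidth and capacity specs


def weight_footprint_gb(num_params, bytes_per_param=2):
    """Memory for the weights alone (no KV cache, activations, or optimizer state)."""
    return num_params * bytes_per_param / GB


def ring_allreduce_seconds(message_bytes, num_gpus, bytes_per_s_per_direction):
    """Bandwidth-only ring all-reduce: each GPU sends 2*(N-1)/N of the message."""
    traffic = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return traffic / bytes_per_s_per_direction


# Assumed model size and per-GPU memory (hypothetical values).
weights_gb = weight_footprint_gb(400e9)
min_gpus = math.ceil(weights_gb / 80)

# Assumed activation tensor: 8192 tokens x 16384 hidden units x 2 bytes each.
message_bytes = 8192 * 16384 * 2

pcie_s = ring_allreduce_seconds(message_bytes, 8, 64 * GB)     # assumed PCIe Gen 5 x16
nvlink_s = ring_allreduce_seconds(message_bytes, 8, 900 * GB)  # assumed NVLink 5.0

print(f"weights: {weights_gb:.0f} GB, needs at least {min_gpus} GPUs")
print(f"all-reduce over PCIe:   {pcie_s * 1e3:.2f} ms")
print(f"all-reduce over NVLink: {nvlink_s * 1e3:.2f} ms")
```

Under this model the speedup equals the bandwidth ratio, so the gap between the two interconnects carries over directly to every collective that a tensor-parallel layer issues.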
Core Intuition
If HBM is the GPU's immediate desk and PCIe is the postal service, NVLink is a dedicated highway system connecting the desks of 8 to 72 workers. Because the highway is so wide, a worker on GPU 0 can reach into the desk of GPU 7 far faster than the postal service allows, though still more slowly than reaching into their own desk. This shifts the programming model from "distributed compute" toward "unified compute."
Technical Deep Dive
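NVLink operates within the existing CUDA memory model and requires no NVLink-specific API calls. Once peer access is enabled (the same cudaDeviceEnablePeerAccess path used for PCIe peer-to-peer), a kernel on one GPU can read and write another GPU's memory through ordinary pointers, and the traffic travels over NVLink when a link is present.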
- NVLink 4.0 (Hopper): delivers 900 GB/s of bidirectional bandwidth per GPU via 18 links running at approximately 25 GB/s per link per direction.
- NVLink 5.0 (Blackwell): delivers 1.8 TB/s of bidirectional bandwidth per GPU via 18 links running at approximately 50 GB/s per link per direction.
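The per-GPU figures follow directly from the link counts. The short sketch below reproduces them; the helper name is hypothetical, and the 25 GB/s per-direction rate for NVLink 4.0 is derived here from the 900 GB/s total across 18 links.

```python
def per_gpu_bidirectional_gbs(links, gbs_per_link_per_direction):
    """Total per-GPU bandwidth in GB/s: links x per-direction rate x 2 directions."""
    return links * gbs_per_link_per_direction * 2


hopper_gbs = per_gpu_bidirectional_gbs(links=18, gbs_per_link_per_direction=25)
blackwell_gbs = per_gpu_bidirectional_gbs(links=18, gbs_per_link_per_direction=50)

print(f"NVLink 4.0 (Hopper):    {hopper_gbs} GB/s")
print(f"NVLink 5.0 (Blackwell): {blackwell_gbs} GB/s")
```

Both generations use 18 links per GPU, so the doubling from Hopper to Blackwell comes entirely from the faster per-link signaling rate.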
At the node level, each GPU's NVLinks connect to NVSwitch chips. The 4th-generation NVSwitch (for Hopper) supports an all-to-all topology with ~1 µs latency. For Blackwell, the 5th-generation NVLink Switch extends this capability to 72 GPUs (the GB200 NVL72 architecture). Each switch provides 14.4 TB/s of non-blocking switching capacity, and the 72-GPU domain reaches roughly 130 TB/s of aggregate bandwidth (72 × 1.8 TB/s).
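The aggregate figures can be checked with simple arithmetic. In the sketch below, the 8-GPU Hopper node size and the count of nine switches in an NVL72 rack are assumptions that do not come from the text above; the per-GPU and per-switch bandwidths do.

```python
# All values in GB/s so the arithmetic stays in exact integers.
HOPPER_GBS_PER_GPU = 900
BLACKWELL_GBS_PER_GPU = 1800
GBS_PER_SWITCH = 14400       # 14.4 TB/s of switching capacity per switch

HOPPER_GPUS_PER_NODE = 8     # assumption: a typical 8-GPU Hopper node
NVL72_GPUS = 72
NVL72_SWITCHES = 9           # assumption: switch count in one NVL72 rack

hopper_node_gbs = HOPPER_GPUS_PER_NODE * HOPPER_GBS_PER_GPU
nvl72_domain_gbs = NVL72_GPUS * BLACKWELL_GBS_PER_GPU
nvl72_switch_gbs = NVL72_SWITCHES * GBS_PER_SWITCH

print(f"Hopper node aggregate:  {hopper_node_gbs / 1000:.1f} TB/s")
print(f"NVL72 GPU-side total:   {nvl72_domain_gbs / 1000:.1f} TB/s")
print(f"NVL72 switch-side total: {nvl72_switch_gbs / 1000:.1f} TB/s")
```

If the assumed switch count holds, the GPU-side and switch-side totals agree, which is what a non-blocking fabric requires: the switches can carry every GPU's full link bandwidth at once.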
| NVLink Generation | GPU Architecture | Bandwidth per GPU | Aggregate Switch Bandwidth |
|---|---|---|---|
| 3rd Generation | Ampere (A100) | 600 GB/s | - |
| 4th Generation | Hopper (H100) | 900 GB/s | 7.2 TB/s |
| 5th Generation | Blackwell (B200) | 1.8 TB/s | 130 TB/s (NVL72) |