NVLink and NVSwitch Systems

NVLink is a proprietary, wire-based serial communications protocol specifically engineered to bypass the PCIe bus, providing massive GPU-to-GPU bandwidth.

Published June 1, 2026 · By MortalApps · 6 min read · ~1,073 words

TL;DR

NVLink is a proprietary, wire-based serial communications protocol specifically engineered to bypass the PCIe bus, providing massive GPU-to-GPU bandwidth.
The sixth-generation NVLink (Rubin architecture) delivers 3.6 TB/s of bidirectional bandwidth per GPU, roughly fourteen times the bandwidth of a PCIe Gen6 x16 link.
NVSwitch transforms NVLink from a basic point-to-point mesh into a packet-switched, fully connected, non-blocking L1 network domain.
The NVSwitch silicon incorporates SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) hardware accelerators, which implement In-Network Computing by performing mathematical reductions directly inside the network fabric.


Why This Matters

Modern artificial intelligence model architectures, notably the Mixture of Experts (MoE) paradigm, inherently demand massive all-to-all communication phases during token dispatch. Standard PCIe lanes ruthlessly bottleneck these transfers, forcing multi-thousand-dollar GPUs to idle while waiting for straggling data packets. NVLink and NVSwitch obliterate this hardware barrier, allowing systems to unify disparate GPU memory pools into a single, highly coherent address space. This architecture enables exascale inference and synchronized distributed training without the devastating performance penalties associated with traditional interconnects.
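As a rough illustration of why the link rate matters during token dispatch, the sketch below estimates the ideal time for one GPU to push a payload over each interconnect. The 1 GB payload is a hypothetical figure chosen for illustration. The 63.0 GB/s PCIe Gen 5 x16 figure is treated as a one-direction rate, and the NVLink 5.0 rate is taken as half of its 1,800 GB/s bidirectional figure. The model ignores latency, protocol overhead, and congestion, so real transfers will be slower.

```python
def transfer_time_ms(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal time in milliseconds to move payload_gb at a fixed bandwidth."""
    return payload_gb / bandwidth_gb_per_s * 1000.0


# Hypothetical dispatch payload per GPU (an assumption, not from the article).
PAYLOAD_GB = 1.0

# Assumed one-direction rates in GB/s.
PER_DIRECTION_GB_PER_S = {
    "PCIe Gen 5 x16": 63.0,
    "NVLink 5.0 (Blackwell)": 1800.0 / 2,  # half of the bidirectional figure
}

times_ms = {
    name: transfer_time_ms(PAYLOAD_GB, bw)
    for name, bw in PER_DIRECTION_GB_PER_S.items()
}
speedup = times_ms["PCIe Gen 5 x16"] / times_ms["NVLink 5.0 (Blackwell)"]

for name, ms in times_ms.items():
    print(f"{name}: {ms:.2f} ms")
print(f"Ideal speedup: {speedup:.1f}x")
```

Under these assumptions the gap is an order of magnitude per dispatch phase, and an MoE model repeats that phase in every expert layer.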

Core Intuition

Conceptualize PCIe as a standardized municipal highway heavily regulated by traffic lights and central dispatch logic (the CPU and Root Complex). In contrast, NVLink is a dedicated, multi-lane bullet train network connecting highly specific high-priority destinations directly to one another. While the initial iterations of NVLink provided direct point-to-point "rails" between adjacent accelerators, the introduction of the NVSwitch ASIC created a central routing hub. This hub permits any GPU to communicate with any other GPU in the rack at hardware line rate without blocking or interfering with adjacent traffic flows.
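The difference between direct "rails" and a central hub can be sketched with simple link counting. In a direct mesh, each GPU has to divide its links among all of its peers. With a switch, every link runs to the fabric and can serve whichever peer is active. The figures of 18 links per GPU and 8 GPUs are illustrative assumptions rather than values from this article, and the function names are hypothetical.

```python
def links_per_peer_direct(links_per_gpu: int, num_gpus: int) -> int:
    """Direct mesh: a GPU's links are divided among its num_gpus - 1 peers."""
    return links_per_gpu // (num_gpus - 1)


def links_per_peer_switched(links_per_gpu: int) -> int:
    """Switched fabric: all links reach the switch, so all can serve one peer."""
    return links_per_gpu


LINKS_PER_GPU = 18  # assumed for illustration
NUM_GPUS = 8        # assumed for illustration

direct = links_per_peer_direct(LINKS_PER_GPU, NUM_GPUS)
switched = links_per_peer_switched(LINKS_PER_GPU)

print(f"Direct mesh: {direct} links toward any single peer")
print(f"Switched:    {switched} links toward any single peer")
```

In the direct case, most of a GPU's bandwidth is unavailable for any one pairwise transfer. The switched case makes the full per-GPU rate available to a single peer, provided the switch itself is non-blocking.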

Technical Deep Dive

The NVLink protocol is built on NVIDIA's high-speed signaling interconnect (NVHS), which uses finely tuned differential pairs. Per-GPU bidirectional bandwidth has scaled aggressively across generations: NVLink 1.0 (Pascal architecture, 2016) provided 160 GB/s, NVLink 3.0 (Ampere) reached 600 GB/s, NVLink 4.0 (Hopper) reached 900 GB/s, and NVLink 5.0 (Blackwell) doubled that to 1.8 TB/s. Most recently, NVLink 6.0 (Rubin) achieves 3.6 TB/s per GPU.
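The generation-over-generation scaling can be checked with a few lines of arithmetic. This sketch uses only the per-GPU figures quoted in this article, with the NVLink 4.0 value taken from the comparison table. It reports growth between the listed generations, not a per-year rate.

```python
# Per-GPU bidirectional bandwidth in GB/s, as quoted in this article.
GENERATIONS = [
    ("NVLink 1.0 (Pascal)", 160),
    ("NVLink 3.0 (Ampere)", 600),
    ("NVLink 4.0 (Hopper)", 900),
    ("NVLink 5.0 (Blackwell)", 1800),
    ("NVLink 6.0 (Rubin)", 3600),
]


def growth_factors(generations):
    """Return (name, factor) pairs comparing each entry with the one before it."""
    return [
        (cur_name, cur_bw / prev_bw)
        for (_, prev_bw), (cur_name, cur_bw) in zip(generations, generations[1:])
    ]


factors = growth_factors(GENERATIONS)
overall = GENERATIONS[-1][1] / GENERATIONS[0][1]

for name, factor in factors:
    print(f"{name}: {factor:.2f}x over the previous listed generation")
print(f"Overall: {overall:.1f}x from NVLink 1.0 to NVLink 6.0")
```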

Since 2018, NVSwitch has served as the backbone for scale-up architectures. A single NVSwitch ASIC provides multi-port packet switching logic. In rack-scale deployments like the GB300 NVL72, multiple NVSwitch chips are combined across a passive copper backplane to provide 130 TB/s of aggregate non-blocking bandwidth spanning 72 GPUs. Crucially, NVLink supports coherent shared memory across processors, allowing a GPU to issue direct reads and writes to a peer GPU's HBM or to host CPU memory (via NVLink-C2C).
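The 130 TB/s aggregate figure is consistent with a simple product of GPU count and per-GPU bandwidth. The sketch below assumes each of the 72 GPUs contributes the 1.8 TB/s NVLink 5.0 rate quoted earlier. That assumption is not stated explicitly for the GB300 NVL72 in this article.

```python
NUM_GPUS = 72
PER_GPU_GB_PER_S = 1800  # assumed: NVLink 5.0 bidirectional rate per GPU

# Aggregate bandwidth if every GPU can drive its full rate at once,
# which is what "non-blocking" implies for the fabric.
aggregate_tb_per_s = NUM_GPUS * PER_GPU_GB_PER_S / 1000

print(f"Aggregate: {aggregate_tb_per_s:.1f} TB/s")
```

The result rounds to the quoted 130 TB/s, which suggests the aggregate is a sum of per-GPU injection bandwidth rather than a separate switch-capacity number.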

Generation	Max Bidirectional Bandwidth	vs. PCIe Equivalent
PCIe Gen 5 x16	63.0 GB/s (per direction)	Baseline
NVLink 3.0 (Ampere)	600 GB/s	~9.5x
NVLink 4.0 (Hopper)	900 GB/s	~14.3x
NVLink 5.0 (Blackwell)	1,800 GB/s	~28.6x
NVLink 6.0 (Rubin)	3,600 GB/s	~57.1x
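The ratio column follows from dividing each NVLink figure by the 63.0 GB/s PCIe Gen 5 x16 baseline. A minimal sketch of that calculation, using only the values from the table:

```python
PCIE_GEN5_X16_GB_PER_S = 63.0  # baseline row from the table

NVLINK_GB_PER_S = {
    "NVLink 3.0 (Ampere)": 600,
    "NVLink 4.0 (Hopper)": 900,
    "NVLink 5.0 (Blackwell)": 1800,
    "NVLink 6.0 (Rubin)": 3600,
}

# Ratio of each NVLink generation to the PCIe baseline.
ratios = {
    name: bandwidth / PCIE_GEN5_X16_GB_PER_S
    for name, bandwidth in NVLINK_GB_PER_S.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

The baseline is the PCIe rate in one direction while the NVLink figures are bidirectional, so a like-for-like comparison would roughly halve each ratio.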

Key Takeaways

NVLink shatters the legacy PCIe bottleneck, delivering multi-TB/s of GPU-to-GPU bandwidth.

NVSwitch transforms basic point-to-point links into a fully non-blocking, packet-switched fabric.

In-Network Computing via SHARP inside the NVSwitch drastically accelerates collective operations by performing math in the network.

NVLink topologies dictate precisely where Tensor Parallelism software boundaries must be drawn to avoid catastrophic performance cliffs.

The Rubin architecture scales NVLink to 3.6 TB/s per GPU, continuing an aggressive trajectory of doubling interconnect speed.
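To make the in-network reduction takeaway concrete, the sketch below compares how many bytes each GPU must send for an all-reduce under two simplified models. The ring formula is the standard reduce-scatter plus all-gather cost. The in-network model is an idealization that assumes each GPU injects its buffer into the fabric once and the switch returns the reduced result. It does not describe how NCCL or the NVSwitch hardware actually schedules the operation, and the function names are hypothetical.

```python
def ring_allreduce_bytes_sent(buffer_bytes: int, num_gpus: int) -> float:
    """Ring all-reduce: reduce-scatter then all-gather.

    Each phase sends (N - 1) / N of the buffer per GPU.
    """
    return 2 * (num_gpus - 1) * buffer_bytes / num_gpus


def in_network_allreduce_bytes_sent(buffer_bytes: int) -> float:
    """Idealized in-switch reduction: each GPU sends its buffer exactly once."""
    return float(buffer_bytes)


BUFFER_BYTES = 2**30  # 1 GiB gradient buffer, chosen for illustration
NUM_GPUS = 72         # size of the NVL72 domain mentioned above

ring = ring_allreduce_bytes_sent(BUFFER_BYTES, NUM_GPUS)
in_network = in_network_allreduce_bytes_sent(BUFFER_BYTES)
ratio = ring / in_network

print(f"Ring:       {ring / 2**30:.3f} GiB sent per GPU")
print(f"In-network: {in_network / 2**30:.3f} GiB sent per GPU")
print(f"Traffic ratio: {ratio:.2f}x")
```

Under this model the per-GPU traffic approaches half of the ring cost as the GPU count grows. The reduction also completes in a single hop through the switch instead of N - 1 sequential ring steps.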
