InfiniBand NDR Networks

Source: mortalapps.com
TL;DR
  • InfiniBand NDR (Next Data Rate) powers 400 Gb/s per-port networks, doubling the bandwidth of the previous HDR generation and serving as the backbone of modern supercomputing.
  • The NVIDIA Quantum-2 switch system provides 64 non-blocking 400 Gb/s ports in a 1U chassis, for 51.2 Tb/s of aggregate bidirectional throughput (25.6 Tb/s in each direction).
  • NDR integrates In-Network Computing: the SHARPv3 protocol offloads collective reduction operations into the switch silicon.
  • The succeeding XDR generation (Quantum-3 architecture, Quantum-X800 switches) raises the per-port rate to 800 Gb/s and integrates co-packaged silicon photonics to reduce latency.

Why This Matters

As AI models demand petabytes of data exchange during the training phase, the network fabric cannot simply act as a passive pipe; it must actively accelerate the workload. InfiniBand provides deterministic low latency, lossless transmission through link-level flow control, and hardware-accelerated adaptive routing. NDR 400G and XDR 800G networks form the backbones of many of the world's most powerful AI supercomputers, allowing training efficiency to scale close to linearly across tens of thousands of GPUs.

Core Intuition

Standard Ethernet was originally designed as a best-effort network that tolerates packet loss and leaves recovery to higher layers. InfiniBand, by contrast, was engineered from the ground up for tightly coupled supercomputing. It uses credit-based, link-level flow control to guarantee lossless transmission: the receiver advertises how much buffer space it has free, and the sender transmits a packet only when it holds a credit for that space, so packets are never dropped because of buffer overflow. Furthermore, In-Network Computing (SHARP) turns the switch from a simple traffic intersection into a co-processor that reduces data as it routes it.
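The credit mechanism described above can be illustrated with a small simulation. This is a minimal sketch, not the InfiniBand wire protocol: the `Sender`, `Receiver`, and `simulate` names, the one-credit-per-packet granularity, and the tick-based timing are all assumptions made for illustration.

```python
from collections import deque


class Receiver:
    """Hypothetical downstream port with a fixed number of buffer slots."""

    def __init__(self, buffer_slots: int) -> None:
        self.buffer_slots = buffer_slots
        self.buffer = deque()

    def accept(self, packet: str) -> None:
        # A sender that respects its credits can never trigger this assertion.
        assert len(self.buffer) < self.buffer_slots, "buffer overrun"
        self.buffer.append(packet)

    def drain(self, count: int) -> int:
        """Consume up to `count` packets; return the number of freed slots."""
        freed = min(count, len(self.buffer))
        for _ in range(freed):
            self.buffer.popleft()
        return freed


class Sender:
    """Hypothetical upstream port that transmits only while it holds credits."""

    def __init__(self, initial_credits: int) -> None:
        self.credits = initial_credits
        self.queue = deque()

    def send_ready(self, receiver: Receiver) -> int:
        """Send queued packets until the queue or the credits run out."""
        sent = 0
        while self.queue and self.credits > 0:
            receiver.accept(self.queue.popleft())
            self.credits -= 1  # one credit spent per packet (assumed granularity)
            sent += 1
        return sent


def simulate(num_packets: int, buffer_slots: int, drain_per_tick: int) -> dict:
    """Run sender and receiver in lock-step ticks until every packet is consumed."""
    if buffer_slots <= 0 or drain_per_tick <= 0:
        raise ValueError("buffer_slots and drain_per_tick must be positive")
    receiver = Receiver(buffer_slots)
    sender = Sender(initial_credits=buffer_slots)
    sender.queue.extend(f"pkt{i}" for i in range(num_packets))

    delivered = ticks = max_occupancy = 0
    while delivered < num_packets:
        sender.send_ready(receiver)
        max_occupancy = max(max_occupancy, len(receiver.buffer))
        freed = receiver.drain(drain_per_tick)
        sender.credits += freed  # freed slots are returned to the sender as credits
        delivered += freed
        ticks += 1
    return {"delivered": delivered, "ticks": ticks, "max_occupancy": max_occupancy}


if __name__ == "__main__":
    print(simulate(num_packets=10, buffer_slots=4, drain_per_tick=2))
```

When the receiver drains more slowly than the sender could transmit, the sender stalls instead of overrunning the buffer. Backpressure replaces packet loss, which is the property the paragraph above attributes to InfiniBand.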

Technical Deep Dive

The Quantum-2 NDR Architecture (QM9700/QM9790 switch systems) uses 32 OSFP (Octal Small Form-factor Pluggable) physical connectors. Because each NDR OSFP transceiver is a twin-port device, these 32 physical cages provide 64 distinct 400 Gb/s ports, for 25.6 Tb/s in each direction, or 51.2 Tb/s of aggregate bidirectional non-blocking switching capacity. A single Quantum-2 switch handles over 66.5 billion packets per second.

The subsequent Quantum-3 XDR Architecture (Quantum-X800) scales the fabric to 144 ports of 800 Gb/s per switch. It integrates co-packaged silicon photonics, which reduce both latency and power consumption by shortening the distance electrical signals must travel before they are converted to light.

InfiniBand Generation | Per-Port Bandwidth | Switch Throughput | Key In-Network Feature
HDR (Quantum)         | 200 Gb/s           | 16 Tb/s           | SHARPv2
NDR (Quantum-2)       | 400 Gb/s           | 51.2 Tb/s         | SHARPv3
XDR (Quantum-X800)    | 800 Gb/s           | 115.2 Tb/s        | SHARPv4
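The switch throughput figures can be cross-checked against port counts with simple arithmetic. The sketch below assumes a 40-port HDR switch (a figure not given in this article) and uses the 64-port and 144-port counts stated in the text. The helper names are hypothetical.

```python
import math

# From the text: 32 OSFP cages, each carrying two NDR ports.
OSFP_CAGES = 32
PORTS_PER_CAGE = 2

# NDR and XDR port counts come from the article; the HDR port count is an
# assumption added here only to make the comparison possible.
GENERATIONS = {
    "HDR (Quantum)": {"ports": 40, "gbps_per_port": 200, "quoted_tbps": 16.0},
    "NDR (Quantum-2)": {"ports": 64, "gbps_per_port": 400, "quoted_tbps": 51.2},
    "XDR (Quantum-X800)": {"ports": 144, "gbps_per_port": 800, "quoted_tbps": 115.2},
}


def one_way_tbps(ports: int, gbps_per_port: int) -> float:
    """Total bandwidth in one direction, in Tb/s."""
    return ports * gbps_per_port / 1000


def counting_convention(ports: int, gbps_per_port: int, quoted_tbps: float) -> str:
    """Report whether a quoted figure matches a one-way or a two-way total."""
    one_way = one_way_tbps(ports, gbps_per_port)
    if math.isclose(quoted_tbps, one_way):
        return "one direction"
    if math.isclose(quoted_tbps, 2 * one_way):
        return "both directions"
    return "unexplained"


if __name__ == "__main__":
    print("NDR ports per switch:", OSFP_CAGES * PORTS_PER_CAGE)
    for name, spec in GENERATIONS.items():
        print(name, "->", counting_convention(**spec))
```

Under these assumptions, the HDR and NDR figures match a both-directions total, while the XDR figure matches a one-direction total (144 × 800 Gb/s), so it is worth normalizing the rows before comparing generations.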

The Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) allows the switch silicon to perform mathematical reductions (e.g., summing gradients). Switches coordinate to aggregate data as it moves up the network tree, so each level forwards one partial result instead of every input, and only a single reduced payload is sent back down. This greatly reduces the incast congestion typical of traditional software AllReduce implementations, where many senders converge on the same destination.
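The aggregation pattern can be sketched in a few lines. This is an illustration of hierarchical reduction only, not the SHARP protocol or any NVIDIA API: the two-level leaf/root tree, the `tree_allreduce` function, and the payload counting are assumptions.

```python
from typing import List, Tuple

Vector = List[float]


def vector_sum(vectors: List[Vector]) -> Vector:
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def tree_allreduce(
    host_gradients: List[Vector], hosts_per_leaf: int
) -> Tuple[List[Vector], int]:
    """Sum gradients through a hypothetical two-level switch tree.

    Returns the per-host results and the number of payloads that reach
    the root switch.
    """
    if hosts_per_leaf <= 0:
        raise ValueError("hosts_per_leaf must be positive")

    # Each leaf switch reduces the vectors of its attached hosts to one partial sum.
    leaf_partials = []
    for start in range(0, len(host_gradients), hosts_per_leaf):
        group = host_gradients[start:start + hosts_per_leaf]
        leaf_partials.append(vector_sum(group))

    # The root switch reduces the leaf partials to the final result.
    total = vector_sum(leaf_partials)

    # The single reduced payload is sent back down to every host.
    results = [list(total) for _ in host_gradients]
    return results, len(leaf_partials)


if __name__ == "__main__":
    gradients = [[float(i), float(2 * i)] for i in range(1, 9)]  # 8 hosts
    results, payloads_at_root = tree_allreduce(gradients, hosts_per_leaf=4)
    print("reduced vector:", results[0])
    print("payloads at root:", payloads_at_root, "instead of", len(gradients))
```

In this toy topology the root receives one payload per leaf switch rather than one per host, which is the reduction in converging traffic that the paragraph above describes.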

Key Takeaways

  • InfiniBand NDR provides 400 Gb/s per port, while XDR scales to 800 Gb/s.
  • The architecture ensures strictly lossless data transmission via credit-based flow control.
  • SHARP technology offloads mathematical reductions directly into the switch silicon, drastically reducing network traffic during synchronized training.
  • Co-packaged silicon photonics in the Quantum-X800 mitigate the severe power and distance limitations of high-speed copper interconnects.