Cross-Rack GPU Communication
Source: mortalapps.com
- Cross-rack communication fundamentally relies on scale-out fabrics (InfiniBand or Ethernet) organized in Clos (Leaf-Spine) topologies, dropping bandwidth drastically compared to intra-rack NVLink.
- Managing cross-rack traffic is critical; collective algorithms try to keep communication local and avoid "spine crossings", which add latency and congestion.
- Network designs commonly employ rail-optimized architectures, which connect the GPU with the same local index in every server to the same leaf switch, so that traffic within a rail stays within a single switch hop.
- Hyperscalers meticulously isolate compute traffic (the backend) from storage and management traffic (the frontend) into physically separate network fabrics to prevent interference.
Why This Matters
While a single densely packed rack like the GB200 NVL72 provides 130 TB/s of aggregate NVLink bandwidth, frontier models trained on 100,000 GPUs span more than a thousand physical racks. Once traffic leaves the rack through optical transceivers into the Top-of-Rack (ToR) and Spine switches, per-GPU bandwidth drops from terabytes per second to gigabytes per second. If a bandwidth-hungry parallelization strategy (like Tensor Parallelism) inadvertently spills across racks, every step waits on the optical fabric, and the cluster effectively runs at the speed of its slowest link, destroying training efficiency and wasting millions in capital expenditure.
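To make the size of that drop concrete, the back-of-the-envelope sketch below divides the 130 TB/s rack figure across the 72 GPUs of an NVL72 and compares the result with a per-GPU scale-out NIC. The 400 Gb/s NIC rate and the 10 GB payload are illustrative assumptions rather than figures from the text, and the model ignores latency, protocol overhead, and congestion.

```python
# Figures from the text: 130 TB/s aggregate NVLink bandwidth in a 72-GPU rack.
RACK_NVLINK_TB_PER_S = 130.0
GPUS_PER_RACK = 72

# ASSUMPTION (not from the text): one 400 Gb/s scale-out NIC per GPU.
ASSUMED_NIC_GBIT_PER_S = 400.0


def ideal_transfer_seconds(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time: payload / bandwidth, with no latency or overhead."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return payload_gb / bandwidth_gb_per_s


# Per-GPU share of the rack's NVLink bandwidth, in GB/s.
nvlink_gb_per_s = RACK_NVLINK_TB_PER_S * 1000.0 / GPUS_PER_RACK
# Convert the NIC line rate from gigabits to gigabytes per second.
nic_gb_per_s = ASSUMED_NIC_GBIT_PER_S / 8.0

payload_gb = 10.0  # hypothetical payload size
intra_rack_s = ideal_transfer_seconds(payload_gb, nvlink_gb_per_s)
cross_rack_s = ideal_transfer_seconds(payload_gb, nic_gb_per_s)
slowdown = cross_rack_s / intra_rack_s

print(f"NVLink per GPU: {nvlink_gb_per_s:.0f} GB/s, NIC per GPU: {nic_gb_per_s:.0f} GB/s")
print(f"Cross-rack transfer is about {slowdown:.0f}x slower")
```

Under these assumptions the same payload takes roughly 36 times longer over the fabric than over NVLink, which is why a collective that leaks across racks dominates the step time.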
Core Intuition
Imagine a group of specialized workers in a single office room (a rack). They can communicate with each other instantly (NVLink). To talk to another office across the campus (cross-rack), they must write a formal letter, hand it to a mailroom (Leaf Switch), which sends it to a central sorting facility (Spine Switch), which finally routes it to the target office. The overarching goal of cluster communication design is to ensure that intense, high-volume brainstorming happens exclusively within the office, while only finalized, infrequent summaries are mailed across campus.
Technical Deep Dive
Cross-rack fabrics are built as multi-tier Clos topologies. Leaf Switches are the physical entry point for each node; in standard ToR designs, all GPUs in a server connect to a single switch. Spine Switches interconnect the leaves to provide an any-to-any fabric that is either non-blocking or minimally oversubscribed. To keep latency low, cross-rack communication relies on RDMA transports (InfiniBand or RoCEv2) combined with GPUDirect, so payloads move directly between GPU memory and the NIC without being staged through the host CPU.
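The difference between the ToR wiring described here and the rail-optimized wiring mentioned in the summary can be sketched as a mapping from a GPU's position to its leaf switch. This is a simplified model under stated assumptions: a two-tier leaf-spine fabric, one leaf per rack in the ToR case, and one leaf per rail in the rail-optimized case. The function names and the `nodes_per_rack` parameter are hypothetical and chosen only for illustration.

```python
def tor_leaf(node: int, local_rank: int, nodes_per_rack: int) -> int:
    """ToR wiring: every GPU of a server attaches to its rack's single leaf.

    local_rank is unused here; it is kept so both mappings share a signature.
    """
    return node // nodes_per_rack


def rail_leaf(node: int, local_rank: int) -> int:
    """Rail-optimized wiring (assumed): GPU k of every server attaches to leaf k."""
    return local_rank


def switch_hops(leaf_a: int, leaf_b: int) -> int:
    """Switches traversed in a two-tier fabric: 1 if the leaf is shared,
    otherwise 3 (leaf -> spine -> leaf)."""
    return 1 if leaf_a == leaf_b else 3


# Hypothetical example: GPU 3 on node 0 talks to GPU 3 on node 9, 4 nodes per rack.
nodes_per_rack = 4
tor_hops = switch_hops(tor_leaf(0, 3, nodes_per_rack), tor_leaf(9, 3, nodes_per_rack))
rail_hops = switch_hops(rail_leaf(0, 3), rail_leaf(9, 3))
print(f"ToR: {tor_hops} switch hops, rail-optimized: {rail_hops} switch hop")
```

In this model the same-rank pair crosses the spine under ToR wiring but stays on one leaf under rail-optimized wiring. GPUs with different local ranks land on different rails, so collectives typically move that traffic over NVLink inside the server first rather than across the fabric.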
Because optical links (AOCs and transceivers) are expensive and failure-prone at massive scale, network oversubscription is common at the spine tier. A 2:1 oversubscription means the leaf switch has twice as much downlink bandwidth to the GPUs as uplink bandwidth to the spine. The design is a bet that not all GPUs will send traffic across the spine at the same time; when they do, each GPU gets at most half of its line rate.
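The oversubscription arithmetic can be sketched as follows. The port counts and the 400 Gb/s link speed are illustrative assumptions, and the fair-share model is a simplification that assumes uplink capacity is split evenly among the GPUs that are actively sending across the spine.

```python
def oversubscription_ratio(downlinks: int, downlink_gbps: float,
                           uplinks: int, uplink_gbps: float) -> float:
    """Leaf downlink capacity (to GPUs) divided by uplink capacity (to spines)."""
    if uplinks <= 0 or uplink_gbps <= 0:
        raise ValueError("uplink capacity must be positive")
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)


def per_gpu_spine_gbps(downlink_gbps: float, ratio: float,
                       active_fraction: float) -> float:
    """Even share of uplink bandwidth per active GPU, capped at its line rate.

    active_fraction is the fraction of the leaf's GPUs that send traffic
    across the spine at the same moment (0 < active_fraction <= 1).
    """
    if not 0 < active_fraction <= 1:
        raise ValueError("active_fraction must be in (0, 1]")
    share = downlink_gbps / (ratio * active_fraction)
    return min(downlink_gbps, share)


# Hypothetical leaf: 32 x 400 Gb/s downlinks, 16 x 400 Gb/s uplinks.
ratio = oversubscription_ratio(32, 400.0, 16, 400.0)
all_active = per_gpu_spine_gbps(400.0, ratio, active_fraction=1.0)
half_active = per_gpu_spine_gbps(400.0, ratio, active_fraction=0.5)
print(f"ratio {ratio:.0f}:1, all active: {all_active:.0f} Gb/s, "
      f"half active: {half_active:.0f} Gb/s")
```

With this hypothetical leaf the ratio is 2:1. Each GPU keeps its full 400 Gb/s as long as no more than half of the GPUs cross the spine at once, and drops to 200 Gb/s when all of them do.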