AI Networking

Cross-Rack GPU Communication

Published June 1, 2026 · By MortalApps · 5 min read · ~885 words

TL;DR

Cross-rack communication relies on scale-out fabrics (InfiniBand or Ethernet) organized in Clos (leaf-spine) topologies, which deliver far less per-GPU bandwidth than intra-rack NVLink.
Managing cross-rack traffic is critical: collective algorithms try to keep communication local and avoid "spine crossings," which add latency and congestion.
Network designs commonly use rail-optimized architectures, which map each GPU rank to a specific leaf switch so that traffic within a rail stays within a single switch hop.
Hyperscalers isolate compute traffic (the backend) from storage and management traffic (the frontend) on physically separate network fabrics to prevent interference.

Why This Matters

A single densely packed rack like the GB200 NVL72 provides 130 TB/s of aggregate NVLink bandwidth, but frontier models trained on 100,000 GPUs stretch across more than a thousand physical racks (roughly 1,400 at 72 GPUs per rack). When communication leaves the rack through optical transceivers into the Top-of-Rack (ToR) and spine switches, per-GPU bandwidth drops from terabytes per second over NVLink to, typically, tens of gigabytes per second over the scale-out fabric. If a bandwidth-hungry parallelization strategy such as Tensor Parallelism inadvertently spills across racks, every step waits on the optical fabric, and the whole cluster slows to that speed. The result is poor training efficiency and millions in wasted capital expenditure.
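The size of that drop can be estimated with simple arithmetic. The sketch below divides the 130 TB/s rack figure quoted above by 72 GPUs and compares the result with a scale-out NIC. The 400 Gb/s NIC speed is an assumption chosen for illustration, not a figure from this article; substitute the value for your own cluster.

```python
# Back-of-the-envelope comparison of intra-rack and cross-rack bandwidth per GPU.

NVL72_AGGREGATE_TBYTES = 130.0  # TB/s, the GB200 NVL72 figure quoted in the text
GPUS_PER_RACK = 72
NIC_GBITS_PER_GPU = 400.0       # ASSUMPTION: one 400 Gb/s scale-out NIC per GPU


def per_gpu_nvlink_gbytes(aggregate_tbytes: float, gpus: int) -> float:
    """Per-GPU NVLink bandwidth in GB/s, from the rack-level aggregate."""
    return aggregate_tbytes * 1000.0 / gpus


def nic_gbytes(gbits: float) -> float:
    """Convert a NIC line rate from Gb/s to GB/s."""
    return gbits / 8.0


nvlink = per_gpu_nvlink_gbytes(NVL72_AGGREGATE_TBYTES, GPUS_PER_RACK)
nic = nic_gbytes(NIC_GBITS_PER_GPU)
ratio = nvlink / nic

print(f"NVLink per GPU: {nvlink:.0f} GB/s")
print(f"NIC per GPU:    {nic:.0f} GB/s")
print(f"Ratio:          {ratio:.1f}x")
```

Under these assumptions the gap is a bit over 36x, which is why a collective sized for NVLink becomes the bottleneck as soon as it crosses the rack boundary.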

Core Intuition

Imagine a group of specialized workers in a single office room (a rack). They can communicate with each other instantly (NVLink). To talk to another office across the campus (cross-rack), they must write a formal letter, hand it to a mailroom (Leaf Switch), which sends it to a central sorting facility (Spine Switch), which finally routes it to the target office. The overarching goal of cluster communication design is to ensure that intense, high-volume brainstorming happens exclusively within the office, while only finalized, infrequent summaries are mailed across campus.

Technical Deep Dive

Cross-rack fabrics are built as multi-tier Clos topologies. Leaf switches are where each node's NICs physically attach to the fabric. In standard ToR designs, all GPUs in a server connect to a single leaf switch; rail-optimized designs instead map each GPU rank to its own leaf. Spine switches connect the leaves to provide an any-to-any mesh that is either non-blocking or minimally oversubscribed. To keep latency low, cross-rack communication relies on RDMA protocols (InfiniBand or RoCEv2) with GPUDirect transfers, so the NIC reads and writes GPU memory directly and payloads never pass through the host CPU.
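The difference between a standard ToR attachment and a rail-optimized one can be sketched by counting switch hops in a two-tier leaf-spine fabric. This is a simplified model, not a description of any specific product: `Gpu`, `leaf_tor`, `leaf_rail`, and `switch_hops` are hypothetical names, and the model assumes a single rail group in which the GPU with local rank r on every server attaches to leaf r.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    server: int      # index of the server in the cluster
    local_rank: int  # index of the GPU inside its server


def leaf_tor(gpu: Gpu, servers_per_leaf: int) -> int:
    """Standard ToR: every GPU of a server attaches to that rack's leaf."""
    return gpu.server // servers_per_leaf


def leaf_rail(gpu: Gpu) -> int:
    """Rail-optimized (ASSUMPTION: one rail group): local rank r uses leaf r."""
    return gpu.local_rank


def switch_hops(leaf_a: int, leaf_b: int) -> int:
    """Switches traversed in a two-tier Clos: 1 on the same leaf,
    otherwise 3 (leaf, spine, leaf)."""
    return 1 if leaf_a == leaf_b else 3


SERVERS_PER_LEAF = 4  # hypothetical rack size

# Two GPUs with the same local rank on servers in different racks.
a = Gpu(server=0, local_rank=3)
b = Gpu(server=40, local_rank=3)

tor_hops = switch_hops(leaf_tor(a, SERVERS_PER_LEAF), leaf_tor(b, SERVERS_PER_LEAF))
rail_hops = switch_hops(leaf_rail(a), leaf_rail(b))

print(f"ToR design:            {tor_hops} switch hops")
print(f"Rail-optimized design: {rail_hops} switch hop(s)")
```

In this model, same-rank traffic between distant servers crosses the spine under the ToR attachment but stays on one leaf under the rail-optimized one. Traffic between different local ranks still crosses the spine in the rail-optimized case.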

Because optical links (AOCs and transceivers) are expensive and failure-prone at massive scale, oversubscription is common on the links between the leaf and spine tiers. A 2:1 oversubscription means a leaf switch has twice as much downlink bandwidth to its GPUs as uplink bandwidth to the spine. The design bets that not all GPUs will send traffic across the spine at the same time.
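The oversubscription ratio, and what it implies when the bet fails, can be worked out directly. The port counts and speeds below are hypothetical values chosen to produce the 2:1 case described above, and the function names are illustrative.

```python
def oversubscription_ratio(downlinks: int, down_gbps: float,
                           uplinks: int, up_gbps: float) -> float:
    """Total downlink bandwidth divided by total uplink bandwidth on a leaf."""
    return (downlinks * down_gbps) / (uplinks * up_gbps)


def worst_case_per_gpu_gbps(downlinks: int, down_gbps: float,
                            uplinks: int, up_gbps: float) -> float:
    """Per-GPU cross-spine bandwidth if every attached GPU sends across the
    spine at once: the uplink pool is shared evenly, capped at the port speed."""
    return min(down_gbps, uplinks * up_gbps / downlinks)


# Hypothetical leaf: 32 GPU-facing ports and 16 spine-facing ports, all 400 Gb/s.
leaf = dict(downlinks=32, down_gbps=400.0, uplinks=16, up_gbps=400.0)

ratio = oversubscription_ratio(**leaf)
worst = worst_case_per_gpu_gbps(**leaf)

print(f"Oversubscription: {ratio:.0f}:1")
print(f"Worst-case cross-spine bandwidth per GPU: {worst:.0f} Gb/s")
```

With these numbers, each GPU gets its full 400 Gb/s only while at most half of its neighbors are crossing the spine. If all 32 do so at once, each is limited to 200 Gb/s.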

Key Takeaways

Cross-rack communication relies on Clos fabrics, which provide significantly less bandwidth than intra-rack NVLink.

Elephant flows cause severe hash collisions and congestion in standard Ethernet ECMP routing.

Separating frontend (storage) and backend (compute) networks prevents massive storage I/O from disrupting compute.

Adaptive routing and In-Network Computing are essential to maintaining robust performance across thousands of optical spine links.
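The ECMP takeaway can be illustrated with a small sketch. Static ECMP picks an uplink by hashing each flow's 5-tuple, so a handful of large flows can land on the same link while others sit idle. The CRC32 hash here is a stand-in, since real switches use vendor-specific hash functions, and the addresses and source ports are made up; only UDP destination port 4791 is the standard RoCEv2 port.

```python
import zlib
from collections import Counter


def ecmp_uplink(flow: tuple, num_uplinks: int) -> int:
    """Pick an uplink by hashing the flow 5-tuple. The choice is fixed for the
    life of the flow and ignores how much data the flow carries."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_uplinks  # ASSUMPTION: CRC32 as a stand-in hash


def link_loads(flows: list, num_uplinks: int) -> Counter:
    """Count how many flows the hash places on each uplink."""
    loads = Counter({link: 0 for link in range(num_uplinks)})
    for flow in flows:
        loads[ecmp_uplink(flow, num_uplinks)] += 1
    return loads


NUM_UPLINKS = 8

# Eight long-lived elephant flows: (src_ip, dst_ip, src_port, dst_port, proto).
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791, "UDP") for i in range(8)]

loads = link_loads(flows, NUM_UPLINKS)
for link in range(NUM_UPLINKS):
    print(f"uplink {link}: {loads[link]} flow(s)")
print(f"busiest uplink carries {max(loads.values())} flow(s)")
```

A perfectly balanced placement would put one flow on each uplink. Any uplink showing two or more flows is a hash collision, and with elephant flows that link becomes a sustained hotspot. This is the situation adaptive routing is meant to relieve.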
