Communication-Computation Overlap
Communication-computation overlap hides network latency by running GPU compute kernels concurrently with network data transfers.
Source: mortalapps.com
- Frameworks like Megatron-LM strategically chunk Data Parallel (DP), Tensor Parallel (TP), and Pipeline Parallel (PP) communication to interleave with non-dependent compute kernels.
- This overlap is critical for large-scale Large Language Model (LLM) training because it raises Model FLOPs Utilization (MFU), the fraction of the hardware's peak FLOP/s that the training job actually achieves.
- Successful overlap requires careful CUDA stream management so that the GPU's streaming multiprocessors (SMs) are shared appropriately between NCCL channels and heavy GEMM operations.
Why This Matters
In distributed training, GPUs can spend a large share of each step blocked, waiting for data to arrive over the network before they can execute the next layer of the neural network. Communication that cannot be hidden adds directly to step time, so part of the GPU's theoretical peak TeraFLOPS goes unused. Overlapping restructures the execution graph so the GPU keeps computing while the NIC fetches the next required payload in the background, pushing the exposed communication cost toward zero.
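To make the cost of exposed communication concrete, here is a minimal back-of-the-envelope sketch. The function names (`step_time`, `mfu`) and every number in it are hypothetical and chosen only for illustration; it assumes one compute phase and one communication phase per step and ignores any slowdown that overlapped kernels may impose on each other.

```python
def step_time(compute_s: float, comm_s: float, overlap_fraction: float) -> float:
    """Step time when `overlap_fraction` of the communication hides behind compute."""
    if not 0.0 <= overlap_fraction <= 1.0:
        raise ValueError("overlap_fraction must be in [0, 1]")
    # Communication can only hide behind compute that actually exists.
    hidden = min(comm_s * overlap_fraction, compute_s)
    exposed = comm_s - hidden
    return compute_s + exposed


def mfu(model_flops: float, step_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs Utilization: achieved FLOP/s divided by peak FLOP/s."""
    return model_flops / (step_s * peak_flops_per_s)


# Illustrative numbers only (not measurements).
COMPUTE_S = 0.8      # seconds of pure math per step
COMM_S = 0.2         # seconds of network transfer per step
MODEL_FLOPS = 4e14   # FLOPs the model needs per step on one GPU
PEAK = 1e15          # assumed peak FLOP/s of that GPU

for fraction in (0.0, 0.5, 1.0):
    t = step_time(COMPUTE_S, COMM_S, fraction)
    print(f"overlap={fraction:.1f}  step={t:.2f}s  MFU={mfu(MODEL_FLOPS, t, PEAK):.2f}")
```

With these made-up inputs, hiding all of the communication shortens the step from 1.0 s to 0.8 s and lifts MFU from 0.40 to 0.50. The point is only that exposed communication adds directly to step time, so any portion that is hidden shows up as utilization.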
Core Intuition
Think of a busy restaurant kitchen. A chef (the GPU) must chop vegetables (compute) and wait for a delivery driver (the network) to bring the meat. If the chef stands idle, waiting for all ingredients to arrive before beginning to cook, precious time is wasted. Overlap means the chef starts chopping the vegetables immediately, and by the time they are done, the delivery driver arrives with the meat. Both resources (the chef and the delivery driver) are fully utilized in parallel, maximizing the kitchen's output.
Technical Deep Dive
Overlapping is managed differently depending on the chosen parallelism strategy.
For Data Parallelism (DP), sharded approaches such as distributed optimizers (ZeRO-style) or FSDP require a reduce-scatter of gradients and an all-gather of updated parameters. The framework chunks the DP communication at the granularity of a Transformer layer, so each layer's collective is hidden beneath the computation of an adjacent layer.
For Pipeline Parallelism (PP), during the 1F1B (One Forward, One Backward) steady state, the framework interleaves point-to-point (P2P) activation sends and receives with the computation of micro-batches that do not depend on them. Setting batch_p2p_comm=False makes the send and the receive run as separate kernels, which can improve overlap efficiency.
For Expert Parallelism (EP) in Mixture of Experts (MoE) architectures, the large all-to-all token dispatch communications are hidden behind the computation of the expert Feed-Forward Networks (FFNs).
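The per-layer chunking described for data parallelism can be sketched as a two-stream timeline: one stream computes each layer's backward pass, the other runs that layer's gradient collective as soon as the gradients exist. This is a toy model under stated assumptions, not Megatron-LM's scheduler. The function name `backward_pass_time` and the millisecond figures are hypothetical, and the model ignores SM contention between communication and compute kernels.

```python
def backward_pass_time(layers, overlap=True):
    """Time until both backward compute and gradient communication finish.

    `layers` is a list of (compute_ms, comm_ms) pairs in backward execution order.
    """
    compute_done = 0.0  # time at which the compute stream finishes its current layer
    comm_done = 0.0     # time at which the communication stream becomes free
    for compute_ms, comm_ms in layers:
        if overlap:
            # Compute never waits: it moves straight on to the next layer.
            compute_done += compute_ms
            # A layer's collective needs that layer's gradients and a free comm stream.
            comm_start = max(compute_done, comm_done)
            comm_done = comm_start + comm_ms
        else:
            # Serialized: compute stalls until the previous collective has finished.
            compute_done = max(compute_done, comm_done) + compute_ms
            comm_done = compute_done + comm_ms
    return max(compute_done, comm_done)


# Four identical layers: 10 ms of backward compute, 4 ms of reduce-scatter each.
layers = [(10, 4)] * 4
serialized = backward_pass_time(layers, overlap=False)
overlapped = backward_pass_time(layers, overlap=True)
print(f"serialized: {serialized} ms, overlapped: {overlapped} ms")
```

In this toy setup the serialized schedule takes 56 ms and the overlapped one 44 ms. Only the last layer's 4 ms collective stays exposed, because no later compute remains to hide it behind. If communication per layer exceeds compute per layer, the communication stream becomes the bottleneck and chunking alone cannot hide it.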