AI Networking

Communication-Computation Overlap

Communication-computation overlap hides network latency by running GPU compute kernels concurrently with network data transfers.

Published June 1, 2026 · By MortalApps · 5 min read · ~901 words

TL;DR

Communication-computation overlap hides network latency by running GPU compute kernels concurrently with network data transfers.
Frameworks like Megatron-LM chunk Data Parallel (DP), Tensor Parallel (TP), and Pipeline Parallel (PP) communication so that each chunk can interleave with compute kernels that do not depend on it.
Overlap matters most in large-scale Large Language Model (LLM) training, where it raises Model FLOPs Utilization (MFU) by shrinking the time GPUs spend waiting on the network.
Effective overlap depends on careful CUDA stream management and on how the GPU's streaming multiprocessors (SMs) are shared between NCCL channels and large GEMM kernels.


Why This Matters

In distributed training, GPUs can spend a large share of each step blocked, waiting for data to arrive over the network before the next layer of the neural network can run. Any communication time that is not hidden adds directly to step time, so part of the GPU's theoretical peak TFLOPS goes unused. Overlap restructures the execution schedule so the GPU keeps computing while the NIC moves the next required payload in the background. In the ideal case the exposed communication cost approaches zero; in practice, only communication that has independent compute to run alongside it can be hidden.
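The effect on step time can be sketched with simple arithmetic. The model below is a simplification, not a measurement: it assumes one compute phase and one communication phase per step, and a hypothetical `overlap_fraction` parameter describing how much of the communication is issued concurrently with compute. The function names and all numbers are illustrative assumptions.

```python
def step_time(compute_s: float, comm_s: float, overlap_fraction: float) -> float:
    """Estimate step time when part of the communication hides behind compute.

    overlap_fraction is a hypothetical knob: 0.0 means fully serialized,
    1.0 means every transfer is issued concurrently with compute.
    """
    if not 0.0 <= overlap_fraction <= 1.0:
        raise ValueError("overlap_fraction must be in [0, 1]")
    # Communication can only hide behind compute that actually exists.
    hidden = min(comm_s * overlap_fraction, compute_s)
    exposed = comm_s - hidden
    return compute_s + exposed


def mfu(model_flops: float, step_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs Utilization: useful model FLOPs over peak FLOPs in the step."""
    return model_flops / (step_s * peak_flops_per_s)


# Illustrative numbers only (assumed, not measured).
COMPUTE_S = 0.75
COMM_S = 0.25
MODEL_FLOPS = 3.0e14        # per GPU per step
PEAK_FLOPS_PER_S = 1.0e15   # hypothetical accelerator peak

serialized = step_time(COMPUTE_S, COMM_S, overlap_fraction=0.0)
overlapped = step_time(COMPUTE_S, COMM_S, overlap_fraction=1.0)
print(f"serialized: {serialized:.2f} s, "
      f"MFU {mfu(MODEL_FLOPS, serialized, PEAK_FLOPS_PER_S):.0%}")
print(f"overlapped: {overlapped:.2f} s, "
      f"MFU {mfu(MODEL_FLOPS, overlapped, PEAK_FLOPS_PER_S):.0%}")
```

With these assumed inputs the serialized step takes 1.00 s and the fully overlapped step 0.75 s, which moves the estimated MFU from 30% to 40%. When communication takes longer than compute, the same model reduces to a step time equal to the communication time, so overlap alone cannot rescue a communication-bound step.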

Core Intuition

Think of a busy restaurant kitchen. A chef (the GPU) must chop vegetables (compute) and wait for a delivery driver (the network) to bring the meat. If the chef stands idle, waiting for all ingredients to arrive before beginning to cook, precious time is wasted. Overlap means the chef starts chopping the vegetables immediately, and by the time they are done, the delivery driver arrives with the meat. Both resources (the chef and the delivery driver) are fully utilized in parallel, maximizing the kitchen's output.

Technical Deep Dive

Overlap is managed differently depending on the parallelism strategy.

For Data Parallelism (DP), distributed optimizers (like ZeRO or FSDP) require a reduce-scatter of gradients and an all-gather of updated parameters. The framework chunks the DP communication at the granularity of a Transformer layer, hiding each collective beneath the computation of the adjacent layer.

For Pipeline Parallelism (PP), during the 1F1B (One Forward, One Backward) steady state, the framework interleaves point-to-point (P2P) activation sends and receives with the computation of micro-batches that do not depend on them. Setting batch_p2p_comm=False uses separate kernels for send and receive, which improves overlap efficiency.

For Expert Parallelism (EP) in Mixture of Experts (MoE) architectures, the large all-to-all token dispatch communications are hidden behind the computation of the expert Feed-Forward Networks (FFNs).
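The layer-granular chunking described for data parallelism can be sketched as a two-stream timeline. This is a toy model under stated assumptions, not Megatron-LM's implementation: there is one compute stream and one communication stream, each layer's gradient reduce-scatter becomes eligible as soon as that layer's backward pass finishes, and collectives run one at a time. The function name and the timings are hypothetical.

```python
from typing import Sequence, Tuple


def simulate_layerwise_overlap(
    compute_ms: Sequence[float], comm_ms: Sequence[float]
) -> Tuple[float, float]:
    """Return (total_ms, exposed_comm_ms) for a two-stream backward pass.

    compute_ms[i] is the backward compute time of the i-th layer processed;
    comm_ms[i] is the time of that layer's gradient collective.
    """
    if len(compute_ms) != len(comm_ms):
        raise ValueError("need one communication time per layer")
    compute_done = 0.0  # time at which the compute stream finishes the layer
    comm_done = 0.0     # time at which the communication stream becomes free
    for layer_compute, layer_comm in zip(compute_ms, comm_ms):
        compute_done += layer_compute
        # A collective waits for its gradients and for the previous collective.
        comm_start = max(compute_done, comm_done)
        comm_done = comm_start + layer_comm
    total = max(compute_done, comm_done)
    # Exposed communication is whatever extends past the end of compute.
    return total, total - compute_done


# Assumed timings: 4 layers, 4 ms of backward compute and 3 ms of
# reduce-scatter per layer.
LAYERS = 4
total, exposed = simulate_layerwise_overlap([4.0] * LAYERS, [3.0] * LAYERS)
serialized = 4.0 * LAYERS + 3.0 * LAYERS
print(f"overlapped: {total} ms (exposed {exposed} ms), serialized: {serialized} ms")
```

In this example only the last layer's collective is exposed (3 ms of a 19 ms pass, against 28 ms when serialized), because no later compute remains to hide it. If each collective took longer than a layer's compute, the communication stream would fall behind and the exposed time would grow with every layer.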

Key Takeaways

Overlapping hides network latency by executing asynchronous NCCL calls concurrently alongside compute kernels.

It is used in DP gradient and parameter synchronization, PP activation passing, and MoE all-to-all dispatch.

Effective overlap requires careful management of CUDA streams and SM partitioning so that neither the NCCL kernels nor the compute kernels are starved of resources.

Pipeline overlapping is most effective during the 1F1B steady state, while the fill and flush phases remain exposed.
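The split between exposed and overlappable phases can be sketched by counting micro-batches per phase. The sketch assumes the common non-interleaved 1F1B formulation, in which a stage runs `pp_size - stage - 1` forward-only micro-batches before entering steady state; the function and type names are hypothetical, and real schedules (for example interleaved ones) differ.

```python
from typing import NamedTuple


class PhaseCounts(NamedTuple):
    warmup: int    # forward-only micro-batches (pipeline fill)
    steady: int    # iterations that pair one forward with one backward
    cooldown: int  # backward-only micro-batches (pipeline flush)


def one_f_one_b_phases(pp_size: int, stage: int, num_microbatches: int) -> PhaseCounts:
    """Count micro-batches per phase for one stage of a non-interleaved 1F1B schedule."""
    if not 0 <= stage < pp_size:
        raise ValueError("stage must be in [0, pp_size)")
    if num_microbatches < 0:
        raise ValueError("num_microbatches must be non-negative")
    # Earlier stages must push more forwards before their first backward arrives.
    warmup = min(pp_size - stage - 1, num_microbatches)
    steady = num_microbatches - warmup
    # Every warmup forward still needs its backward once steady state ends.
    return PhaseCounts(warmup=warmup, steady=steady, cooldown=warmup)


# Assumed configuration: 4 pipeline stages, 8 micro-batches per step.
PP_SIZE = 4
MICROBATCHES = 8
for stage in range(PP_SIZE):
    phases = one_f_one_b_phases(PP_SIZE, stage, MICROBATCHES)
    print(f"stage {stage}: {phases}")
```

Under these assumptions the first stage spends 5 of its 8 micro-batches in steady state while the last stage spends all 8 there, so increasing the number of micro-batches per step enlarges the portion of the schedule where P2P transfers can overlap with compute.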
