Distributed Data Parallelism (DDP)

Replicates the entire model, optimizer state, and gradients across every participating GPU.

Source: mortalapps.com
TL;DR
  • Replicates the entire model, optimizer state, and gradients across every participating GPU.
  • Distributes only the input data micro-batches across the worker GPUs.
  • Optimizes communication by utilizing asynchronous gradient AllReduce operations overlapped with the backward pass.
  • Most effective for models that fit entirely within a single GPU's memory but require massive data throughput to converge.

Why This Matters

Distributed Data Parallelism is the foundational scaling paradigm for modern deep learning infrastructure. Frontier foundation models have largely outgrown pure DDP because of hardware memory limits, but understanding its mechanics remains necessary: advanced sharding strategies such as Fully Sharded Data Parallelism (FSDP) and ZeRO are architectural refactorings of the same DDP formulation, keeping its data-parallel gradient averaging while partitioning the replicated state. In production clusters spanning thousands of GPUs, DDP generally offers the highest computational efficiency and Model FLOPs Utilization (MFU) among these strategies, provided the model state fits entirely within each GPU's high-bandwidth memory (HBM).

Core Intuition

The mental model for Distributed Data Parallelism is "Compute Locally, Synchronize Globally." Every GPU in the distributed process group holds a mathematically identical copy of the model weights at the start of any given training step $t$. Because each GPU is fed a non-overlapping shard of the global dataset, every worker computes its own local gradient from its specific data slice. To keep the replicas synchronized for step $t+1$, the infrastructure must average these local gradients across all GPUs before the optimizer applies the weight update. The fundamental systems tradeoff is compute independence versus memory redundancy: the forward pass requires no inter-GPU communication, but every GPU stores the full model state, so adding GPUs increases throughput without reducing the per-GPU memory footprint.
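The "compute locally, synchronize globally" loop can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: the one-weight linear model, the strided sharding rule, and the helper names (`grad_mse`, `shard`, `all_reduce_mean`) are all hypothetical stand-ins, and the averaging function merely imitates what a real AllReduce would deliver to each rank. It also assumes every rank receives the same number of samples, which is what makes the average of local gradients equal the full-batch gradient.

```python
# Toy single-process simulation of one data-parallel step (illustrative names only).
# Model: y_hat = w * x, loss = mean((w * x - y) ** 2).

def grad_mse(w, xs, ys):
    """Gradient of the mean squared error with respect to the single weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def shard(items, rank, world_size):
    """Non-overlapping strided shard: rank r takes items r, r + W, r + 2W, ..."""
    return items[rank::world_size]

def all_reduce_mean(values):
    """Stand-in for an averaging AllReduce: every rank ends up with the same mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

world_size = 4
xs = [float(i) for i in range(1, 9)]   # 8 samples, evenly divisible by world_size
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5                                # identical weight replica on every rank at step t
lr = 0.01

# Each rank computes a gradient from its own shard only (no communication needed).
local_grads = [
    grad_mse(w, shard(xs, r, world_size), shard(ys, r, world_size))
    for r in range(world_size)
]

# Synchronize: average the local gradients, then apply the same update everywhere.
synced_grads = all_reduce_mean(local_grads)
new_weights = [w - lr * g for g in synced_grads]

# Reference: the gradient a single worker would compute on the whole batch.
full_batch_grad = grad_mse(w, xs, ys)
```

The local gradients differ from rank to rank, yet after averaging every rank applies the same update, so the replicas remain identical going into step $t+1$. With unequal shard sizes a plain mean of local means would no longer match the full-batch gradient, which is one reason data loaders for data-parallel training usually pad or drop samples to keep shards equal.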

Technical Deep Dive

DDP requires three primary state components to exist simultaneously in HBM: Parameters, Gradients, and Optimizer States. For a model containing $\Psi$ parameters trained in mixed precision, the memory footprint is highly deterministic.

| Component | Precision | Memory Requirement | Description |
| --- | --- | --- | --- |
| Parameters | FP16/BF16 | $2\Psi$ bytes | Active model weights used for forward/backward compute |
| Gradients | FP16/BF16 | $2\Psi$ bytes | Computed gradients for the current batch |
| Optimizer States | FP32 | $12\Psi$ bytes | Adam optimizer requires FP32 master weights (4 bytes), FP32 momentum (4 bytes), and FP32 variance (4 bytes) |
| Total | Mixed | $16\Psi$ bytes | Base memory requirement prior to activations [1] |

The communication pattern relies entirely on the AllReduce collective. When a layer's backward pass finishes calculating its gradient, DDP does not wait for the entire backward pass to complete. Instead, it assigns the gradient to a predefined communication bucket. Once a bucket reaches a capacity threshold (e.g., 25 MB), an asynchronous AllReduce is triggered over the network fabric, aggregating gradients while the backward pass continues computing upstream layers.
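The bucketing behaviour described above can be sketched as a small grouping routine. This is a simplified illustration, not any framework's actual implementation: the function name `assign_buckets`, the layer names, and the gradient sizes are hypothetical, and real DDP implementations differ in details such as when bucket membership is decided. The sketch only captures the two ideas from the text: gradients become ready in reverse layer order during the backward pass, and a bucket is handed off for an asynchronous AllReduce once it reaches the capacity threshold.

```python
def assign_buckets(grad_sizes_bytes, bucket_cap_bytes):
    """Group gradients into buckets in the order the backward pass produces them.

    grad_sizes_bytes: list of (layer_name, size_in_bytes) in forward order.
    Returns a list of buckets (lists of layer names). In this simplified model a
    bucket is closed, and would be passed to an asynchronous AllReduce, as soon
    as its accumulated size reaches bucket_cap_bytes.
    """
    buckets, current, current_bytes = [], [], 0
    for name, size in reversed(grad_sizes_bytes):  # backward runs last layer first
        current.append(name)
        current_bytes += size
        if current_bytes >= bucket_cap_bytes:
            buckets.append(current)                # bucket full: launch AllReduce here
            current, current_bytes = [], 0
    if current:                                    # flush the final partial bucket
        buckets.append(current)
    return buckets

MB = 1024 * 1024
# Hypothetical model: gradient size per layer, listed in forward order.
layers = [
    ("embed", 20 * MB),
    ("block1", 10 * MB),
    ("block2", 10 * MB),
    ("block3", 10 * MB),
    ("head", 8 * MB),
]
buckets = assign_buckets(layers, bucket_cap_bytes=25 * MB)
for i, bucket in enumerate(buckets):
    print(f"bucket {i}: {bucket}")
```

In this toy configuration the first bucket closes while the gradients for `block1` and `embed` are still being computed, which is exactly the window in which communication can hide behind computation. A smaller cap starts communication earlier but issues more, smaller collectives; a larger cap does the opposite.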

Key Takeaways

  • DDP mandates mathematically identical weights across all GPUs and relies on synchronized, averaged gradients.
  • Overlapping the AllReduce collective with the backward pass computation is the primary driver of its near-linear scaling efficiency.
  • Memory redundancy is the fundamental architectural limitation that birthed modern sharded paradigms.
  • Performance is bottlenecked by the slowest GPU in the process group due to strict synchronization barriers.
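The memory-redundancy takeaway can be made concrete with a back-of-the-envelope calculation. The sketch below assumes the mixed-precision Adam accounting from the Technical Deep Dive (2 bytes per parameter for FP16/BF16 weights, 2 bytes for gradients, and 4 + 4 + 4 bytes of FP32 optimizer state) and ignores activations, buffers, and allocator overhead. The function names and the example model sizes are hypothetical.

```python
# Bytes of model state per parameter under mixed-precision Adam (activations excluded).
BYTES_PER_PARAM = {
    "parameters": 2,         # FP16/BF16 weights
    "gradients": 2,          # FP16/BF16 gradients
    "optimizer_states": 12,  # FP32 master weights + momentum + variance (4 bytes each)
}

GIB = 1024 ** 3

def ddp_state_bytes_per_gpu(num_params):
    """Model state every single GPU must hold under pure DDP."""
    return num_params * sum(BYTES_PER_PARAM.values())

def ddp_state_bytes_cluster(num_params, world_size):
    """Aggregate state across the cluster: one full replica per GPU."""
    return ddp_state_bytes_per_gpu(num_params) * world_size

def fits_in_hbm(num_params, hbm_bytes):
    """True if the replicated state alone fits in one GPU's memory."""
    return ddp_state_bytes_per_gpu(num_params) <= hbm_bytes

for num_params in (1_500_000_000, 7_000_000_000):  # hypothetical model sizes
    per_gpu = ddp_state_bytes_per_gpu(num_params)
    print(f"{num_params:,} params: {per_gpu / GIB:.1f} GiB of state per GPU, "
          f"{ddp_state_bytes_cluster(num_params, 64) / GIB:.1f} GiB across 64 GPUs")
```

The per-GPU figure does not shrink as GPUs are added, while the cluster-wide total grows linearly with the number of replicas. That replicated state is what sharded approaches partition across workers.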