Distributed AI Training

3D Parallelism Topologies

Combines Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) to train massive foundation models.

Published June 1, 2026 · By MortalApps · 5 min read · ~908 words

TL;DR

3D parallelism combines DP, TP, and PP so that the product of their degrees covers every GPU in the cluster.
Maps each parallel strategy to the physical interconnect that suits it (e.g., TP on NVLink, and PP/DP across InfiniBand).
Ensures that parameter size, activation memory, and communication overhead all fit within the cluster's hard constraints at the same time.
Forms the core architectural blueprint behind frameworks like Megatron-LM and DeepSpeed.


Why This Matters

No single parallelism strategy is sufficient for training a 100B+ parameter model. DP/FSDP runs into network bandwidth limits; TP is latency-sensitive and in practice scales poorly across node boundaries; PP suffers from pipeline bubbles that leave GPUs idle. Because the three strategies are orthogonal, their degrees multiply ($N = DP \times TP \times PP$, where $N$ is the total number of GPUs), so engineers can construct unified 3D parallelism topologies that exploit the strengths of each method while mitigating its weaknesses, and run efficiently across thousands of GPUs.

Core Intuition

Picture a datacenter as a set of nested, hierarchical networks. Inside a physical server, 8 GPUs are connected by NVLink, which provides ~900 GB/s of bandwidth. Between servers, nodes are connected by a slower InfiniBand fabric providing ~50 GB/s.

The infrastructure engineer places the most communication-heavy strategy (TP, with its multiple blocking AllReduce operations per layer) inside the fast NVLink domain. PP, which only passes activations at stage boundaries, is placed across the slower InfiniBand network. Finally, to scale the global batch size, this entire TP/PP block is replicated with Data Parallelism (or ZeRO), forming a cohesive 3D computational grid.
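The placement described above can be sketched as a mapping from a global rank to grid coordinates. This is a minimal illustration, not any framework's actual rank ordering: the names `Coord`, `rank_to_coord`, `tp_group`, and `node_of` are hypothetical, and the sketch assumes TP varies fastest and that ranks are assigned to nodes in consecutive blocks of 8.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    dp: int  # data-parallel replica index
    pp: int  # pipeline stage index
    tp: int  # tensor-parallel shard index


def rank_to_coord(rank: int, tp: int, pp: int, dp: int) -> Coord:
    """Map a global rank to (dp, pp, tp), with TP varying fastest.

    Assumption of this sketch: TP is the innermost dimension, then PP,
    then DP. Real frameworks may order the dimensions differently.
    """
    if not 0 <= rank < tp * pp * dp:
        raise ValueError("rank outside the tp * pp * dp grid")
    return Coord(dp=rank // (tp * pp), pp=(rank // tp) % pp, tp=rank % tp)


def tp_group(rank: int, tp: int) -> list[int]:
    """Ranks sharing a TP group with `rank` (a run of consecutive ranks)."""
    start = (rank // tp) * tp
    return list(range(start, start + tp))


def node_of(rank: int, gpus_per_node: int = 8) -> int:
    """Node index, assuming ranks fill nodes in consecutive blocks."""
    return rank // gpus_per_node


if __name__ == "__main__":
    # 64 GPUs as TP=8, PP=4, DP=2
    print(rank_to_coord(13, tp=8, pp=4, dp=2))
    print({node_of(r) for r in tp_group(13, tp=8)})
```

Because TP is the fastest-varying dimension and its degree divides the GPUs per node, every TP group in this sketch lands on a single node, while consecutive pipeline stages sit on different nodes.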

Technical Deep Dive

Let $N$ represent the total number of GPUs in the cluster. Then $N = DP \times TP \times PP$.
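Since the three degrees must multiply to the total GPU count, the candidate layouts for a cluster can be enumerated directly. The sketch below is illustrative only: `valid_layouts` is a hypothetical helper, and the rule that the TP degree must divide the GPUs per node is an assumption that encodes the "TP stays inside one node" guidance.

```python
def valid_layouts(n_gpus: int, gpus_per_node: int = 8) -> list[tuple[int, int, int]]:
    """Return all (tp, pp, dp) triples with tp * pp * dp == n_gpus.

    Assumption of this sketch: tp must divide gpus_per_node so that
    every TP group fits inside a single node.
    """
    layouts = []
    for tp in range(1, gpus_per_node + 1):
        if gpus_per_node % tp or n_gpus % tp:
            continue  # tp would straddle nodes or not divide the cluster
        rest = n_gpus // tp
        for pp in range(1, rest + 1):
            if rest % pp == 0:
                layouts.append((tp, pp, rest // pp))
    return layouts


if __name__ == "__main__":
    for tp, pp, dp in valid_layouts(16):
        print(f"TP={tp} PP={pp} DP={dp}")
```

Enumeration only narrows the search; choosing among the candidates still depends on memory fit and communication cost.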

Topology Dimension	Scope	Bandwidth Requirement	Sharding Target
Tensor (TP)	Intra-node	Extreme (NVLink)	Intra-layer weights / activations
Pipeline (PP)	Inter-node	Low (P2P Send/Recv)	Inter-layer stages
Data (DP)	Global	High (AllReduce/ReduceScatter)	Gradients / Optimizer State
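
TP Dimension: Typically a small power of two no larger than the number of GPUs in one node (8 in the example above). It requires low-latency synchronization within every layer.

PP Dimension: Typically spans multiple nodes. It has minimal bandwidth requirements, but demands careful micro-batching configuration (the number of micro-batches per step) to reduce the pipeline bubble.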
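The effect of micro-batching on the pipeline bubble can be estimated with a simple formula. The sketch below assumes a non-interleaved schedule (GPipe or 1F1B style) in which every stage takes the same time per micro-batch; under that assumption the idle fraction of a step is (p - 1) / (m + p - 1) for p stages and m micro-batches. The function name `bubble_fraction` is hypothetical.

```python
def bubble_fraction(pp_stages: int, micro_batches: int) -> float:
    """Idle fraction of one training step for a non-interleaved pipeline.

    Assumes equal per-stage compute time and ignores communication.
    """
    if pp_stages < 1 or micro_batches < 1:
        raise ValueError("pp_stages and micro_batches must be >= 1")
    return (pp_stages - 1) / (micro_batches + pp_stages - 1)


if __name__ == "__main__":
    for m in (4, 8, 32, 128):
        print(f"PP=4, micro-batches={m}: bubble = {bubble_fraction(4, m):.1%}")
```

Under these assumptions, raising the micro-batch count shrinks the bubble, at the cost of smaller per-micro-batch work and more activation bookkeeping.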

DP Dimension: $DP = N / (TP \times PP)$, where $N$ is the total number of GPUs. It synchronizes gradients across all replicas at the end of each step.

If Sequence Parallelism (SP) is introduced, it is typically tightly coupled with TP to form a 4D grid, or mapped independently when using DeepSpeed Ulysses. If Mixture-of-Experts (MoE) layers are used, Expert Parallelism (EP) partially replaces or augments the DP/TP domains.
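A rough way to check whether a chosen split fits in GPU memory is to estimate the model-state footprint per GPU. The sketch below is a back-of-the-envelope estimate under stated assumptions, not a framework calculation: it assumes mixed-precision Adam at 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state), an even split of parameters across TP and PP, and it ignores activations and buffers entirely. The name `model_state_gb` and the `zero_stage_1` flag are hypothetical.

```python
def model_state_gb(params: float, tp: int, pp: int, dp: int,
                   zero_stage_1: bool = False) -> float:
    """Estimated weights + gradients + optimizer state per GPU, in GB (1e9 bytes).

    Assumptions: mixed-precision Adam (16 bytes/param), parameters split
    evenly over tp * pp GPUs, activations not counted.
    """
    params_per_gpu = params / (tp * pp)
    weights_and_grads = 4 * params_per_gpu  # fp16 weights (2 B) + fp16 grads (2 B)
    optimizer = 12 * params_per_gpu         # fp32 master weights, momentum, variance
    if zero_stage_1:
        optimizer /= dp                     # optimizer state sharded across DP replicas
    return (weights_and_grads + optimizer) / 1e9


if __name__ == "__main__":
    # 100B parameters on 1024 GPUs as TP=8, PP=8, DP=16
    print(model_state_gb(100e9, tp=8, pp=8, dp=16))
    print(model_state_gb(100e9, tp=8, pp=8, dp=16, zero_stage_1=True))
```

Activation memory, which depends on sequence length, micro-batch size, and recomputation settings, has to be added on top of this figure before comparing against the device's capacity.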

Key Takeaways

3D parallelism multiplies the DP, TP, and PP degrees into a single device grid.

Physical interconnect limits dictate where each logical parallel dimension can be placed.

TP should stay intra-node; PP and DP bridge the inter-node gaps.

This formulation is the de facto standard for training dense models exceeding 100B parameters.
