Pipeline Parallelism

Slices the neural network by depth, assigning consecutive groups of layers to different GPUs (e.g., Layers 1-4 on GPU 1, Layers 5-8 on GPU 2).

Source: mortalapps.com
TL;DR
  • Splits the model by depth: each GPU holds a contiguous block of layers (e.g., Layers 1-4 on GPU 1, Layers 5-8 on GPU 2).
  • Micro-batching splits each training batch into smaller pieces so that several stages can work concurrently, which keeps hardware idle time down.
  • The fundamental engineering tradeoff revolves around managing the "pipeline bubble": idle time during which GPUs stall waiting for upstream or downstream data.
  • Essential for scaling deep models whose total parameter count far exceeds what a single node's Tensor Parallelism domain can hold.
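
As a minimal sketch of the layer split described above, the snippet below divides a model's layers into contiguous, near-equal stages. The function name `partition_layers` and the even-split policy are assumptions made for illustration; real frameworks may balance stages by compute or memory cost instead.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal stages."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        # Earlier stages absorb the remainder when the split is uneven.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


stages = partition_layers(num_layers=8, num_stages=2)
for gpu, layers in enumerate(stages, start=1):
    # Print 1-based layer numbers to match the example in the text.
    print(f"GPU {gpu}: layers {layers.start + 1}-{layers.stop}")
```

For 8 layers and 2 stages this reproduces the split used in the example: Layers 1-4 on GPU 1 and Layers 5-8 on GPU 2.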

Why This Matters

When a model surpasses approximately 50 billion parameters, it can no longer fit efficiently within a single 8-GPU node utilizing solely Tensor Parallelism. To scale execution across the broader datacenter without incurring the catastrophic cross-node latency penalties inherent to TP, Pipeline Parallelism (PP) distributes the model's layers across multiple nodes. Because PP only transmits specific activation tensors between boundary layers over the network (Point-to-Point communication), it remains highly resilient to slower inter-node InfiniBand or Ethernet interconnects.

Core Intuition

The mental model parallels an automotive assembly line. If GPU 1 builds the chassis (Layers 1-10) and GPU 2 installs the engine (Layers 11-20), GPU 2 sits completely idle while GPU 1 processes the first chassis. To mitigate this inefficiency, the global training batch is split into smaller "micro-batches." GPU 1 processes micro-batch 1 and passes it to GPU 2. While GPU 2 processes micro-batch 1, GPU 1 immediately begins processing micro-batch 2. However, at the very start (ramp-up phase) and end (ramp-down phase) of the global batch, some GPUs inevitably lack work. This unavoidable idle time constitutes the "pipeline bubble."
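
The ramp-up and ramp-down idle slots can be made visible with a toy timeline. The sketch below is a simplification, not a real scheduler: it models the forward pass only and assumes every stage takes exactly one time slot per micro-batch. The function names are hypothetical.

```python
def forward_timeline(num_stages: int, num_microbatches: int) -> list[list[str]]:
    """Return one row per GPU; each cell is a micro-batch id or '.' for idle."""
    total_slots = num_stages + num_microbatches - 1
    rows = []
    for stage in range(num_stages):
        row = []
        for slot in range(total_slots):
            mb = slot - stage  # micro-batch that reaches this stage in this slot
            row.append(str(mb + 1) if 0 <= mb < num_microbatches else ".")
        rows.append(row)
    return rows


def idle_fraction(rows: list[list[str]]) -> float:
    """Share of all (GPU, slot) cells in which a GPU has no work."""
    cells = [cell for row in rows for cell in row]
    return cells.count(".") / len(cells)


rows = forward_timeline(num_stages=4, num_microbatches=8)
for gpu, row in enumerate(rows, start=1):
    print(f"GPU {gpu}: " + " ".join(row))
print(f"idle fraction: {idle_fraction(rows):.3f}")
```

With 4 stages and 8 micro-batches, each GPU is idle for 3 of the 11 slots in this model. Raising the micro-batch count shrinks that share, while adding stages grows it.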

Technical Deep Dive

The magnitude of the pipeline bubble in standard 1F1B (One-Forward-One-Backward) scheduling is given by $t_{pb} = (p - 1)(t_f + t_b)$, where $p$ represents the number of pipeline stages, and $t_f$ and $t_b$ represent the forward and backward execution times of a single micro-batch on one stage.
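
A short worked example may help here. It assumes the commonly used form of the bubble, (p - 1) * (t_f + t_b), with t_f and t_b measured per micro-batch on one stage. The micro-batch count `m` and the ratio (p - 1) / m against the ideal compute time m * (t_f + t_b) are introduced here for illustration, and the numeric values are hypothetical.

```python
def bubble_time(p: int, t_f: float, t_b: float) -> float:
    """Idle time per GPU per batch: (p - 1) * (t_f + t_b)."""
    return (p - 1) * (t_f + t_b)


def bubble_ratio(p: int, m: int) -> float:
    """Bubble time divided by the ideal compute time m * (t_f + t_b)."""
    return (p - 1) / m


# Hypothetical values; time units are arbitrary.
p, t_f, t_b = 8, 1.0, 2.0
print("bubble time:", bubble_time(p, t_f, t_b))  # 21.0
for m in (8, 32, 128):
    print(f"m={m:>3}  bubble/ideal = {bubble_ratio(p, m):.4f}")
```

The bubble time itself does not depend on the number of micro-batches, so adding micro-batches only amortizes it over more useful work.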

| Scheduling Algorithm | Bubble Size | Memory Footprint | Communication Volume |
|---|---|---|---|
| GPipe | High | Massive (All micro-batches) | Base Point-to-Point |
| 1F1B | High | Moderate (Bounded by $p$) | Base Point-to-Point |
| Interleaved 1F1B | Medium ($1/v$) | Moderate | High ($v \times$ Base) |
| Zero Bubble (ZB-H1) | Near Zero | Controllable | Base Point-to-Point |

To reduce the bubble without altering the micro-batch count, Interleaved 1F1B (Virtual Pipeline Parallelism) assigns multiple disjoint, non-contiguous chunks of layers to the same physical GPU. If $v$ represents the number of virtual chunks, the bubble size shrinks in proportion to $1/v$, at the cost of increased point-to-point communication volume across the fabric, in exchange for higher SM utilization.
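
The chunk assignment behind Interleaved 1F1B can be sketched as follows. This is an illustrative round-robin layout, assuming the layer count divides evenly by the number of stages times the number of virtual chunks; the function names are hypothetical, and the bubble estimate assumes the 1F1B bubble (p - 1) * (t_f + t_b) divided by the chunk count.

```python
def interleaved_assignment(num_layers: int, p: int, v: int) -> dict[int, list[list[int]]]:
    """Map each GPU to v non-contiguous chunks of layer indices (round-robin)."""
    num_chunks = p * v
    if num_layers % num_chunks:
        raise ValueError("sketch assumes num_layers is divisible by p * v")
    per_chunk = num_layers // num_chunks
    assignment = {gpu: [] for gpu in range(p)}
    for chunk in range(num_chunks):
        layers = list(range(chunk * per_chunk, (chunk + 1) * per_chunk))
        assignment[chunk % p].append(layers)  # chunk c lives on GPU c mod p
    return assignment


def interleaved_bubble_time(p: int, v: int, t_f: float, t_b: float) -> float:
    """1F1B bubble (p - 1) * (t_f + t_b), reduced by the chunk count v."""
    return (p - 1) * (t_f + t_b) / v


for gpu, chunks in interleaved_assignment(num_layers=16, p=4, v=2).items():
    print(f"GPU {gpu}: {chunks}")
print("bubble, v=1:", interleaved_bubble_time(4, 1, 1.0, 2.0))
print("bubble, v=2:", interleaved_bubble_time(4, 2, 1.0, 2.0))
```

In this layout every micro-batch visits each GPU once per chunk rather than once in total, which is where the extra point-to-point traffic comes from.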

Key Takeaways

  • PP uniquely allows model execution to span across latency-bound nodes with minimal communication bandwidth requirements.
  • The pipeline bubble (idle SM time) is the primary enemy of PP scaling efficiency.
  • Splitting the traditional backward pass into B (activation) and W (weight) gradient passes unlocks Zero-Bubble scheduling.
  • Interleaved 1F1B trades increased P2P communication volume over the network for a mathematically smaller pipeline bubble.
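
To illustrate the B/W split on a single linear layer y = x · W, the sketch below separates the input gradient (B) from the weight gradient (W). It is a pure-Python toy with hypothetical helper names and made-up values, not the scheduler used by any Zero Bubble implementation. The point is only that the W computation has no consumer in the previous stage, so it can be deferred into a slot that would otherwise be idle.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def backward_input(grad_out, weight):
    """B pass: gradient w.r.t. the layer input, needed by the previous stage."""
    return matmul(grad_out, transpose(weight))


def backward_weight(saved_input, grad_out):
    """W pass: gradient w.r.t. the weights; no other stage waits on it."""
    return matmul(transpose(saved_input), grad_out)


x = [[1.0, 2.0]]                # one sample, two input features
w = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 3.0]]           # 2 x 3 weight matrix
grad_y = [[1.0, 1.0, 1.0]]      # hypothetical gradient arriving from the next stage

deferred = []                   # W work queued for a later, otherwise idle slot
grad_x = backward_input(grad_y, w)   # computed first and handed to the previous stage
deferred.append((x, grad_y))

# Later, when the GPU would otherwise stall, drain the deferred W work.
grad_w = [backward_weight(inp, g) for inp, g in deferred][0]
print("grad_x:", grad_x)
print("grad_w:", grad_w)
```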