Sequence Parallelism

Partitions the massive input activation tensors directly along the sequence dimension across the GPU cluster.

Source: mortalapps.com
TL;DR
  • Partitions the input activation tensors along the sequence dimension across the GPU cluster.
  • Megatron SP (TP-sp) splits TP's AllReduce into a ReduceScatter and an AllGather so that activations stay sharded between TP blocks.
  • DeepSpeed Ulysses uses All-to-All collectives to reshuffle sequence-partitioned tensors into head-partitioned layouts, so attention runs locally and remains compatible with FlashAttention.
  • Necessary for training on very long contexts (e.g., 100K+ tokens), where the activation memory for a single sequence exceeds the memory of one GPU.

Why This Matters

As the industry shifts toward long-context foundation models (e.g., GPT-4 128K, Claude 200K, Gemini 1M), activation memory becomes the limiting factor. It grows quadratically with sequence length in attention when the score matrix is materialized, and linearly in the other layers. A batch of 100K-token sequences can run a single GPU out of memory regardless of the model's parameter count. Sequence Parallelism (SP) distributes this activation memory across the participating GPUs, so each device holds only its share of the sequence.
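To make the scaling concrete, here is a back-of-the-envelope sketch. The head count, hidden size, 2-byte element width, and SP size are hypothetical values chosen for illustration, and the calculation assumes the full attention score matrix is materialized (fused kernels such as FlashAttention avoid this).

```python
GIB = 1024 ** 3


def attention_scores_bytes(seq_len, num_heads, bytes_per_elem=2, batch=1):
    """Bytes for one layer's (seq_len x seq_len) score matrix across all heads."""
    return batch * num_heads * seq_len * seq_len * bytes_per_elem


def linear_activation_bytes(seq_len, hidden, bytes_per_elem=2, batch=1):
    """Bytes for one (seq_len x hidden) activation tensor."""
    return batch * seq_len * hidden * bytes_per_elem


# Hypothetical model shape, not taken from any specific model.
seq_len, num_heads, hidden, sp_size = 100_000, 32, 4096, 8

scores = attention_scores_bytes(seq_len, num_heads)
linear = linear_activation_bytes(seq_len, hidden)

print(f"score matrix, one layer : {scores / GIB:8.1f} GiB")
print(f"hidden activation       : {linear / GIB:8.2f} GiB")
# Idealized even split: each SP rank holds 1/sp_size of the sequence.
print(f"hidden activation / rank: {linear / sp_size / GIB:8.3f} GiB")
```

Doubling `seq_len` quadruples the first figure and doubles the second, which is the quadratic versus linear behaviour described above.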

Core Intuition

Instead of GPU 1 holding the entire sequence "The quick brown fox jumps over the lazy dog", GPU 1 holds "The quick", GPU 2 holds "brown fox", and so on. For token-wise operations such as LayerNorm or GeLU, no communication is required because each token is processed independently of the others. Attention is different: every token must interact with every other token. The system therefore has two options: either gather all tokens temporarily to compute attention (Megatron SP), or rearrange the tensor over the network so that each GPU holds the full sequence, but only for a specific subset of attention heads (DeepSpeed Ulysses).
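The following single-process sketch illustrates the distinction. It is not a distributed implementation: shards are plain Python lists, the toy input is arbitrary, and all function names are made up for this example. Token-wise work gives the same result on shards as on the full sequence, while attention over only the local shard does not.

```python
import math


def layer_norm(token, eps=1e-5):
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]


def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def tokenwise_block(tokens):
    """LayerNorm followed by GeLU, applied to each token independently."""
    return [[gelu(v) for v in layer_norm(t)] for t in tokens]


def shard_sequence(tokens, sp_size):
    chunk = len(tokens) // sp_size
    return [tokens[r * chunk:(r + 1) * chunk] for r in range(sp_size)]


def attend(query, keys, values):
    """Softmax attention for a single query vector."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


# Toy input: 8 tokens with hidden size 4, split across 4 simulated GPUs.
sequence = [[math.sin(4 * t + h) for h in range(4)] for t in range(8)]
shards = shard_sequence(sequence, sp_size=4)

# Token-wise ops: process each shard on its own, then concatenate.
full_out = tokenwise_block(sequence)
sharded_out = [tok for shard in shards for tok in tokenwise_block(shard)]
print("token-wise ops match:", sharded_out == full_out)

# Attention: token 0 sees different context with local keys only.
local_only = attend(sequence[0], shards[0], shards[0])
global_attn = attend(sequence[0], sequence, sequence)
print("attention matches   :", local_only == global_attn)
```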

Technical Deep Dive

There are three predominant, competing architectures for handling Sequence Parallelism:

SP Strategy         | Communication Pattern     | Comm Volume / GPU | Constraints
Megatron SP (TP-sp) | AllGather + ReduceScatter | High              | Bound by TP size / NVLink
DeepSpeed Ulysses   | Two All-to-Alls per layer |                   | SP size ≤ Number of Heads
Ring Attention      | P2P Ring passes           |                   | None (Infinite scaling)

Megatron SP (TP-sp): Designed to run alongside existing Tensor Parallelism. It avoids replicating activations after each TP block by restructuring the collective at the block boundary. An AllReduce is equivalent to a ReduceScatter followed by an AllGather. By ending the TP block with only the ReduceScatter, the activations remain sharded along the sequence dimension. Right before the next TP block begins, an AllGather reconstitutes the full sequence locally.

DeepSpeed Ulysses: Partitions the input tensor along the sequence dimension. Before the attention calculation, it executes an All-to-All collective across the fabric to transpose the tensor from a sequence-partitioned layout to a head-partitioned layout. For example, GPU 1 receives head 1 for every token in the sequence. Attention is then computed entirely locally (which keeps it compatible with FlashAttention), followed by a second All-to-All that transposes the tensor back.
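Both data movements can be simulated in a single process. This is only a sketch of the layouts, not real `torch.distributed` or NCCL code: each "rank" is a Python list, the helper names are hypothetical, and the All-to-All assumes the head count divides evenly by the SP size.

```python
def all_reduce(partials):
    """Every rank ends up with the element-wise sum of all partials."""
    summed = [sum(vals) for vals in zip(*partials)]
    return [list(summed) for _ in partials]


def reduce_scatter(partials):
    """Sum across ranks, but rank r keeps only its own sequence chunk."""
    world = len(partials)
    chunk = len(partials[0]) // world
    summed = [sum(vals) for vals in zip(*partials)]
    return [summed[r * chunk:(r + 1) * chunk] for r in range(world)]


def all_gather(shards):
    """Every rank receives the concatenation of all shards."""
    full = [v for shard in shards for v in shard]
    return [list(full) for _ in shards]


def seq_to_head_all_to_all(seq_shards):
    """Sequence-partitioned -> head-partitioned (Ulysses, before attention)."""
    world = len(seq_shards)
    heads = len(seq_shards[0][0])
    assert heads % world == 0, "sketch assumes heads divisible by SP size"
    hpr = heads // world  # heads per rank
    return [
        [tok[dst * hpr:(dst + 1) * hpr] for src in range(world) for tok in seq_shards[src]]
        for dst in range(world)
    ]


def head_to_seq_all_to_all(head_shards):
    """Head-partitioned -> sequence-partitioned (Ulysses, after attention)."""
    world = len(head_shards)
    chunk = len(head_shards[0]) // world  # tokens per rank
    return [
        [[h for src in range(world) for h in head_shards[src][t]]
         for t in range(dst * chunk, (dst + 1) * chunk)]
        for dst in range(world)
    ]


# Megatron SP: two ranks hold partial sums over a sequence of 4 positions.
partials = [[1, 2, 3, 4], [10, 20, 30, 40]]
sharded = reduce_scatter(partials)  # activations stay sharded between TP blocks
assert all_gather(sharded) == all_reduce(partials)

# Ulysses: 2 ranks, 4 tokens, 4 heads; each entry is a (token, head) label.
world, tokens, heads = 2, 4, 4
chunk = tokens // world
seq_shards = [[[(t, h) for h in range(heads)] for t in range(r * chunk, (r + 1) * chunk)]
              for r in range(world)]
head_shards = seq_to_head_all_to_all(seq_shards)
print(head_shards[0])  # rank 0: every token, but only its own heads
assert head_to_seq_all_to_all(head_shards) == seq_shards
```

In the Ulysses half, the divisibility assertion is where the head-count constraint appears: with more ranks than heads, some ranks would have no head to own.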

Key Takeaways

  • Sequence Parallelism systematically attacks the quadratic activation memory bottleneck inherent to transformers.
  • Megatron SP resourcefully repurposes existing TP AllReduce collectives to shard sequence activations.
  • DeepSpeed Ulysses leverages All-to-All collectives for elegant head-sharding, maintaining perfect FlashAttention compatibility.
  • Hybrid formulations are required to overcome Ulysses' strict attention-head scaling limitations.