Distributed AI Training

Sequence Parallelism

Partitions the massive input activation tensors directly along the sequence dimension across the GPU cluster.

Published June 1, 2026 · By MortalApps · 5 min read · ~940 words

TL;DR

Partitions the large input activation tensors along the sequence dimension across the GPU cluster.
Megatron SP (TP-sp) splits TP's AllReduce into a ReduceScatter/AllGather pair so that activations stay sharded between TP blocks.
DeepSpeed Ulysses uses All-to-All collectives to reshuffle sequence-partitioned tensors into head-partitioned layouts, so attention runs locally and stays compatible with FlashAttention, as long as the SP degree does not exceed the number of attention heads.
Necessary for training on very long contexts (e.g., 100K+ tokens), where the activation memory for a single sequence exceeds a single GPU's VRAM.


Why This Matters

With the industry shifting toward long-context foundation models (e.g., GPT-4 128K, Claude 200K, Gemini 1M), activation memory becomes the limiting factor: it scales quadratically with sequence length in attention (when the full score matrix is materialized) and linearly in the other layers. A batch of 100K-token sequences can exhaust the memory of any single GPU, regardless of the model's parameter count. Sequence Parallelism (SP) splits this activation memory across the participating GPUs, so each device holds only its share of the sequence.
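To make the scaling concrete, here is a rough back-of-the-envelope sketch. It assumes a naive attention implementation that materializes the full score matrix, 2-byte (bf16) activations, and batch size 1; the model shape (`HIDDEN`, `HEADS`) and the GPU count are hypothetical values chosen only for illustration.

```python
def attention_score_bytes(seq_len: int, num_heads: int, bytes_per_elem: int = 2) -> int:
    """Bytes for one layer's full score matrix (heads x S x S), batch size 1."""
    return num_heads * seq_len * seq_len * bytes_per_elem


def hidden_state_bytes(seq_len: int, hidden_size: int, bytes_per_elem: int = 2) -> int:
    """Bytes for one (S x H) activation tensor, batch size 1."""
    return seq_len * hidden_size * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical shape, not taken from any specific model.
SEQ_LEN, HIDDEN, HEADS, NUM_GPUS = 100_000, 4096, 32, 8

scores = attention_score_bytes(SEQ_LEN, HEADS)   # quadratic in SEQ_LEN
hidden = hidden_state_bytes(SEQ_LEN, HIDDEN)     # linear in SEQ_LEN

# With the sequence split evenly, each GPU stores 1/NUM_GPUS of the tokens.
hidden_per_gpu = hidden_state_bytes(SEQ_LEN // NUM_GPUS, HIDDEN)

print(f"score matrix, one layer:   {scores / GIB:.1f} GiB")
print(f"hidden state, one tensor:  {hidden / GIB:.2f} GiB")
print(f"hidden state per GPU (SP): {hidden_per_gpu / GIB:.3f} GiB")
```

Doubling the sequence length doubles the hidden-state term but quadruples the score-matrix term, which is why long contexts run out of memory well before model size becomes the issue.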

Core Intuition

Instead of GPU 1 holding the entire sequence "The quick brown fox jumps over the lazy dog", GPU 1 holds "The quick", GPU 2 holds "brown fox", and so on. For token-wise operations like LayerNorm or GeLU, no communication is required, because each token is processed independently of the others. Attention is different: every token must interact with every other token, so the system has to do one of two things. It can either gather all tokens temporarily to compute attention (Megatron SP), or rearrange the tensor over the network so that each GPU holds the full sequence, but only for a subset of attention heads (DeepSpeed Ulysses).
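The token-wise claim can be checked with a small simulation. The sketch below uses plain Python lists in place of GPU tensors, a LayerNorm without learned scale or bias, and a hypothetical `shard_sequence` helper; it only shows that normalizing each shard separately gives the same result as normalizing the full sequence.

```python
import math


def layer_norm(token, eps=1e-5):
    """Normalize one token vector over its hidden dimension (no learned scale/bias)."""
    mean = sum(token) / len(token)
    var = sum((x - mean) ** 2 for x in token) / len(token)
    return [(x - mean) / math.sqrt(var + eps) for x in token]


def shard_sequence(tokens, num_gpus):
    """Split a sequence into contiguous, equal-sized chunks, one per simulated GPU."""
    assert len(tokens) % num_gpus == 0, "sequence length must divide evenly"
    chunk = len(tokens) // num_gpus
    return [tokens[i * chunk:(i + 1) * chunk] for i in range(num_gpus)]


# Toy activations: 8 tokens, hidden size 4.
sequence = [[float(t * 4 + d) ** 0.5 for d in range(4)] for t in range(8)]

# Reference: one device processes the whole sequence.
full = [layer_norm(tok) for tok in sequence]

# Sequence-parallel: each simulated GPU normalizes only its own tokens.
shards = shard_sequence(sequence, num_gpus=4)
sharded = [[layer_norm(tok) for tok in shard] for shard in shards]

# Concatenating the per-GPU results reproduces the reference, with no communication.
reassembled = [tok for shard in sharded for tok in shard]
assert reassembled == full
```

The same check would fail for attention, since each output token there depends on keys and values held by other shards.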

Technical Deep Dive

There are three predominant, competing approaches to Sequence Parallelism:

SP Strategy	Communication Pattern	Comm Volume / GPU	Constraints
Megatron SP (TP-sp)	AllGather + ReduceScatter	High	Bound by TP size / NVLink
DeepSpeed Ulysses	Two All-to-Alls per layer		SP size ≤ number of heads
Ring Attention	P2P ring passes		None (infinite scaling)

Megatron SP (TP-sp): Designed to run alongside existing Tensor Parallelism. It prevents activations from being replicated after each TP block by changing the collectives used in the forward pass. An AllReduce is equivalent to a ReduceScatter followed by an AllGather. By ending the TP block with only the ReduceScatter, the activations stay sharded along the sequence dimension. Right before the next TP block begins, an AllGather reconstitutes the full sequence locally.

DeepSpeed Ulysses: Partitions the input tensor along the sequence dimension. Before the attention calculation, it runs an All-to-All collective across the fabric to transpose the tensor from a sequence-partitioned layout to a head-partitioned layout. GPU 1 receives Head 1 for all tokens across the cluster. Attention is then computed entirely locally (making it fully compatible with FlashAttention 3), followed by a second All-to-All that transposes the tensor back.
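Both communication patterns can be illustrated with a single-process simulation. The sketch below models each rank as a Python list rather than calling a real collective library, and all function names are hypothetical. It assumes two ranks and, for the Ulysses part, exactly one attention head per rank.

```python
def all_reduce(per_rank):
    """Every rank ends up with the element-wise sum of all ranks' vectors."""
    total = [sum(col) for col in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def reduce_scatter(per_rank):
    """Rank r keeps only the r-th chunk of the element-wise sum."""
    n = len(per_rank)
    chunk = len(per_rank[0]) // n
    total = [sum(col) for col in zip(*per_rank)]
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]


def all_gather(per_rank):
    """Every rank ends up with the concatenation of all ranks' chunks."""
    full = [x for shard in per_rank for x in shard]
    return [list(full) for _ in per_rank]


def seq_to_head(per_rank):
    """First All-to-All: per_rank[r][t][h] (local tokens, all heads)
    becomes out[h][t] (all tokens in global order, one head per rank)."""
    n = len(per_rank)
    return [[tok[dst] for src in range(n) for tok in per_rank[src]]
            for dst in range(n)]


def head_to_seq(per_rank):
    """Second All-to-All: the inverse transpose, back to sequence sharding."""
    n = len(per_rank)
    local = len(per_rank[0]) // n
    return [[[per_rank[h][t] for h in range(n)]
             for t in range(r * local, (r + 1) * local)]
            for r in range(n)]


# Megatron SP: partial sums from 2 TP ranks over 4 sequence positions.
partials = [[1, 2, 3, 4], [10, 20, 30, 40]]
sharded = reduce_scatter(partials)        # each rank holds half the sequence
assert all_gather(sharded) == all_reduce(partials)

# Ulysses: 4 tokens, 2 heads; each entry is labeled (token, head).
seq_layout = [[[(t, h) for h in range(2)] for t in (0, 1)],   # rank 0: tokens 0-1
              [[(t, h) for h in range(2)] for t in (2, 3)]]   # rank 1: tokens 2-3
head_layout = seq_to_head(seq_layout)
assert head_layout[0] == [(0, 0), (1, 0), (2, 0), (3, 0)]     # rank 0: head 0, all tokens
assert head_to_seq(head_layout) == seq_layout
```

Between the ReduceScatter and the AllGather, each rank stores only its own slice of the sequence, which is where the memory saving comes from. In the Ulysses layout, each rank sees every token for its head, so an unmodified local attention kernel can run on it.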

Key Takeaways

Sequence Parallelism systematically attacks the quadratic activation memory bottleneck inherent to transformers.

Megatron SP resourcefully repurposes existing TP AllReduce collectives to shard sequence activations.

DeepSpeed Ulysses leverages All-to-All collectives for elegant head-sharding, maintaining perfect FlashAttention compatibility.

Hybrid formulations are required to overcome Ulysses' strict attention-head scaling limitations.
