Context Parallelism in Attention
Source: mortalapps.com
- Context Parallelism (CP) is an architecture pattern that shards the input sequence across multiple GPUs to enable training and inference on multi-million token contexts.
- It distributes the attention computation itself, unlike Sequence Parallelism (SP), which only distributes the non-attention auxiliary layers.
- The two dominant implementations are DeepSpeed Ulysses (All-to-All communication) and Ring Attention (P2P communication).
- Choosing between them depends entirely on hardware topology, sequence length, and the number of attention heads.
Why This Matters
As AI pushes towards analyzing whole code repositories, hour-long videos, and complex agent trajectories, single-GPU memory cannot hold the attention matrix or even the activations. Tensor Parallelism (TP) splits model weights, Pipeline Parallelism (PP) splits layers, but only Context Parallelism specifically addresses the memory scaling of the context length dimension itself. Without CP, context lengths are hard-capped by the memory of a single accelerator.
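To make the memory pressure concrete, here is a rough back-of-the-envelope sketch. It assumes the attention score matrix is naively materialized at 2 bytes per element (for example fp16 or bf16); the function names are hypothetical and the figures are illustrative only, since real kernels may avoid storing the full matrix.

```python
def naive_scores_bytes(seq_len: int, num_heads: int = 1, bytes_per_element: int = 2) -> int:
    """Bytes for a fully materialized seq_len x seq_len score matrix, per layer."""
    return seq_len * seq_len * num_heads * bytes_per_element


def ring_block_scores_bytes(seq_len: int, cp_degree: int, num_heads: int = 1,
                            bytes_per_element: int = 2) -> int:
    """Bytes for one local block of scores when both queries and keys are
    split into cp_degree equal chunks (assumes seq_len divides evenly)."""
    if seq_len % cp_degree != 0:
        raise ValueError("seq_len must be divisible by cp_degree in this sketch")
    chunk = seq_len // cp_degree
    return chunk * chunk * num_heads * bytes_per_element


if __name__ == "__main__":
    for n in (8_192, 131_072, 1_000_000):
        full_gb = naive_scores_bytes(n) / 1e9
        block_gb = ring_block_scores_bytes(n, cp_degree=8) / 1e9
        print(f"seq_len={n:>9,}  full={full_gb:,.2f} GB  per-block (CP=8)={block_gb:,.2f} GB")
```

The quadratic term is the point: doubling the sequence length quadruples the score matrix, while sharding the sequence across GPUs shrinks each locally computed block by the square of the CP degree.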
Core Intuition
If you have a massive sequence, how do you divide the labor? Megatron Sequence Parallelism (SP) keeps the entire sequence on all GPUs for the attention calculation, but shards the sequence for Dropout and LayerNorm. Context Parallelism (CP) actually shards the attention mechanism. The Ulysses approach says: "I will compute attention for the whole sequence, but only for Head 1. You compute the whole sequence for Head 2." The Ring approach says: "I will compute all heads, but only for the first chunk of tokens. You do the next chunk, and we'll trade notes."
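The two ways of dividing the labor can be sketched as simple work-assignment functions. This is an illustration only: the function names are hypothetical, and the even-divisibility checks are a simplifying assumption of the sketch, not a statement about any particular library.

```python
def ulysses_head_assignment(num_heads: int, cp_degree: int) -> list[list[int]]:
    """Ulysses-style split: each rank sees the whole sequence, but only some heads."""
    if num_heads % cp_degree != 0:
        raise ValueError("this sketch assumes num_heads is divisible by cp_degree")
    per_rank = num_heads // cp_degree
    return [list(range(r * per_rank, (r + 1) * per_rank)) for r in range(cp_degree)]


def ring_token_assignment(seq_len: int, cp_degree: int) -> list[range]:
    """Ring-style split: each rank sees all heads, but only one chunk of tokens."""
    if seq_len % cp_degree != 0:
        raise ValueError("this sketch assumes seq_len is divisible by cp_degree")
    chunk = seq_len // cp_degree
    return [range(r * chunk, (r + 1) * chunk) for r in range(cp_degree)]


if __name__ == "__main__":
    for rank, heads in enumerate(ulysses_head_assignment(num_heads=8, cp_degree=4)):
        print(f"Ulysses rank {rank}: all tokens, heads {heads}")
    for rank, tokens in enumerate(ring_token_assignment(seq_len=4000, cp_degree=4)):
        print(f"Ring rank {rank}: all heads, tokens {tokens.start}..{tokens.stop - 1}")
```

Note that the Ulysses-style function fails as soon as there are more ranks than heads, while the ring-style function only cares about the sequence length.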
Technical Deep Dive
DeepSpeed Ulysses splits the sequence across GPUs, then uses All-to-All collective communication to transpose the layout so each GPU has the full sequence but only for a subset of attention heads. A critical constraint of Ulysses is that the degree of parallelism cannot exceed the number of attention heads: a model with 64 heads cannot use more than 64 GPUs for Ulysses. Its communication volume stays constant when sequence length and device count are increased proportionally.
Conversely, Ring Attention shards the sequence, and each GPU computes Multi-Head Attention (MHA) for its specific sequence chunk, passing KV blocks around a ring topology. While Ring has no theoretical limit on parallelization degree, the latency of its P2P transfers can become a major bottleneck.
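The ring pattern can be simulated in a single process to see why passing KV blocks around still yields exact attention. The sketch below is a simplification under stated assumptions: one head, no causal mask, plain Python lists instead of tensors, and a list rotation standing in for the P2P send/receive. Partial results are merged with a running-max softmax accumulation, a detail the text above does not spell out; all function names are hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, k, v):
    """Reference single-head softmax attention over the whole sequence."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * _dot(qi, kj) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out


def ring_attention(q, k, v, cp_degree):
    """Simulate cp_degree ranks, each owning one chunk of Q, K and V."""
    n, dim_v = len(q), len(v[0])
    if n % cp_degree != 0:
        raise ValueError("this sketch assumes the sequence divides evenly")
    chunk = n // cp_degree
    scale = 1.0 / math.sqrt(len(q[0]))

    def shard(x, r):
        return x[r * chunk:(r + 1) * chunk]

    # kv[r] is the K/V block currently held by rank r.
    kv = [(shard(k, r), shard(v, r)) for r in range(cp_degree)]
    # Per local query: (running max, softmax denominator, weighted-V numerator).
    state = [[(-math.inf, 0.0, [0.0] * dim_v) for _ in range(chunk)]
             for _ in range(cp_degree)]

    for _ in range(cp_degree):
        for r in range(cp_degree):
            k_blk, v_blk = kv[r]
            for i, qi in enumerate(shard(q, r)):
                m_old, denom, numer = state[r][i]
                scores = [scale * _dot(qi, kj) for kj in k_blk]
                m_new = max(m_old, max(scores))
                rescale = math.exp(m_old - m_new)  # 0.0 on the first step
                weights = [math.exp(s - m_new) for s in scores]
                denom = denom * rescale + sum(weights)
                numer = [x * rescale + sum(w * vj[d] for w, vj in zip(weights, v_blk))
                         for d, x in enumerate(numer)]
                state[r][i] = (m_new, denom, numer)
        # Stand-in for P2P: every rank receives the block held by its left neighbour.
        kv = [kv[(r - 1) % cp_degree] for r in range(cp_degree)]

    return [[x / denom for x in numer]
            for r in range(cp_degree) for (_, denom, numer) in state[r]]


if __name__ == "__main__":
    seq = [[math.sin(3 * i + d) for d in range(4)] for i in range(8)]
    ref = full_attention(seq, seq, seq)
    out = ring_attention(seq, seq, seq, cp_degree=4)
    print(max(abs(a - b) for ra, rb in zip(ref, out) for a, b in zip(ra, rb)))
```

After `cp_degree` steps every rank has seen every KV block exactly once, so each query's output matches full attention up to floating-point error. In a real deployment the rotation is a network transfer per step, which is where the latency concern comes from.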