Context Parallelism in Attention

Source: mortalapps.com
TL;DR
  • Context Parallelism (CP) is an architecture pattern that shards the input sequence across multiple GPUs to enable training and inference on multi-million-token contexts.
  • It distributes the attention computation itself, unlike Sequence Parallelism (SP), which shards only non-attention layers such as Dropout and LayerNorm.
  • The two dominant implementations are DeepSpeed Ulysses (All-to-All communication) and Ring Attention (P2P communication).
  • Choosing between them depends entirely on hardware topology, sequence length, and the number of attention heads.

Why This Matters

As AI pushes towards analyzing whole code repositories, hour-long videos, and complex agent trajectories, single-GPU memory cannot hold the attention matrix or even the activations. Tensor Parallelism (TP) splits model weights, Pipeline Parallelism (PP) splits layers, but only Context Parallelism specifically addresses the memory scaling of the context length dimension itself. Without CP, context lengths are hard-capped by the memory of a single accelerator.
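To put rough numbers on that memory wall, here is a back-of-the-envelope sketch. The function names and the model shape (about one million tokens, 32 heads, 2-byte scores) are illustrative assumptions, not figures from any particular model, and the estimate covers only a naively materialized score matrix for a single layer.

```python
def attention_scores_bytes(seq_len: int, num_heads: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to materialize one seq_len x seq_len score matrix per head."""
    return seq_len * seq_len * num_heads * bytes_per_elem


def per_gpu_scores_bytes(seq_len: int, num_heads: int, cp_degree: int,
                         bytes_per_elem: int = 2) -> int:
    """Each GPU's share when the score work is split evenly over cp_degree GPUs.

    The even split holds whether the work is divided by head or by query rows.
    """
    return attention_scores_bytes(seq_len, num_heads, bytes_per_elem) // cp_degree


GIB = 2**30
seq_len, num_heads = 2**20, 32  # hypothetical shape: ~1M tokens, 32 heads, fp16

print(f"single GPU: {attention_scores_bytes(seq_len, num_heads) / GIB:,.0f} GiB")
for cp in (8, 64):
    share = per_gpu_scores_bytes(seq_len, num_heads, cp) / GIB
    print(f"CP degree {cp}: {share:,.0f} GiB per GPU")
```

Fused attention kernels avoid materializing the full score matrix, so real footprints are far smaller than this naive figure. The quadratic term still shows why the context dimension has to be split somewhere once sequences reach this scale.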

Core Intuition

If you have a massive sequence, how do you divide the labor? Megatron Sequence Parallelism (SP) keeps the entire sequence on every GPU for the attention calculation and shards the sequence only for Dropout and LayerNorm. Context Parallelism (CP) shards the attention mechanism itself. The Ulysses approach says: "I will compute attention for the whole sequence, but only for Head 1. You compute the whole sequence for Head 2." The Ring approach says: "I will compute all heads, but only for the first few thousand tokens. You do the next few thousand, and we'll trade notes."
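The Ulysses half of this intuition ("whole sequence, one head") can be simulated on a single machine. The sketch below is a NumPy stand-in, not DeepSpeed's implementation: each "rank" is a list entry, the All-to-All exchange is replaced by slicing and concatenation, and all function names are hypothetical. It assumes non-causal attention and a head count divisible by the number of ranks.

```python
import numpy as np


def attention(q, k, v):
    """Plain non-causal attention. q, k, v: [seq, heads, dim]."""
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,khd->qhd", probs, v)


def all_to_all_seq_to_heads(shards):
    """P shards of [S/P, H, D] (sequence-sharded) -> P shards of [S, H/P, D]."""
    num_ranks, num_heads = len(shards), shards[0].shape[1]
    if num_heads % num_ranks != 0:
        raise ValueError("head count must be divisible by the number of ranks")
    hp = num_heads // num_ranks
    # Rank r receives head slice r from every rank, ordered by source rank.
    return [np.concatenate([s[:, r * hp:(r + 1) * hp] for s in shards], axis=0)
            for r in range(num_ranks)]


def all_to_all_heads_to_seq(shards):
    """Inverse exchange: P shards of [S, H/P, D] -> P shards of [S/P, H, D]."""
    num_ranks = len(shards)
    sp = shards[0].shape[0] // num_ranks
    # Rank r receives sequence chunk r from every rank, ordered by head group.
    return [np.concatenate([s[r * sp:(r + 1) * sp] for s in shards], axis=1)
            for r in range(num_ranks)]


def ulysses_attention(q, k, v, num_ranks):
    """Simulated Ulysses-style attention over num_ranks sequence shards."""
    q_h = all_to_all_seq_to_heads(np.split(q, num_ranks, axis=0))
    k_h = all_to_all_seq_to_heads(np.split(k, num_ranks, axis=0))
    v_h = all_to_all_seq_to_heads(np.split(v, num_ranks, axis=0))
    # Each rank now sees the full sequence for its own subset of heads.
    out_h = [attention(qr, kr, vr) for qr, kr, vr in zip(q_h, k_h, v_h)]
    # Swap back so each rank again owns a contiguous sequence chunk.
    return np.concatenate(all_to_all_heads_to_seq(out_h), axis=0)


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8, 4)) for _ in range(3))
print(np.allclose(ulysses_attention(q, k, v, num_ranks=4), attention(q, k, v)))
```

Because attention heads never interact inside the attention step, computing each head group on a different rank reproduces the single-device result. The only cost is the pair of layout exchanges before and after.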

Technical Deep Dive

DeepSpeed Ulysses splits the sequence across GPUs, then uses an All-to-All collective to transpose the layout so that each GPU holds the full sequence but only a subset of the attention heads. A critical constraint of Ulysses is that the degree of parallelism cannot exceed the number of attention heads: a model with 64 heads cannot use more than 64 GPUs for Ulysses. In exchange, its communication volume stays constant when sequence length and device count are increased proportionally.

Ring Attention, by contrast, keeps the sequence sharded: each GPU computes Multi-Head Attention (MHA) for its own sequence chunk and passes KV blocks to its neighbor in a ring topology. Ring has no theoretical limit on the degree of parallelism, but the latency of its P2P transfers can become a major bottleneck.
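The ring pattern described above can also be simulated in one process. The following is a minimal NumPy sketch, assuming non-causal attention and equal-sized chunks: list entries play the role of GPUs, and the P2P send is modeled by rotating a list. Partial results are combined with the usual running-max ("online") softmax so that no rank ever needs the full score row at once. The function names are hypothetical, and real implementations overlap the transfer with compute and handle causal masking, which this sketch omits.

```python
import numpy as np


def reference_attention(q, k, v):
    """Single-device non-causal attention. q, k, v: [seq, heads, dim]."""
    s = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return np.einsum("hqk,khd->qhd", p / p.sum(axis=-1, keepdims=True), v)


def ring_attention(q, k, v, num_ranks):
    """Simulated ring attention over num_ranks equal sequence chunks."""
    d = q.shape[-1]
    q_chunks = np.split(q, num_ranks, axis=0)
    # kv[r] is the KV block currently resident on rank r.
    kv = list(zip(np.split(k, num_ranks, axis=0), np.split(v, num_ranks, axis=0)))

    # Per-rank online-softmax state: running max, running sum, weighted values.
    run_max = [np.full((c.shape[1], c.shape[0]), -np.inf) for c in q_chunks]
    run_sum = [np.zeros((c.shape[1], c.shape[0])) for c in q_chunks]
    acc = [np.zeros((c.shape[1], c.shape[0], d)) for c in q_chunks]

    for _ in range(num_ranks):
        for r in range(num_ranks):
            k_blk, v_blk = kv[r]
            s = np.einsum("qhd,khd->hqk", q_chunks[r], k_blk) / np.sqrt(d)
            new_max = np.maximum(run_max[r], s.max(axis=-1))
            # Rescale earlier partial results to the new max (0 on first step).
            scale = np.exp(run_max[r] - new_max)
            p = np.exp(s - new_max[..., None])
            run_sum[r] = run_sum[r] * scale + p.sum(axis=-1)
            acc[r] = acc[r] * scale[..., None] + np.einsum("hqk,khd->hqd", p, v_blk)
            run_max[r] = new_max
        # Stand-in for the P2P step: rank r receives the block from rank r - 1.
        kv = kv[-1:] + kv[:-1]

    # Normalize and restore the [seq, heads, dim] layout per chunk.
    out = [np.transpose(a / z[..., None], (1, 0, 2)) for a, z in zip(acc, run_sum)]
    return np.concatenate(out, axis=0)


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 4, 8)) for _ in range(3))
print(np.allclose(ring_attention(q, k, v, num_ranks=4), reference_attention(q, k, v)))
```

After `num_ranks` rotations every query chunk has seen every KV block exactly once, which is why the result matches the single-device reference. Nothing in the loop depends on the head count, which reflects the absence of a head-based cap on the ring degree.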

Key Takeaways

  • Context Parallelism distributes the attention computation itself to scale context lengths.
  • DeepSpeed Ulysses shards across heads using All-to-All collective communication.
  • Ring Attention shards across the sequence using P2P block transfers.
  • The Ulysses degree of parallelism is capped by the number of attention heads, which is especially limiting for GQA models, where KV heads are few.