Expert Parallelism for MoE
Source: mortalapps.com
- Distributes the specialized sub-networks (Experts) of a Mixture-of-Experts (MoE) foundation model across distinct, physically separated GPUs.
- Uses All-to-All network collectives to route each token to its designated experts and to return the processed activations.
- Uses capacity factors and token-dropping thresholds to bound load imbalance when routing disproportionately favors a few experts.
- Highly sensitive to interconnect topology; Wide-EP on coherent rack-scale architectures (NVL72) is currently among the most aggressive MoE scaling configurations.
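The capacity-factor bullet above can be illustrated with a minimal sketch. It assumes the commonly used rule that each expert accepts at most `ceil(capacity_factor * num_tokens * top_k / num_experts)` tokens and that overflow tokens are dropped in arrival order. The function names and the drop policy are illustrative assumptions, not taken from any particular framework.

```python
import math

def expert_capacity(num_tokens, num_experts, top_k, capacity_factor):
    """Maximum tokens one expert accepts per batch (assumed formula)."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)

def apply_capacity(assignments, num_experts, capacity):
    """Keep tokens in arrival order until an expert is full; drop the rest.

    assignments: list of (token_id, expert_id) pairs.
    Returns (kept, dropped), two lists of the same pairs.
    """
    load = [0] * num_experts
    kept, dropped = [], []
    for token_id, expert_id in assignments:
        if load[expert_id] < capacity:
            load[expert_id] += 1
            kept.append((token_id, expert_id))
        else:
            dropped.append((token_id, expert_id))
    return kept, dropped

# 8 tokens, 4 experts, top-1 routing, heavily skewed toward expert 0.
assignments = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 1), (6, 2), (7, 3)]
capacity = expert_capacity(num_tokens=8, num_experts=4, top_k=1, capacity_factor=1.25)
kept, dropped = apply_capacity(assignments, num_experts=4, capacity=capacity)
print(capacity, dropped)  # 3 [(3, 0), (4, 0)]
```

A capacity factor of 1.0 would leave no headroom above a perfectly balanced load; larger values trade extra buffer memory and padding for fewer dropped tokens.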
Why This Matters
MoE architectures (for example DeepSeek-R1 with 256 routed experts per MoE layer, Mixtral, and reportedly GPT-4) greatly increase a model's total parameter count and learning capacity without a proportional increase in the FLOPs spent per token, because only a few experts are active for any given token. The full set of weights must still reside in memory, however: the 671 billion parameters of DeepSeek-R1 far exceed the capacity of any single GPU. Expert Parallelism (EP) addresses this by placing different experts on different GPUs, which enables scalable training and low-latency inference for large sparse models.
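To put a rough number on the memory pressure, the sketch below computes the weight footprint of a 671-billion-parameter model. The 80 GB of HBM per GPU and the two storage precisions are assumptions chosen for illustration. The estimate covers weights only and ignores activations, KV cache, and optimizer state.

```python
import math

TOTAL_PARAMS = 671e9      # total parameter count quoted in the text
HBM_PER_GPU_GB = 80       # assumption: an accelerator with 80 GB of HBM

def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed to hold the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

for name, nbytes in [("FP8", 1), ("BF16", 2)]:
    total_gb = weight_memory_gb(TOTAL_PARAMS, nbytes)
    min_gpus = math.ceil(total_gb / HBM_PER_GPU_GB)
    print(f"{name}: {total_gb:.0f} GB of weights, at least {min_gpus} GPUs")
```

Even this lower bound needs multiple GPUs, which is why the experts have to be partitioned across devices in the first place.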
Core Intuition
In a standard dense neural network layer, every token is processed by the same Multi-Layer Perceptron (MLP). In an MoE layer, a Router network scores the available MLPs (Experts) and selects the best-scoring ones for each token (e.g., top-2 routing). If Expert 1 resides on GPU A and Expert 2 resides on GPU B, a token currently on GPU A that is bound for Expert 2 must be packaged, sent over the network fabric to GPU B, processed by Expert 2, and then returned over the network to GPU A, where it continues through the subsequent dense layers.
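The routing decision described above can be sketched in a few lines. This is an illustrative top-2 router, not any specific model's implementation. It assumes softmax scores renormalized over the selected experts, zero-based expert ids, and a hypothetical `EXPERT_TO_GPU` placement with one expert per GPU.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(logits, k=2):
    """Return [(expert_id, weight), ...] for the k highest-scoring experts.

    Weights are softmax probabilities renormalized over the chosen k
    (one common convention, assumed here).
    """
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    norm = sum(probs[e] for e in chosen)
    return [(e, probs[e] / norm) for e in chosen]

# Hypothetical placement: 4 experts, one per GPU.
EXPERT_TO_GPU = {0: "GPU A", 1: "GPU B", 2: "GPU C", 3: "GPU D"}

def plan_token(home_gpu, logits, k=2):
    """List where a token must go and whether each hop crosses the network."""
    plan = []
    for expert_id, weight in top_k_route(logits, k):
        target = EXPERT_TO_GPU[expert_id]
        hop = "local" if target == home_gpu else "remote"
        plan.append((expert_id, target, hop, round(weight, 3)))
    return plan

print(plan_token("GPU A", [2.0, 1.0, 0.1, -1.0]))
# [(0, 'GPU A', 'local', 0.731), (1, 'GPU B', 'remote', 0.269)]
```

Only the `remote` entries generate network traffic; the `local` expert is applied without leaving the GPU.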
Technical Deep Dive
The primary communication primitive governing EP performance is the All-to-All collective.
| EP Phase | Collective | Network Payload | Compute Action |
|---|---|---|---|
| Dispatch | All-to-All | Tokens routed to remote GPUs | Router logits selection |
| Process | None | None | Fused GroupGEMM |
| Gather | All-to-All | Processed activations returned | Weighted sum combination |
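The three phases in the table can be traced with a small single-process simulation. It is a sketch under simplifying assumptions: "ranks" are plain list indices, each rank hosts exactly one expert, tokens are scalars, and `all_to_all` is a hypothetical helper that only transposes send buffers. A real deployment would use a communication library's collective instead.

```python
def all_to_all(send):
    """send[src][dst] is what rank src sends to rank dst.

    Returns recv with recv[dst][src] holding what dst received from src.
    """
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]

def moe_layer(tokens_per_rank, routes_per_rank, experts):
    """tokens_per_rank[r][i]: value of token i on rank r.
    routes_per_rank[r][i]: list of (expert_rank, router_weight) for that token.
    experts[r]: the expert function hosted on rank r.
    """
    n = len(tokens_per_rank)
    # Dispatch: bucket each (token_index, value) by destination rank.
    send = [[[] for _ in range(n)] for _ in range(n)]
    for src in range(n):
        for i, x in enumerate(tokens_per_rank[src]):
            for dst, _weight in routes_per_rank[src][i]:
                send[src][dst].append((i, x))
    recv = all_to_all(send)
    # Process: each rank applies its local expert; no communication.
    done = [[[(i, experts[dst](x)) for i, x in recv[dst][src]]
             for src in range(n)] for dst in range(n)]
    # Gather: a second All-to-All returns results to the origin ranks,
    # where they are combined as a router-weighted sum.
    back = all_to_all(done)
    out = [[0.0] * len(tokens_per_rank[r]) for r in range(n)]
    for origin in range(n):
        weights = [dict(routes) for routes in routes_per_rank[origin]]
        for expert_rank in range(n):
            for i, y in back[origin][expert_rank]:
                out[origin][i] += weights[i][expert_rank] * y
    return out

experts = [lambda x: 2 * x, lambda x: x + 10]   # toy experts on ranks 0 and 1
tokens = [[1.0, 2.0], [3.0]]
routes = [[[(0, 0.5), (1, 0.5)], [(1, 1.0)]], [[(0, 1.0)]]]
print(moe_layer(tokens, routes, experts))  # [[6.5, 12.0], [6.0]]
```

Carrying the token index alongside each value is what lets the origin rank put the returned activations back in their original positions.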
Let $B$ represent batch size, $S$ sequence length, and $E$ the number of experts. The gating network produces logits over the $E$ experts for each of the $B \times S$ tokens, and the Top-k experts are chosen per token. An All-to-All collective reshuffles the tokens from a Data-Parallel layout into an Expert-Parallel layout, in which each GPU holds the tokens bound for its local experts. Fused GroupGEMM kernels then process the tokens. With Wide-EP (spreading the experts over a large number of GPUs), the number of experts held per GPU falls in inverse proportion to the number of GPUs. Each GPU then loads fewer expert weights while each resident expert receives tokens from more GPUs, which raises the arithmetic intensity (FLOPs per byte of weight loaded from HBM) and improves the compute/memory balance inside the kernel. A second All-to-All returns the tokens to their origin GPUs, where the expert outputs are multiplied by the router probabilities and summed; this weighting also keeps the router differentiable.
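The effect of Wide-EP on arithmetic intensity can be estimated with a back-of-the-envelope model. The sketch assumes perfectly balanced routing, a fixed number of tokens contributed by each GPU, 2 FLOPs per parameter per token, and weights read from HBM once per step. The 128 tokens per GPU, top-8 routing, and 1 byte per parameter are illustrative assumptions.

```python
def experts_per_gpu(num_experts, ep_size):
    """Experts resident on each GPU when experts are split evenly."""
    assert num_experts % ep_size == 0, "experts must divide evenly across GPUs"
    return num_experts // ep_size

def flops_per_weight_byte(tokens_per_gpu, top_k, num_experts, ep_size,
                          bytes_per_param=1):
    """Idealized arithmetic intensity of one expert's GEMM.

    Each expert sees global_tokens * top_k / num_experts tokens and spends
    2 FLOPs per parameter per token, so the parameter count cancels out.
    """
    global_tokens = tokens_per_gpu * ep_size
    tokens_per_expert = global_tokens * top_k / num_experts
    return 2 * tokens_per_expert / bytes_per_param

for ep_size in (8, 64, 256):
    print(ep_size,
          experts_per_gpu(256, ep_size),
          flops_per_weight_byte(tokens_per_gpu=128, top_k=8,
                                num_experts=256, ep_size=ep_size))
# 8 32 64.0
# 64 4 512.0
# 256 1 2048.0
```

Under these assumptions the intensity grows linearly with the EP width, because the same per-GPU token budget is concentrated onto fewer resident experts.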