
ZeRO Optimization Architecture

Zero Redundancy Optimizer (ZeRO) systematically eliminates memory redundancy inherent in distributed data-parallel training protocols.

Source: mortalapps.com
TL;DR
  • Zero Redundancy Optimizer (ZeRO) systematically eliminates memory redundancy inherent in distributed data-parallel training protocols.
  • The architecture is logically partitioned into three discrete stages: ZeRO-1 (Optimizer States), ZeRO-2 (Gradients), and ZeRO-3 (Parameters).
  • Enables training of trillion-parameter-scale models, because per-GPU model-state memory shrinks in proportion to the number of participating GPUs.
  • Forms the foundational backend architecture of Microsoft's DeepSpeed distributed training framework.

Why This Matters

As foundation models grow, optimizer states (e.g., Adam's running momentum and variance) consume a large share of GPU memory. In mixed-precision FP16 training, the Adam optimizer state requires 12 bytes per parameter (4 bytes each for FP32 momentum, FP32 variance, and FP32 master weights). For a 100-billion-parameter model, this comes to 1.2 terabytes for the optimizer state alone, far more than any single GPU can hold. ZeRO shards this state across the cluster, relieving the memory capacity bottleneck without the compute idle time (pipeline bubbles) characteristic of pipeline parallelism.
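The 1.2 TB figure follows directly from the per-parameter byte counts. A minimal sketch of that arithmetic is below; the function and constant names are illustrative, and decimal terabytes (10^12 bytes) are assumed.

```python
# Bytes per parameter for Adam state in mixed-precision training,
# using the breakdown given above (all three tensors kept in FP32).
FP32_BYTES = 4
ADAM_STATE_BYTES_PER_PARAM = {
    "master_weights": FP32_BYTES,
    "momentum": FP32_BYTES,
    "variance": FP32_BYTES,
}


def optimizer_state_bytes(num_params: int) -> int:
    """Total optimizer-state memory in bytes for a model with num_params parameters."""
    return num_params * sum(ADAM_STATE_BYTES_PER_PARAM.values())


if __name__ == "__main__":
    params = 100_000_000_000  # 100 billion parameters
    total = optimizer_state_bytes(params)
    print(f"{total / 1e12:.1f} TB")  # prints "1.2 TB"
```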

Core Intuition

The intuition behind ZeRO is fractional ownership. If a cluster of 100 GPUs trains a single model with plain data parallelism, it keeps 100 identical copies of the optimizer state, so 99% of the memory spent on that state is redundant. Under ZeRO-1, each GPU is responsible for optimizing only its own 1% slice of the model: GPU 0 updates the first 1% of parameters, GPU 1 updates the next 1%, and so forth. Each GPU therefore needs the reduced gradients only for its own slice, so ZeRO-2 routes each gradient to its owning GPU and discards it elsewhere. In the final stage (ZeRO-3), the parameters themselves are sharded and fetched on demand. Crucially, ZeRO partitions the static model states but not the dynamic activation memory.
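The ownership idea can be sketched as a toy, single-process simulation. This is not how DeepSpeed implements it: the helper names are hypothetical, the ranks are simulated with plain lists, and plain SGD stands in for Adam to keep the example short.

```python
from typing import List


def shard_bounds(num_params: int, world_size: int, rank: int) -> range:
    """Contiguous slice of parameter indices owned by `rank` (hypothetical helper)."""
    per_rank = -(-num_params // world_size)  # ceiling division
    start = min(rank * per_rank, num_params)
    stop = min(start + per_rank, num_params)
    return range(start, stop)


def zero1_sgd_step(
    params: List[float], local_grads: List[List[float]], lr: float
) -> List[float]:
    """Simulate one ZeRO-1-style step with SGD.

    Gradients are averaged across ranks, each rank updates only the
    indices it owns, and the updated shards are reassembled into a full
    parameter vector (standing in for the all-gather).
    """
    world_size = len(local_grads)
    n = len(params)
    # Averaged gradient, i.e. the result of the gradient reduction.
    avg = [sum(g[i] for g in local_grads) / world_size for i in range(n)]
    updated = list(params)
    for rank in range(world_size):
        for i in shard_bounds(n, world_size, rank):
            updated[i] = params[i] - lr * avg[i]  # only the owner writes index i
    return updated


if __name__ == "__main__":
    # With 100 ranks and 1000 parameters, each rank owns 10 of them (1%).
    print(len(shard_bounds(1000, 100, 0)))
```

Because every index belongs to exactly one rank, the result matches what replicated data-parallel SGD would compute; the difference is that the optimizer state for a given index only needs to exist on its owner.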

Technical Deep Dive

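The per-GPU memory formulas underlying ZeRO are worth knowing for capacity planning. Consider a model with $\Psi$ parameters trained across $N_d$ GPUs with mixed-precision Adam, so that FP16 weights take $2\Psi$ bytes, FP16 gradients take $2\Psi$ bytes, and the optimizer state takes $12\Psi$ bytes:

| ZeRO Stage | Sharded State | Memory per GPU Formula | Description |
|---|---|---|---|
| Baseline (DDP) | None | $2\Psi + 2\Psi + 12\Psi = 16\Psi$ | Full replication of weights ($2\Psi$), gradients ($2\Psi$), optimizer ($12\Psi$). |
| ZeRO-1 ($P_{os}$) | Optimizer | $2\Psi + 2\Psi + \frac{12\Psi}{N_d}$ | Optimizer state is sharded. No extra communication overhead. |
| ZeRO-2 ($P_{os+g}$) | Opt + Gradients | $2\Psi + \frac{(2 + 12)\Psi}{N_d}$ | Replaces AllReduce with ReduceScatter for gradients. |
| ZeRO-3 ($P_{os+g+p}$) | Opt + Grad + Params | $\frac{(2 + 2 + 12)\Psi}{N_d}$ | Parameters sharded. Demands heavy AllGather traffic. |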
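The per-stage footprint can be estimated with a small calculator. The sketch below assumes 2 bytes per parameter for FP16 weights, 2 bytes for FP16 gradients, and the 12 bytes of Adam state described earlier; the function name and the 7.5B-parameter, 64-GPU example are illustrative choices, and activations, buffers, and fragmentation are ignored.

```python
def model_state_bytes_per_gpu(
    num_params: int, num_gpus: int, stage: int, k: int = 12
) -> float:
    """Per-GPU model-state memory in bytes for stage 0 (plain DDP) through 3.

    k is the optimizer-state cost in bytes per parameter (12 for
    mixed-precision Adam).
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    weights = 2 * num_params  # FP16 parameters
    grads = 2 * num_params    # FP16 gradients
    optim = k * num_params    # FP32 master weights + momentum + variance
    if stage >= 1:
        optim /= num_gpus     # ZeRO-1 shards the optimizer state
    if stage >= 2:
        grads /= num_gpus     # ZeRO-2 also shards gradients
    if stage >= 3:
        weights /= num_gpus   # ZeRO-3 also shards parameters
    return weights + grads + optim


if __name__ == "__main__":
    psi, n_d = 7_500_000_000, 64
    for stage in range(4):
        gb = model_state_bytes_per_gpu(psi, n_d, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB")
    # stage 0: 120.0 GB
    # stage 1: 31.4 GB
    # stage 2: 16.6 GB
    # stage 3: 1.9 GB
```

Only the sharded terms shrink as GPUs are added, which is why ZeRO-1 and ZeRO-2 flatten out at $4\Psi$ and $2\Psi$ bytes respectively while ZeRO-3 keeps falling.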

Key Takeaways

  • ZeRO refactors DDP by sharding optimizer states, gradients, and model parameters across GPUs instead of replicating them.
  • ZeRO-1 and ZeRO-2 keep the same communication volume as DDP while freeing large amounts of HBM.
  • ZeRO-3 lets model size scale with the number of GPUs, but shifts the bottleneck to interconnect bandwidth.
  • ZeRO reduces model-state memory only; it provides no relief for activation memory, which grows with sequence length.
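The point about ZeRO-2 matching DDP's communication rests on a standard identity: an all-reduce produces the same result as a reduce-scatter followed by an all-gather. The toy simulation below checks that identity with plain lists standing in for per-rank buffers. The function names mirror the collective operations but are local stand-ins, not calls into any communication library, and the buffer length is assumed to be divisible by the number of ranks.

```python
from typing import List

Buffers = List[List[float]]  # one list of values per simulated rank


def all_reduce(bufs: Buffers) -> Buffers:
    """Every rank ends up with the element-wise sum over all ranks."""
    n = len(bufs[0])
    total = [sum(b[i] for b in bufs) for i in range(n)]
    return [list(total) for _ in bufs]


def reduce_scatter(bufs: Buffers) -> Buffers:
    """Rank r ends up with the summed values for chunk r only."""
    world = len(bufs)
    chunk = len(bufs[0]) // world  # assumes an even split
    return [
        [sum(b[i] for b in bufs) for i in range(r * chunk, (r + 1) * chunk)]
        for r in range(world)
    ]


def all_gather(shards: Buffers) -> Buffers:
    """Every rank ends up with the concatenation of all shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


if __name__ == "__main__":
    grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
    assert all_gather(reduce_scatter(grads)) == all_reduce(grads)
    print(reduce_scatter(grads))  # [[11.0, 22.0], [33.0, 44.0]]
```

In ZeRO-2 the two halves carry different payloads: the reduce-scatter moves gradients to their owners, and the all-gather distributes the updated parameters. The amount of data moved per step is still the same as for one all-reduce.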