
Micro-batching Algorithms

Micro-batching splits a large global training batch into smaller micro-batches that are fed through the pipeline one after another, so that different pipeline stages work on different micro-batches concurrently.

Source: mortalapps.com
TL;DR
  • Splits a global training batch into micro-batches so that every pipeline stage has work to do at the same time.
  • Controls the tradeoff between pipeline bubble size (idle compute time) and the activation memory each GPU must hold.
  • Advanced schedules (such as Interleaved 1F1B and Zero Bubble) reorder micro-batch execution to raise hardware throughput further.
  • The micro-batch count and schedule are primary tuning parameters for any Pipeline Parallelism (PP) deployment.

Why This Matters

Without micro-batching, Pipeline Parallelism wastes most of the hardware: GPU 1 would process the entire global batch while GPUs 2 through 8 sit idle waiting for it to finish. By slicing the batch into micro-batches, GPU 1 can hand each finished slice downstream and start on the next one, so the whole pipeline stays occupied. The schedule that decides when each GPU runs a Forward (F) or Backward (B) pass then determines both the peak memory footprint and the overall utilization of the cluster.
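The idle time described above can be estimated with a small sketch. It assumes every micro-batch costs the same on every stage and ignores communication, so each stage is busy for m time slots out of m + p - 1; the helper names `split_into_micro_batches` and `idle_fraction` are illustrative, not part of any framework.

```python
from fractions import Fraction


def split_into_micro_batches(global_batch, num_micro_batches):
    """Split a list of samples into equally sized micro-batches."""
    if len(global_batch) % num_micro_batches != 0:
        raise ValueError("global batch must divide evenly into micro-batches")
    size = len(global_batch) // num_micro_batches
    return [global_batch[i * size:(i + 1) * size] for i in range(num_micro_batches)]


def idle_fraction(num_stages, num_micro_batches):
    """Fraction of wall-clock time a stage spends idle.

    Assumption: uniform per-micro-batch cost on every stage and zero
    communication cost, so one step spans (m + p - 1) slots of which
    (p - 1) are bubble.
    """
    p, m = num_stages, num_micro_batches
    return Fraction(p - 1, m + p - 1)


print(split_into_micro_batches(list(range(8)), 4))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(idle_fraction(8, 1))   # 7/8  (no micro-batching)
print(idle_fraction(8, 16))  # 7/23 (16 micro-batches)
```

Under these assumptions, going from one batch to 16 micro-batches on 8 stages cuts idle time from 7/8 to 7/23 of the step.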

Core Intuition

Consider an 8-stage pipeline ($p = 8$) processing 16 micro-batches ($m = 16$). In a naive schedule such as GPipe, every GPU runs the forward pass for all 16 micro-batches before any backward pass begins. GPU 1 must therefore hold the activations of all 16 forwards until the backward passes work their way back from the last stage. In the 1F1B schedule, once the pipeline has filled, GPU 1 strictly alternates: one forward, one backward. Each backward step frees the activations of its micro-batch immediately, so GPU 1 holds at most 8 micro-batches of activations instead of 16, which is often the difference between fitting in memory and an OOM.
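The memory difference in this example can be checked with a minimal sketch that lists the order of F and B operations on one stage and counts how many micro-batches have run forward but not yet backward. The function names are hypothetical, and the model tracks only operation order on a single stage, not timing or communication.

```python
def gpipe_ops(num_micro_batches):
    """GPipe order on any stage: all forwards, then all backwards."""
    return ["F"] * num_micro_batches + ["B"] * num_micro_batches


def one_f_one_b_ops(stage, num_stages, num_micro_batches):
    """1F1B order on a 0-indexed stage: warm-up forwards, alternating F/B, drain."""
    warmup = min(num_stages - stage - 1, num_micro_batches)
    steady = num_micro_batches - warmup
    return ["F"] * warmup + ["F", "B"] * steady + ["B"] * warmup


def peak_in_flight(ops):
    """Largest number of micro-batches whose activations are held at once."""
    live = peak = 0
    for op in ops:
        live += 1 if op == "F" else -1
        peak = max(peak, live)
    return peak


p, m = 8, 16
print(peak_in_flight(gpipe_ops(m)))                                   # 16
print([peak_in_flight(one_f_one_b_ops(s, p, m)) for s in range(p)])   # [8, 7, 6, 5, 4, 3, 2, 1]
```

The first stage is the worst case for 1F1B because it runs the most warm-up forwards before its first backward arrives.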

Technical Deep Dive

| Micro-batch Schedule | Steady-State Pattern | Peak Memory Factor | Bubble Mitigation Strategy |
| --- | --- | --- | --- |
| GPipe | FFFFF... BBBBB... | Proportional to $m$ | Massive micro-batch counts |
| 1F1B | F, B, F, B, F, B | Proportional to $p$ | Alternating execution |
| Interleaved 1F1B | F, B (virtual mapping) | Variable | Multiple chunks per GPU |
| Zero Bubble | Split B and W passes | Controllable | Shifting W into idle slots |

1F1B (One-Forward-One-Backward): After the warm-up phase, each stage settles into a steady state of alternating F, B, F, B. Peak activation memory is set by the maximum number of micro-batches in flight at once, which is exactly $p$ for the first stage and (assuming $m \ge p$) shrinks by one for each later stage.

Interleaved 1F1B: Each physical stage holds $v$ non-contiguous chunks of the model's layers (virtual stages) instead of one contiguous block. The bubble, measured relative to ideal compute time, drops from $(p - 1)/m$ to $(p - 1)/(v \cdot m)$, but the memory footprint grows and P2P communication volume increases by a factor of $v$.
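The effect of interleaving on the bubble can be tabulated with a short sketch. It assumes the commonly cited ratio of bubble time to ideal compute time, (p - 1) / (v * m), where p is the number of physical stages, m the number of micro-batches, and v the number of virtual chunks per stage; `bubble_ratio` is an illustrative name, and the communication factor is taken to be simply v.

```python
from fractions import Fraction


def bubble_ratio(num_stages, num_micro_batches, virtual_chunks=1):
    """Bubble time divided by ideal compute time (assumed uniform stage cost)."""
    return Fraction(num_stages - 1, virtual_chunks * num_micro_batches)


p, m = 8, 16
for v in (1, 2, 4):
    # Each extra chunk per GPU shrinks the bubble but multiplies P2P traffic.
    print(f"v={v}: bubble ratio {bubble_ratio(p, m, v)}, P2P volume x{v}")
# v=1: bubble ratio 7/16, P2P volume x1
# v=2: bubble ratio 7/32, P2P volume x2
# v=4: bubble ratio 7/64, P2P volume x4
```

Choosing v is therefore a trade between idle time and interconnect load, which is why it tends to pay off only when stage-to-stage links are fast.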

Zero Bubble (ZB-H1/H2): This schedule splits the backward pass into a B pass (gradients with respect to the layer inputs, i.e. activation gradients) and a W pass (gradients with respect to the weights). Only the B pass is needed by the previous stage; the W pass has no inter-stage dependency, so the scheduler can move W passes into the pipeline bubbles that otherwise appear during ramp-up and ramp-down. Peak memory becomes a tunable quantity that depends on how many F passes the schedule is allowed to run ahead of their backward passes.
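The B/W split can be illustrated on a single linear layer y = x W. This is a minimal pure-Python sketch, not the Zero Bubble scheduler itself: `backward_b` and `backward_w` are hypothetical names, and the "idle slot" is simulated by running the deferred work at the end.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def backward_b(grad_out, weight):
    """B pass: gradient w.r.t. the layer input; the previous stage waits on this."""
    return matmul(grad_out, transpose(weight))


def backward_w(saved_input, grad_out):
    """W pass: gradient w.r.t. the weight; no other stage waits on this."""
    return matmul(transpose(saved_input), grad_out)


x = [[1.0, 2.0]]                          # saved forward input, shape 1x2
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]    # weight, shape 2x3
grad_out = [[1.0, 1.0, 1.0]]              # gradient arriving from the next stage

deferred_w = []
grad_in = backward_b(grad_out, w)         # run now and send upstream
deferred_w.append((x, grad_out))          # keep what the W pass needs

# Later, in a slot where this stage would otherwise be idle:
grad_w = [backward_w(xi, gi) for xi, gi in deferred_w][0]

print(grad_in)  # [[3.0, 4.0]]
print(grad_w)   # [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
```

Deferring the W pass means the saved input and the incoming gradient stay in memory until it runs, which is one reason the schedule exposes memory as a tunable quantity.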

Key Takeaways

Micro-batching schedules determine both the throughput and the memory footprint of Pipeline Parallelism.
1F1B schedules alternate forwards and backwards in steady state, which bounds activation memory by the pipeline depth rather than the micro-batch count.
Zero Bubble schedules split the backward pass into B passes and W passes and use deferred W passes to fill otherwise idle time.