Distributed AI Training

Micro-batching Algorithms

Divides a large global training batch into smaller micro-batches that pass through the pipeline stages in sequence, so that different stages can work on different micro-batches at the same time.

Published June 1, 2026 · By MortalApps · 5 min read · ~980 words

TL;DR

Divides a large global training batch into smaller micro-batches that pass through the pipeline stages in sequence, so that different stages can work on different micro-batches at the same time.
Controls the central engineering tradeoff between pipeline bubble size (idle compute time) and the activation memory each GPU must hold.
Advanced schedules (such as Interleaved 1F1B and Zero-Bubble) reorder micro-batch execution to raise hardware throughput further.
Serves as the primary tuning parameter when scaling any Pipeline Parallelism (PP) deployment.


Why This Matters

Without micro-batching, Pipeline Parallelism wastes almost all of its hardware: in an 8-GPU pipeline, GPU 1 would process the entire global batch while GPUs 2 through 8 sit idle waiting for its output. Slicing the batch into micro-batches lets GPU 1 hand each slice downstream as soon as it is done, so later stages start working while earlier ones continue. The schedule that decides when each GPU runs a Forward (F) or Backward (B) pass then determines both the peak memory footprint and the overall utilization of a multi-million dollar GPU cluster.
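The fill behaviour described above can be sketched as a toy timeline. This is a minimal illustration under simplifying assumptions that are not from the article: only forward passes are modelled, every stage takes one time slot per micro-batch, and communication is free. The function names are hypothetical.

```python
from fractions import Fraction


def forward_timeline(num_stages, num_micro_batches):
    """Rows are stages, columns are time slots.

    Each cell holds the index of the micro-batch being processed,
    or None when that stage is idle in that slot.
    """
    p, m = num_stages, num_micro_batches
    slots = m + p - 1  # slots until the last micro-batch leaves the last stage
    grid = [[None] * slots for _ in range(p)]
    for stage in range(p):
        for mb in range(m):
            # Stage s can only start micro-batch k after s earlier hand-offs.
            grid[stage][stage + mb] = mb
    return grid


def idle_share(grid):
    """Fraction of all (stage, slot) cells in which a stage is idle."""
    cells = [cell for row in grid for cell in row]
    return Fraction(cells.count(None), len(cells))


def render(grid):
    """Text view of the timeline: '.' marks an idle slot."""
    return "\n".join(
        " ".join("." if cell is None else str(cell) for cell in row) for row in grid
    )


print(render(forward_timeline(4, 4)))
print(idle_share(forward_timeline(8, 1)))   # one undivided batch on 8 stages
print(idle_share(forward_timeline(8, 16)))  # 16 micro-batches on 8 stages
```

Under these assumptions, an undivided batch leaves each stage idle for 7 of 8 slots, while 16 micro-batches cut the idle share to 7/23.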

Core Intuition

Consider an 8-stage pipeline (p = 8) processing 16 micro-batches (m = 16). In a naive schedule such as GPipe, every GPU runs the forward pass for all 16 micro-batches before any backward pass begins. GPU 1 must therefore hold the activations of all 16 forwards until the backward passes return from the end of the pipeline. In the 1F1B schedule, once the pipeline has filled, each GPU alternates: 1 Forward, 1 Backward. Every backward step frees the activations of its micro-batch immediately, so the number stored at once is bounded by the pipeline depth rather than by the micro-batch count, which lets a large model train without out-of-memory (OOM) failures.
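The memory difference can be checked by counting how many micro-batches have run a forward but not yet a backward on a given stage. The sketch below only models the order of operations on one stage, not their timing. The function names and the 0-based stage index are assumptions for illustration, and the 1F1B warm-up length of `num_stages - stage - 1` forwards follows the usual non-interleaved formulation.

```python
def gpipe_order(num_micro_batches):
    """All forwards first, then all backwards (same order on every stage)."""
    return ["F"] * num_micro_batches + ["B"] * num_micro_batches


def one_f_one_b_order(stage, num_stages, num_micro_batches):
    """Warm-up forwards, then alternating F/B pairs, then drain backwards.

    `stage` is 0-based, so stage 0 is the first GPU in the pipeline.
    """
    warmup = min(num_stages - stage - 1, num_micro_batches)
    steady = num_micro_batches - warmup
    return ["F"] * warmup + ["F", "B"] * steady + ["B"] * warmup


def peak_in_flight(order):
    """Largest number of micro-batches whose activations are held at once."""
    live = peak = 0
    for op in order:
        live += 1 if op == "F" else -1  # F stores activations, B releases them
        peak = max(peak, live)
    return peak


p, m = 8, 16
print("GPipe, first stage:", peak_in_flight(gpipe_order(m)))
print("1F1B, first stage:", peak_in_flight(one_f_one_b_order(0, p, m)))
print("1F1B, last stage:", peak_in_flight(one_f_one_b_order(p - 1, p, m)))
```

For the 8-stage, 16-micro-batch example, the first stage holds 16 sets of activations under GPipe but only 8 under 1F1B, and the last 1F1B stage holds just 1.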

Technical Deep Dive

Micro-batch Schedule	Steady-State Pattern	Peak Memory Factor	Bubble Mitigation Strategy
GPipe	FFFFF... BBBBB...	Grows with micro-batch count (m)	Massive micro-batch counts
1F1B	F, B, F, B, F, B	Bounded by pipeline depth (p)	Alternating execution
Interleaved 1F1B	F, B (Virtual mapping)	Variable	Multiple chunks per GPU
Zero Bubble	Split B and W passes	Controllable	Shifting W into idle slots

1F1B (One-Forward-One-Backward): After the initial ramp-up phase, the pipeline reaches a steady state of alternating F, B, F, B. Peak activation memory is set by the maximum number of micro-batches concurrently in flight, which for the first stage equals the pipeline depth p (assuming at least p micro-batches).

Interleaved 1F1B: The model's layers are divided into v virtual chunks per physical stage, so each GPU hosts v non-contiguous slices of the model. With p stages and m micro-batches, the bubble relative to ideal compute time drops to (p - 1) / (v · m), but the memory footprint expands, and P2P communication volume increases linearly, by a factor of v.
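The interleaving tradeoff can be put into numbers with a short calculation. This is a sketch, assuming the commonly cited bubble-to-compute ratio of (p - 1) / (v · m) for p stages, m micro-batches, and v virtual chunks per GPU, with equal compute time per stage and free communication. The helper names are hypothetical.

```python
from fractions import Fraction


def bubble_ratio(num_stages, num_micro_batches, virtual_chunks=1):
    """Bubble time divided by ideal compute time: (p - 1) / (v * m)."""
    return Fraction(num_stages - 1, virtual_chunks * num_micro_batches)


def idle_share_from_ratio(ratio):
    """Convert a bubble-to-compute ratio into the idle share of the whole step."""
    return ratio / (1 + ratio)


p, m = 8, 16
for v in (1, 2, 4):
    ratio = bubble_ratio(p, m, v)
    share = idle_share_from_ratio(ratio)
    # P2P volume scales with v because activations cross GPUs once per chunk.
    print(f"v={v}: bubble ratio {ratio}, idle share {float(share):.1%}, P2P volume x{v}")
```

Each doubling of v halves the bubble ratio while doubling the point-to-point traffic, which is why the best chunk count depends on interconnect bandwidth as much as on compute.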

Zero Bubble (ZB-H1/H2): This scheduling algorithm splits the backward pass into a B pass (activation gradients) and a W pass (weight gradients). Only the B pass is needed by the previous stage; the W pass has no inter-stage dependencies, so the scheduler can move W passes into the pipeline bubbles that would otherwise sit empty during the ramp-up and ramp-down phases. Peak memory becomes a tunable quantity that depends on how many F passes the engineer allows to be scheduled eagerly.
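The B/W split is easiest to see on a single linear layer y = x · W. The sketch below uses plain Python lists in place of tensors and a simple list as the deferred-work queue. It illustrates the dependency structure only, not how any particular framework implements Zero Bubble, and all names are hypothetical.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def b_pass(grad_out, weight):
    """Activation gradient dL/dx = dL/dy @ W^T; the previous stage waits for this."""
    return matmul(grad_out, transpose(weight))


def w_pass(saved_input, grad_out):
    """Weight gradient dL/dW = x^T @ dL/dy; uses only tensors local to this stage."""
    return matmul(transpose(saved_input), grad_out)


x = [[1.0, 2.0]]                             # saved forward input, shape 1x2
weight = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]  # layer weight, shape 2x3
grad_out = [[1.0, 1.0, 1.0]]                 # gradient from the next stage, shape 1x3

deferred_w = []

# B pass runs immediately so the previous stage can continue its own backward.
grad_x = b_pass(grad_out, weight)
# W pass is queued; its inputs must stay in memory until it runs.
deferred_w.append((x, grad_out))

# Later, in a slot where this stage would otherwise be idle:
grad_weight = [w_pass(saved, grad) for saved, grad in deferred_w][0]

print(grad_x)
print(grad_weight)
```

The queue makes the memory cost visible: every deferred W pass keeps its saved input and incoming gradient alive until it executes, so deferring more work fills more bubbles at the price of a larger footprint.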

Key Takeaways

Micro-batching algorithms set the execution rhythm, and therefore the efficiency, of Pipeline Parallelism.

1F1B schedules alternate forwards and backwards in steady state, keeping the activation memory footprint bounded by pipeline depth.

ZB-PP splits each backward into a distinct B pass and W pass, using deferred W passes to fill hardware idle time.
