Micro-batching Algorithms
Micro-batching splits a large global training batch into smaller micro-batches that stream through the pipeline stages one after another, so that different stages work on different micro-batches at the same time.
Source: mortalapps.com
- Controls the engineering tradeoff between pipeline bubble size (idle compute time) and the activation memory held on each GPU.
- Advanced schedules (such as Interleaved 1F1B and Zero-Bubble) reorder micro-batch execution to raise hardware throughput further.
- Serves as a primary tuning parameter when scaling any Pipeline Parallelism (PP) deployment.
Why This Matters
Without micro-batching, Pipeline Parallelism wastes most of the hardware: in an 8-GPU pipeline, GPU 1 would process the entire global batch while GPUs 2 through 8 sit idle waiting for its output. By slicing the batch into micro-batches, GPU 1 can hand each finished micro-batch downstream and immediately start the next one, so the rest of the pipeline stays occupied. The schedule that decides when each GPU runs a Forward (F) or Backward (B) pass then determines both the activation memory footprint and the overall hardware utilization of the multi-million dollar GPU cluster.
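
To put a number on the idle time, the sketch below computes the share of each GPU's timeline spent waiting. It assumes that every stage takes the same time per micro-batch and that communication is free; the function name `bubble_fraction` and its parameters are illustrative and not taken from any library.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle share of one stage's timeline for a GPipe- or 1F1B-style schedule.

    Assumption: uniform per-micro-batch compute time on every stage and
    zero communication cost. Each stage is busy for `num_microbatches`
    slots out of `num_microbatches + num_stages - 1` total slots.
    """
    p, m = num_stages, num_microbatches
    if p < 1 or m < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (p - 1) / (m + p - 1)


if __name__ == "__main__":
    # One undivided batch (m=1) versus progressively finer micro-batching.
    for m in (1, 4, 16, 64):
        print(f"stages=8 microbatches={m:>2} idle={bubble_fraction(8, m):.1%}")
```

With 8 stages, a single undivided batch leaves each GPU idle 87.5% of the time; 16 micro-batches bring that down to roughly 30%.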
Core Intuition
Consider an 8-stage pipeline ($p = 8$) processing 16 micro-batches ($m = 16$). In a naive schedule such as GPipe, every GPU runs all 16 Forwards before it runs any Backward. GPU 1 must therefore hold the activations of all 16 Forwards in memory until the backward passes return from the last stage of the pipeline. In the 1F1B schedule, once the pipeline has filled, GPU 1 strictly alternates: 1 Forward, 1 Backward. Each Backward immediately frees the activations of its micro-batch, which caps the number of micro-batches held in memory and lets a large model train without running out of memory (OOM).
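
The memory difference can be checked by counting how many micro-batches a stage holds at once. This is a minimal sketch: the helper names are hypothetical, and the 1F1B warm-up count of `num_stages - stage - 1` forwards is the commonly used non-interleaved formulation, assumed here rather than taken from the text above.

```python
def gpipe_stage_ops(num_microbatches: int) -> list[str]:
    """GPipe order on any stage: all forwards, then all backwards."""
    return ["F"] * num_microbatches + ["B"] * num_microbatches


def one_f_one_b_stage_ops(stage: int, num_stages: int, num_microbatches: int) -> list[str]:
    """1F1B order on one stage (0-indexed): warm-up, steady state, cool-down.

    Assumption: warm-up runs `num_stages - stage - 1` forwards, capped at
    the number of micro-batches.
    """
    warmup = min(num_stages - stage - 1, num_microbatches)
    steady = num_microbatches - warmup
    return ["F"] * warmup + ["F", "B"] * steady + ["B"] * warmup


def peak_in_flight(ops: list[str]) -> int:
    """Largest number of micro-batches whose activations are alive at once."""
    live = peak = 0
    for op in ops:
        live += 1 if op == "F" else -1  # F stores activations, B releases them
        peak = max(peak, live)
    return peak


if __name__ == "__main__":
    p, m = 8, 16
    print("GPipe, stage 0:", peak_in_flight(gpipe_stage_ops(m)))
    print("1F1B,  stage 0:", peak_in_flight(one_f_one_b_stage_ops(0, p, m)))
    print("1F1B,  stage 7:", peak_in_flight(one_f_one_b_stage_ops(p - 1, p, m)))
```

For the 8-stage, 16-micro-batch example, GPU 1 (stage 0) peaks at 16 stored micro-batches under GPipe and 8 under 1F1B, while the last stage under 1F1B never holds more than one.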
Technical Deep Dive
| Micro-batch Schedule | Steady-State Pattern | Peak Memory Factor | Bubble Mitigation Strategy |
|---|---|---|---|
| GPipe | FFFFF... BBBBB... | Proportional to $m$ (number of micro-batches) | Massive micro-batch counts |
| 1F1B | F, B, F, B, F, B | Proportional to $p$ (number of pipeline stages) | Alternating execution |
| Interleaved 1F1B | F, B (Virtual mapping) | Variable | Multiple chunks per GPU |
| Zero Bubble | Split B and W passes | Controllable | Shifting W into idle slots |

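
1F1B (One-Forward-One-Backward): After the initial ramp-up phase, the pipeline reaches a steady state of alternating F, B, F, B. The peak memory footprint is set by the maximum number of micro-batches concurrently in flight, which on the first stage is at most the number of pipeline stages ($p$) rather than the total number of micro-batches.
Interleaved 1F1B: The model's layers are divided into multiple chunks per GPU, so each GPU hosts several virtual pipeline stages instead of one contiguous block of layers.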
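
One way to picture the interleaved layout is as a round-robin assignment of layer chunks to GPUs. The sketch below assumes that virtual stage `s` is placed on device `s % num_devices`, which is a common convention for interleaved schedules; the function name and the requirement that layers divide evenly are illustrative choices, not details from the text above.

```python
def interleaved_layer_placement(
    num_layers: int, num_devices: int, chunks_per_device: int
) -> dict[int, tuple[int, int]]:
    """Map each layer index to (device, local_chunk) for an interleaved pipeline.

    Assumption: layers split evenly into num_devices * chunks_per_device
    virtual stages, and virtual stages are dealt to devices round-robin.
    """
    num_virtual_stages = num_devices * chunks_per_device
    if num_layers % num_virtual_stages != 0:
        raise ValueError("num_layers must divide evenly into virtual stages")
    layers_per_chunk = num_layers // num_virtual_stages

    placement = {}
    for layer in range(num_layers):
        virtual_stage = layer // layers_per_chunk
        device = virtual_stage % num_devices        # round-robin over GPUs
        local_chunk = virtual_stage // num_devices  # which chunk on that GPU
        placement[layer] = (device, local_chunk)
    return placement


if __name__ == "__main__":
    placement = interleaved_layer_placement(num_layers=16, num_devices=4, chunks_per_device=2)
    for device in range(4):
        layers = [layer for layer, (d, _) in placement.items() if d == device]
        print(f"device {device}: layers {layers}")
```

With 16 layers, 4 devices, and 2 chunks per device, device 0 holds layers 0, 1, 8, and 9, so a micro-batch visits every device twice per forward pass.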
Zero Bubble (ZB-H1/H2): This scheduling algorithm splits the backward pass into a B pass (activation gradients, i.e. gradients with respect to the layer inputs, which the previous stage is waiting for) and a W pass (weight gradients). Since W passes have no inter-stage dependencies, the scheduler can move them into the pipeline bubbles that would otherwise sit empty during the ramp-up and ramp-down phases. Peak memory becomes a controllable variable that depends on how many F passes the engineer allows to be scheduled eagerly.
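
The B/W split can be illustrated on a single linear layer `y = x @ W`. This is a dependency-free sketch in plain Python, not the Zero Bubble implementation: the helper names are hypothetical, and a real framework would use tensor operations rather than nested lists.

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def backward_input(grad_out: Matrix, weight: Matrix) -> Matrix:
    """B pass: dL/dx = dL/dy @ W^T. The previous stage needs this right away."""
    return matmul(grad_out, transpose(weight))


def backward_weight(saved_input: Matrix, grad_out: Matrix) -> Matrix:
    """W pass: dL/dW = x^T @ dL/dy. Only needed before the optimizer step."""
    return matmul(transpose(saved_input), grad_out)


if __name__ == "__main__":
    x = [[1.0, 2.0]]                         # saved forward input (1 x 2)
    w = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]   # layer weights (2 x 3)
    grad_y = [[1.0, 1.0, 1.0]]               # gradient arriving from the next stage

    grad_x = backward_input(grad_y, w)       # run immediately, send upstream
    # ... other micro-batches' F and B passes could run here ...
    grad_w = backward_weight(x, grad_y)      # deferred into an idle slot
    print(grad_x, grad_w)
```

Note that deferring the W pass keeps `x` and `grad_y` alive until it runs, which is one reason peak memory depends on how aggressively work is rescheduled.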

