Pipeline Bubble Elimination

Pipeline Parallelism (PP) distributes model layers across multiple nodes but introduces severe idle periods known as "bubbles."

Source: mortalapps.com
TL;DR
  • Pipeline Parallelism (PP) places consecutive groups of model layers on different nodes, but each stage must wait for its neighbors, which creates idle periods known as "bubbles."
  • Interleaved 1F1B schedules give each GPU several smaller, non-contiguous layer chunks so that computation overlaps and the bubbles shrink.
  • TD-Pipe (Temporally-Disaggregated Pipeline) entirely decouples prefill and decode to eliminate phase-switching bubbles.
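
The chunked placement behind interleaved schedules can be sketched as a round-robin assignment of layer chunks to devices. This is a minimal illustration only: the function name `assign_chunks` and its parameters are hypothetical and do not come from any particular framework.

```python
def assign_chunks(num_layers: int, num_devices: int,
                  chunks_per_device: int) -> dict[int, list[range]]:
    """Round-robin layer chunks over devices (interleaved placement sketch)."""
    num_chunks = num_devices * chunks_per_device
    if num_layers % num_chunks != 0:
        raise ValueError("num_layers must divide evenly into chunks")
    chunk_size = num_layers // num_chunks

    placement: dict[int, list[range]] = {d: [] for d in range(num_devices)}
    for chunk in range(num_chunks):
        start = chunk * chunk_size
        # Chunk i goes to device i % num_devices, so each device ends up
        # with several non-contiguous slices of the model.
        placement[chunk % num_devices].append(range(start, start + chunk_size))
    return placement


placement = assign_chunks(num_layers=16, num_devices=4, chunks_per_device=2)
for device, chunks in placement.items():
    print(device, [list(c) for c in chunks])
```

With 16 layers, 4 devices, and 2 chunks per device, device 0 holds layers 0-1 and 8-9 rather than one contiguous block of four, so it can start its second chunk while later stages are still busy with the first.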

Why This Matters

When a 70B+ model no longer fits in a single server's GPU memory and must span multiple machines, pipeline parallelism is usually the practical way to split it, because only activations pass between adjacent stages. However, a naive pipeline schedule can leave GPUs idle for up to 50% of the processing time, which directly erodes the return on multi-million dollar data center hardware.
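
To make the idle-time cost concrete, the sketch below computes the textbook bubble fraction for a fill-and-drain pipeline. It assumes an idealized model (every stage takes the same time per micro-batch, communication is free); the function name and parameters are illustrative, and real inference workloads will deviate from these numbers.

```python
def bubble_fraction(num_stages: int, num_microbatches: int,
                    chunks_per_device: int = 1) -> float:
    """Idle share of total pipeline time under idealized assumptions.

    Plain schedule:       (p - 1) / (m + p - 1)
    Interleaved schedule: (p - 1) / (v * m + p - 1), with v chunks per device
    """
    p, m, v = num_stages, num_microbatches, chunks_per_device
    if p < 1 or m < 1 or v < 1:
        raise ValueError("all arguments must be positive integers")
    return (p - 1) / (v * m + p - 1)


# Two stages and a single micro-batch: each GPU is idle half the time.
print(f"{bubble_fraction(2, 1):.2f}")
# Four stages, four micro-batches.
print(f"{bubble_fraction(4, 4):.2f}")
# Same setup with two interleaved chunks per device.
print(f"{bubble_fraction(4, 4, chunks_per_device=2):.2f}")
```

The formula shows the two standard levers: more micro-batches in flight, or more chunks per device, both shrink the idle share.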

Core Intuition

Think of a car assembly line. If the engine installation (prefill) takes 1 hour, and painting (decode) takes 5 minutes, placing them sequentially on the same line causes massive traffic jams and idle workers. TD-Pipe temporally separates them: the line exclusively installs engines for days, stores them, and then exclusively paints them later, ensuring neither station ever waits.

Technical Deep Dive

TD-Pipe completely decouples prefill and decode in the temporal dimension. Because large prefill batches take far longer to clear the pipeline than decode micro-batches, mixing the two in one schedule makes stage times uneven and exacerbates bubbles. TD-Pipe instead stays in the efficient prefill phase and accumulates KV caches, using a BERT-based predictor of output token lengths to greedily admit as many prefills as the KV-cache memory is expected to hold. It then runs decode, and switches back to prefill only when Spatial Intensity (decode performance relative to peak capacity) drops below Temporal Intensity (the penalty of switching phases).
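
The intensity comparison above can be sketched as a small decision function. Everything here is an assumption made for illustration: the two formulas, the `DecodeState` fields, and the function names are not taken from the TD-Pipe implementation. Spatial intensity is approximated as the active batch size relative to the batch size at which decode saturates, and temporal intensity as the share of useful time that remains after paying the switching bubble, so that both values lie between 0 and 1 and can be compared directly.

```python
from dataclasses import dataclass


@dataclass
class DecodeState:
    active_requests: int          # requests still generating tokens
    peak_requests: int            # batch size at which decode saturates (assumed known)
    est_phase_seconds: float      # estimated useful time of the next phase (assumed)
    switch_bubble_seconds: float  # estimated idle time paid for one switch (assumed)


def spatial_intensity(s: DecodeState) -> float:
    """Hypothetical proxy: how close the decode batch is to peak capacity."""
    return min(1.0, s.active_requests / s.peak_requests)


def temporal_intensity(s: DecodeState) -> float:
    """Hypothetical proxy: efficiency left after paying the switching bubble."""
    return s.est_phase_seconds / (s.est_phase_seconds + s.switch_bubble_seconds)


def should_switch_phase(s: DecodeState) -> bool:
    # Switch only when staying put wastes more than the switch itself would.
    return spatial_intensity(s) < temporal_intensity(s)


busy = DecodeState(active_requests=250, peak_requests=256,
                   est_phase_seconds=60.0, switch_bubble_seconds=3.0)
draining = DecodeState(active_requests=128, peak_requests=256,
                       est_phase_seconds=60.0, switch_bubble_seconds=3.0)
print(should_switch_phase(busy), should_switch_phase(draining))
```

In this toy setup a nearly full decode batch keeps the current phase, while a half-empty one triggers a switch because the lost decode efficiency outweighs the bubble.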

Key Takeaways

  • Standard PP creates large idle bubbles because each stage must wait for the layers before it.
  • Interleaved PP overlaps execution by assigning multiple non-contiguous layer chunks to a single GPU.
  • Mixing prefill and decode in pipelines worsens bubbles; temporal disaggregation (TD-Pipe) separates them entirely.
  • A learned output-length predictor and a spatial-versus-temporal intensity comparison determine when to switch phases.
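
One way the output-length prediction could feed into scheduling is a greedy admission loop that keeps accepting prefills while the projected KV-cache footprint fits in memory. The sketch below is an assumption-laden illustration: `admit_prefills`, the token-based budget, and the constant stand-in predictor are all hypothetical, and a real system would call a learned model and account for memory far more precisely.

```python
from typing import Callable


def admit_prefills(waiting_prompt_lens: list[int],
                   kv_budget_tokens: int,
                   predict_output_len: Callable[[int], int]) -> tuple[list[int], int]:
    """Greedily admit requests while projected KV-cache usage fits the budget.

    Conservative simplification: every admitted request is assumed to hold
    its prompt plus its full predicted output in the cache at the same time.
    """
    admitted: list[int] = []
    projected_tokens = 0
    for prompt_len in waiting_prompt_lens:
        need = prompt_len + predict_output_len(prompt_len)
        if projected_tokens + need > kv_budget_tokens:
            break  # the next request would overflow the cache; stop prefilling
        admitted.append(prompt_len)
        projected_tokens += need
    return admitted, projected_tokens


def constant_predictor(prompt_len: int) -> int:
    """Stand-in for a learned predictor: always guesses 128 output tokens."""
    return 128


admitted, projected = admit_prefills([512, 1024, 256, 2048],
                                     kv_budget_tokens=2500,
                                     predict_output_len=constant_predictor)
print(admitted, projected)
```

Here the first three requests are admitted and the fourth is deferred, since its prompt plus predicted output would push the projected cache usage past the budget.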