Pipeline Bubble Elimination
Pipeline Parallelism (PP) distributes model layers across multiple nodes but introduces severe idle periods known as "bubbles."
Source: mortalapps.com
- Interleaved 1F1B (one-forward-one-backward) schedules give each device several smaller model chunks, so work reaches every stage sooner and more computation overlaps.
- TD-Pipe (Temporally-Disaggregated Pipeline) entirely decouples prefill and decode to eliminate phase-switching bubbles.
Why This Matters
When 70B+ models must span multiple physical servers because of memory constraints, pipeline parallelism is effectively mandatory. However, naive pipelines can leave downstream GPUs idle for up to 50% of the processing time, which badly erodes the return on investment of multi-million-dollar data centers.
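To make the idle-time figure concrete, here is a minimal sketch of the textbook bubble estimate for a synchronous pipeline. It assumes every micro-batch takes the same time on every stage, which real prefill and decode workloads do not satisfy. The function name and parameters are illustrative, and the source does not say how its 50% figure was derived. This is simply one configuration that produces it.

```python
def bubble_fraction(stages: int, micro_batches: int, chunks_per_device: int = 1) -> float:
    """Share of total pipeline time that a device spends idle.

    Assumption: uniform per-stage time for every micro-batch.
    With p stages and m micro-batches, the ramp-up and drain cost (p - 1)
    stage-times; splitting each device's layers into v interleaved chunks
    shrinks that cost to (p - 1) / v.
    """
    if stages < 1 or micro_batches < 1 or chunks_per_device < 1:
        raise ValueError("all arguments must be positive integers")
    bubble = (stages - 1) / chunks_per_device  # idle time, in stage-time units
    return bubble / (micro_batches + bubble)   # idle time / total time


if __name__ == "__main__":
    # 4 stages fed only 3 micro-batches: half of the time is bubble.
    print(f"plain schedule:      {bubble_fraction(4, 3):.2%}")
    # Same setup with 2 interleaved chunks per device.
    print(f"interleaved (v=2):   {bubble_fraction(4, 3, 2):.2%}")
    # Many more micro-batches amortize the bubble.
    print(f"32 micro-batches:    {bubble_fraction(4, 32):.2%}")
```

Under these assumptions the bubble shrinks either by feeding the pipeline more micro-batches or by interleaving smaller chunks. Neither lever helps when prefill and decode micro-batches have very different durations.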
Core Intuition
Think of a car assembly line. If the engine installation (prefill) takes 1 hour, and painting (decode) takes 5 minutes, placing them sequentially on the same line causes massive traffic jams and idle workers. TD-Pipe temporally separates them: the line exclusively installs engines for days, stores them, and then exclusively paints them later, ensuring neither station ever waits.
Technical Deep Dive
TD-Pipe decouples prefill and decode completely in the temporal dimension. Because large prefill batches take far longer to clear the pipeline than decode micro-batches, mixing the two phases exacerbates bubbles. TD-Pipe therefore locks into the highly efficient prefill phase, storing the resulting KV caches. It uses a BERT-based predictor to estimate future output token lengths, which lets it greedily admit more prefill work. Once decoding, it switches phase only when Spatial Intensity (decode performance relative to peak capacity) drops below Temporal Intensity (a measure of the penalty incurred by switching).
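The intensity comparison described above can be sketched as a small decision function. This is an illustration only: the source does not give formulas, so both definitions below are assumptions. Spatial intensity is approximated as the fraction of the peak decode batch that is still active. Temporal intensity is approximated as the share of time that stays productive once the switching bubble is amortized over the time already spent in the phase. The class `DecodeState` and all field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DecodeState:
    """Hypothetical snapshot of a running decode phase."""
    active_requests: int    # requests still generating tokens
    peak_requests: int      # batch size at which decode ran at peak capacity
    phase_time_s: float     # seconds spent in the current decode phase
    switch_bubble_s: float  # estimated idle seconds caused by one phase switch

    def __post_init__(self) -> None:
        if self.peak_requests <= 0:
            raise ValueError("peak_requests must be positive")
        if self.phase_time_s < 0 or self.switch_bubble_s < 0:
            raise ValueError("durations must be non-negative")


def spatial_intensity(state: DecodeState) -> float:
    # Assumed proxy for "decode performance vs peak capacity".
    return state.active_requests / state.peak_requests


def temporal_intensity(state: DecodeState) -> float:
    # Assumed proxy for the switching penalty: the longer the phase has run,
    # the better the one-off bubble is amortized, so this value approaches 1.
    total = state.phase_time_s + state.switch_bubble_s
    return state.phase_time_s / total if total > 0 else 0.0


def should_switch_phase(state: DecodeState) -> bool:
    # Switch only once staying put is less efficient than paying the bubble.
    return spatial_intensity(state) < temporal_intensity(state)


if __name__ == "__main__":
    early = DecodeState(active_requests=64, peak_requests=64,
                        phase_time_s=1.0, switch_bubble_s=0.5)
    late = DecodeState(active_requests=16, peak_requests=64,
                       phase_time_s=10.0, switch_bubble_s=0.5)
    print(should_switch_phase(early), should_switch_phase(late))
```

In this sketch the two curves move in opposite directions: spatial intensity falls as requests finish, while temporal intensity rises as the phase gets longer, so the comparison yields a single crossover point rather than requiring a hand-tuned threshold.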