
Lossless Ethernet and Packet Trimming


Source: mortalapps.com
TL;DR
  • AI workloads rely heavily on RDMA over Converged Ethernet (RoCEv2), which traditionally mandates a "lossless" fabric utilizing Priority Flow Control (PFC) and Explicit Congestion Notification (ECN).
  • PFC can cause severe secondary problems, including pause storms, victim flows, and complete network deadlocks at high node counts.
  • Packet Trimming is a newer congestion mechanism in which a switch truncates a packet that would otherwise be dropped, discarding the payload but forwarding the header.
  • Trimming provides immediate, explicit congestion signaling, enabling selective retransmission without the heavy performance penalty of Go-Back-N recovery.

Why This Matters

When an Ethernet switch buffer overflows, the switch traditionally drops packets silently. In an RDMA environment that uses Go-Back-N recovery, a single dropped packet forces the sender to resend the lost packet and every packet sent after it, which sharply reduces AI training throughput. PFC prevents drops by pausing the upstream sender, but pauses can propagate hop by hop and stall large parts of the cluster, a failure mode known as a pause storm. Packet Trimming replaces this blunt pausing with a precise, immediate signal that identifies exactly which packet lost its payload.
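The cost difference between the two recovery styles can be made concrete with a small count. The sketch below is a simplified model, not an RDMA implementation: it assumes a single round of recovery in which retransmitted packets are not lost again, and the function names are hypothetical.

```python
def go_back_n_resend(window: list[int], lost: set[int]) -> list[int]:
    """Sequence numbers resent under Go-Back-N: the first lost packet
    and every packet sent after it (simplified, single recovery round)."""
    if not lost:
        return []
    first_lost = min(lost)
    return [seq for seq in window if seq >= first_lost]


def selective_resend(window: list[int], lost: set[int]) -> list[int]:
    """Sequence numbers resent under selective retransmission:
    only the packets that were actually lost."""
    return [seq for seq in window if seq in lost]


# Hypothetical scenario: 100 packets in flight, packet 10 is lost.
window = list(range(100))
lost = {10}

gbn = go_back_n_resend(window, lost)
sel = selective_resend(window, lost)
print(len(gbn), len(sel))  # prints: 90 1
```

In this scenario one lost packet costs 90 retransmissions under Go-Back-N and a single retransmission under selective recovery, which is the gap that trimming is meant to close.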

Core Intuition

Imagine a postal system where mail trucks (data packets) arrive at a completely full sorting facility.

In Standard Ethernet (Drop), the facility simply destroys the truck. The sender only realizes it weeks later when the recipient asks where it is, forcing the sender to recreate and resend everything.

In a PFC network (Halt), the facility tells the highway patrol to stop all incoming traffic, causing miles of gridlock (pause storms).

In Packet Trimming, the facility empties the cargo (payload) into the trash, but sends the empty truck and the shipping manifest (header) through an express lane directly to the destination. The destination immediately sees the empty truck and tells the sender exactly which specific cargo to resend (Selective Retransmission).

Technical Deep Dive

Legacy RoCEv2 congestion management relies on two interconnected mechanisms. Explicit Congestion Notification (ECN) acts as a proactive signal: when a switch queue exceeds a configured threshold, the switch sets the ECN field in the IP header to Congestion Experienced (CE). The receiver replies with a Congestion Notification Packet (CNP), and the sender applies the DCQCN algorithm to throttle its transmission rate. Priority Flow Control (PFC) acts as a reactive safety net: if the queue keeps filling to the Xoff threshold, the switch sends a per-priority PAUSE frame to its upstream neighbor, which stops transmitting that traffic class on the link until the queue drains. Because a paused neighbor's own buffers then fill and trigger further pauses, poorly tuned PFC thresholds can lead to sudden throughput collapse.
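The interaction of the two thresholds can be sketched as a small queue model. This is an illustration under stated assumptions, not switch firmware: the class name `PfcQueue`, the byte values, and the single hard ECN threshold are all hypothetical (real deployments often mark probabilistically between two thresholds), and the Xon threshold models the point at which the pause is lifted. DCQCN's rate adjustment at the sender is not modeled.

```python
from dataclasses import dataclass


@dataclass
class PfcQueue:
    """Toy model of one priority queue on a switch port (all sizes in bytes)."""
    ecn_threshold: int    # hypothetical: mark CE at or above this depth
    xoff_threshold: int   # hypothetical: ask upstream to pause at or above this depth
    xon_threshold: int    # hypothetical: ask upstream to resume at or below this depth
    depth: int = 0
    upstream_paused: bool = False

    def enqueue(self, size: int) -> dict:
        """Add a packet; report whether it is CE-marked and whether a PAUSE is sent."""
        self.depth += size
        ce_marked = self.depth >= self.ecn_threshold
        send_pause = False
        if not self.upstream_paused and self.depth >= self.xoff_threshold:
            self.upstream_paused = True
            send_pause = True
        return {"ce_marked": ce_marked, "send_pause": send_pause}

    def dequeue(self, size: int) -> bool:
        """Remove a packet; return True if a resume is sent upstream."""
        self.depth = max(0, self.depth - size)
        if self.upstream_paused and self.depth <= self.xon_threshold:
            self.upstream_paused = False
            return True
        return False


q = PfcQueue(ecn_threshold=3000, xoff_threshold=6000, xon_threshold=2000)
events = [q.enqueue(1500) for _ in range(5)]
marks = [e["ce_marked"] for e in events]
pauses = [e["send_pause"] for e in events]
resumes = [q.dequeue(1500) for _ in range(4)]
print(marks)    # [False, True, True, True, True]
print(pauses)   # [False, False, False, True, False]
print(resumes)  # [False, False, False, True]
```

ECN marking starts well before the pause fires, which is the intended ordering: the sender should slow down in response to CNPs so that the Xoff threshold is rarely reached.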

To avoid PFC deadlocks, Ultra Ethernet Transport (UET) supports Packet Trimming. When a switch queue overflows, instead of dropping the packet entirely or pausing the upstream link, the switch truncates the packet to 64 bytes. It retains the L2/L3/L4 and UET headers, rewrites the DSCP field to a value that marks the packet as trimmed, and immediately places the truncated packet into a dedicated high-priority egress queue. The receiver therefore learns which packet lost its payload almost immediately and can request retransmission of that packet alone.

Key Takeaways

  • PFC and DCQCN require careful tuning and leave the fabric susceptible to pause storms and deadlocks.
  • RoCEv2's Go-Back-N retransmission heavily penalizes performance when even a single packet is lost in the fabric.
  • Packet Trimming truncates congested packets to 64 bytes and forwards the header, so the receiver learns of the loss right away.
  • This enables precise Selective Retransmission, keeping latency low and throughput high during severe incast events.