Managing Lossless Ethernet
The Ethernet Evolution
Traditional Ethernet was built on the principle of best-effort delivery. If a switch buffer overflowed, the switch dropped packets and relied on TCP retransmissions to fill the gaps. For RDMA-based AI clusters, this "drop-and-retry" cycle introduces millisecond-level tail latencies that cripple performance. **Priority Flow Control (PFC)** prevents congestion drops for selected traffic classes, and **Enhanced Transmission Selection (ETS)** divides link bandwidth among classes; together they transform lossy Ethernet into a high-performance lossless fabric.
PFC operates at Layer 2 to pause specific traffic classes when buffers fill up (XOFF). This prevents frame loss without blocking the entire physical link, maintaining a 'lossless' environment for RDMA.
PFC: Priority Flow Control (802.1Qbb)
PFC operates at the link layer to provide flow control independently for each of the eight traffic classes. When a downstream switch’s buffer reaches a critical threshold (XOFF), it sends a PAUSE frame for that specific class ID (e.g., Priority 3 for RDMA).
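As a rough illustration of what such a per-class PAUSE looks like on the wire, the sketch below packs a PFC frame using the standard MAC Control layout: destination 01:80:C2:00:00:01, EtherType 0x8808, opcode 0x0101, a class-enable vector, and eight 16-bit timers measured in quanta of 512 bit times. The function name `build_pfc_frame` and its arguments are illustrative assumptions, not part of any vendor API, and the frame check sequence is left to the hardware.

```python
import struct

PFC_DEST_MAC = bytes.fromhex("0180c2000001")  # reserved MAC Control multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101


def build_pfc_frame(src_mac: bytes, pause_quanta: dict) -> bytes:
    """Build a PFC frame without the FCS (hypothetical helper).

    pause_quanta maps a priority (0-7) to a pause time in quanta.
    One quantum is 512 bit times; a value of 0 asks the sender to resume (XON).
    """
    if len(src_mac) != 6:
        raise ValueError("source MAC must be 6 bytes")
    enable_vector = 0
    timers = [0] * 8
    for priority, quanta in pause_quanta.items():
        if not 0 <= priority <= 7:
            raise ValueError("priority must be in 0..7")
        if not 0 <= quanta <= 0xFFFF:
            raise ValueError("pause time must fit in 16 bits")
        enable_vector |= 1 << priority  # mark this priority's timer as valid
        timers[priority] = quanta
    header = PFC_DEST_MAC + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    payload = struct.pack("!HH8H", PFC_OPCODE, enable_vector, *timers)
    # Pad to the 60-byte minimum frame size (64 bytes once the FCS is appended).
    return (header + payload).ljust(60, b"\x00")


# Pause priority 3 (the RDMA class in this article) for the maximum time.
frame = build_pfc_frame(bytes.fromhex("020000000001"), {3: 0xFFFF})
```

Only bit 3 of the enable vector is set here, so the other seven priorities on the link keep flowing.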
XOFF/XON Thresholds
The XOFF threshold triggers a pause, while the XON threshold signals the sender to resume. At 800G, these thresholds must be tuned precisely: set XOFF too low and buffer space sits unused; set it too high and frames already in flight can overflow the buffer and be dropped.
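The two thresholds form a hysteresis loop, which a few lines of Python can make concrete. This is a minimal sketch: the function name `pause_signals`, the occupancy samples, and the threshold values are all made up for illustration and are not taken from any switch implementation.

```python
def pause_signals(occupancy_samples, xoff, xon):
    """Return (sample_index, signal) pairs for each XOFF/XON transition.

    A pause starts when occupancy reaches xoff and ends only when
    occupancy has drained down to xon (hysteresis).
    """
    if xon >= xoff:
        raise ValueError("XON must be below XOFF")
    paused = False
    events = []
    for index, occupancy in enumerate(occupancy_samples):
        if not paused and occupancy >= xoff:
            paused = True
            events.append((index, "XOFF"))
        elif paused and occupancy <= xon:
            paused = False
            events.append((index, "XON"))
    return events


# Buffer occupancy in percent, sampled over time (illustrative numbers).
print(pause_signals([10, 50, 75, 60, 40, 25, 80], xoff=70, xon=30))
# [(2, 'XOFF'), (5, 'XON'), (6, 'XOFF')]
```

Note that the samples at 60% and 40% produce no signal: once paused, the port stays paused until occupancy falls to the XON level.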
Pause Storm Risk
If a device continuously sends PAUSE frames without clearing its buffer, it can block the entire traffic path. **PFC Watchdogs** are critical for identifying and disabling misbehaving endpoints.
ETS: Enhanced Transmission Selection (802.1Qaz)
While PFC prevents drops, ETS ensures fair bandwidth distribution. It allows network architects to define **Bandwidth Groups** and assign weights, replacing the primitive "Strict Priority" scheduling which could easily starve management and storage traffic.
| Traffic Class | Weight (Example) | PFC Status |
|---|---|---|
| Priority 3 (RoCE v2) | 80% | ENABLED |
| Priority 4 (Storage) | 15% | ENABLED |
| Priority 0 (Management) | 5% | DISABLED |
PFC Head-of-Line Blocking Dynamics and ETS Bandwidth Group Quantization
While PFC prevents packet loss, it introduces a subtle but destructive side effect: **Head-of-Line (HoL) blocking**. When a downstream switch sends a PFC PAUSE frame for priority 3 (RDMA traffic), the upstream switch stops transmitting all packets in that priority queue, including packets destined for other ports that are not congested. The blocked packets hold their position in the shared buffer, preventing subsequent packets behind them from being forwarded even if their destination ports are idle.
The severity of HoL blocking depends on the per-port buffer allocation scheme. Modern data center switches use **per-priority per-port** buffer pools rather than a single shared pool. The IEEE 802.1Qbb standard recommends a minimum of 2 MTU (3,000 bytes at 1500B MTU) of dedicated buffer per priority per port before triggering XOFF. At 400 Gbps, this 3,000-byte buffer fills in 60 ns, so the XOFF threshold must be set significantly higher (typically 100-200 KB per priority per port), and the buffer needs headroom above XOFF to absorb the data that keeps arriving while the PAUSE frame propagates and takes effect. On a 500-meter fiber link the round-trip time is approximately 5 µs, which at 400 Gbps corresponds to about 250 KB in flight.
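The arithmetic behind these figures is easy to reproduce. The helpers below are a sketch with hypothetical names; the fiber delay of 5 ns per meter is an assumed round figure, and real headroom sizing also has to account for switch and transceiver latency, which this sketch ignores.

```python
def fill_time_ns(buffer_bytes, link_gbps):
    """Time to fill a buffer at line rate: bits / (Gbit/s) gives nanoseconds."""
    return buffer_bytes * 8 / link_gbps


def fiber_rtt_us(length_m, ns_per_m=5.0):
    """Round-trip propagation delay; 5 ns/m is an assumed fiber delay."""
    return 2 * length_m * ns_per_m / 1000


def in_flight_bytes(link_gbps, rtt_us):
    """Bytes that arrive during one round trip: Gbit/s * µs = 1000 bits."""
    return link_gbps * rtt_us * 1000 / 8


print(fill_time_ns(3000, 400))     # 60.0 ns for the 2-MTU buffer
print(fiber_rtt_us(500))           # 5.0 µs on a 500 m link
print(in_flight_bytes(400, 5.0))   # 250000.0 bytes still arriving after XOFF
```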
ETS bandwidth group quantization introduces another subtlety. Each traffic class receives a minimum bandwidth guarantee expressed as a percentage of the link capacity. Because the scheduler transmits whole packets, however, that allocation can only be approximated over a scheduling round. For a 400 Gbps link with priority 3 allocated 80% (320 Gbps) and priority 4 allocated 20% (80 Gbps), the scheduler must distribute packets in a 4:1 ratio over a round of 5 packet transmissions. With jumbo frames (9,000 bytes), each packet takes 180 ns to transmit, so a round consists of one priority 4 packet (180 ns) and four priority 3 packets (36,000 bytes, 720 ns), for 900 ns in total. Any deviation from this exact ratio accumulates as a bandwidth error over the round.
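The 4:1 round can be sketched with a packet-based weighted round robin. This is a simplification for illustration: production ETS schedulers are usually byte-based (for example deficit weighted round robin), and the names `wrr_round` and `round_stats` are hypothetical.

```python
def wrr_round(weights):
    """Return the service order of one interleaved weighted round robin round.

    weights maps a queue name to the number of packets it may send per round.
    """
    remaining = dict(weights)
    order = []
    while any(remaining.values()):
        for queue in weights:
            if remaining[queue] > 0:
                order.append(queue)
                remaining[queue] -= 1
    return order


def round_stats(order, frame_bytes, link_gbps):
    """Return (per-frame ns, round ns, share of the round per queue)."""
    frame_ns = frame_bytes * 8 / link_gbps
    share = {queue: order.count(queue) / len(order) for queue in set(order)}
    return frame_ns, frame_ns * len(order), share


order = wrr_round({"prio3": 4, "prio4": 1})
print(order)  # ['prio3', 'prio4', 'prio3', 'prio3', 'prio3']
frame_ns, round_ns, share = round_stats(order, frame_bytes=9000, link_gbps=400)
print(frame_ns, round_ns)  # 180.0 900.0
```

With equal-sized frames the shares come out at exactly 80% and 20%; mixed frame sizes are where a packet-count scheduler starts to drift from the configured percentages.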
Tuning PFC thresholds requires balancing buffer utilization against HoL blocking risk. A commonly deployed heuristic sets XOFF at 70% of the per-priority per-port buffer and XON at 30%. The remaining 30% of the buffer above XOFF is headroom that absorbs traffic while the sender reacts, and keeping XON low minimizes the fraction of buffer consumed by paused traffic. At 400 Gbps every microsecond of reaction time consumes 50 KB, so the 5-10 µs of absorption this heuristic targets requires roughly 250-500 KB of headroom. PFC watchdog timers (default 100 ms) detect stuck flows by measuring sustained PAUSE duration and automatically disable PFC on the offending port after three consecutive watchdog firings.
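The 70/30 heuristic can be turned into a small calculator. The sketch below assumes a hypothetical 1 MB per-priority per-port buffer purely for illustration; `pfc_thresholds` is not a real switch API, and actual platforms expose thresholds in their own units (often cells rather than bytes).

```python
def pfc_thresholds(buffer_bytes, link_gbps, xoff_pct=70, xon_pct=30):
    """Apply the XOFF/XON percentage heuristic to one buffer.

    Returns the two thresholds in bytes, the headroom above XOFF,
    and how long that headroom can absorb line-rate traffic.
    """
    if not 0 < xon_pct < xoff_pct < 100:
        raise ValueError("need 0 < XON% < XOFF% < 100")
    xoff = buffer_bytes * xoff_pct // 100
    xon = buffer_bytes * xon_pct // 100
    headroom = buffer_bytes - xoff
    absorb_us = headroom * 8 / (link_gbps * 1000)  # bits / (bits per µs)
    return {"xoff": xoff, "xon": xon, "headroom": headroom, "absorb_us": absorb_us}


# Assumed 1 MB buffer on a 400 Gbps port.
print(pfc_thresholds(1_000_000, 400))
# {'xoff': 700000, 'xon': 300000, 'headroom': 300000, 'absorb_us': 6.0}
```

Running the same function with a smaller buffer shows how quickly the absorption window shrinks, which is why the percentages alone are not enough to size a lossless class.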
Per-Priority Flow Control Deadlock Detection and Recovery
Priority Flow Control (PFC) deadlocks are the single most destructive failure mode in RoCE-based AI fabrics. A PFC deadlock occurs when two switches are each waiting for the other to drain their receive buffers before resuming transmission, creating a circular dependency that freezes all traffic on the affected priorities. Unlike TCP, which recovers through timeout-triggered retransmission, PFC has no built-in deadlock recovery mechanism. A port stays paused for as long as its neighbor keeps refreshing the pause, and the XON frame that would release it never arrives if both sides are waiting.
The root cause of PFC deadlocks is **Credit Starvation** in multi-hop topologies. Consider three switches A, B, C in a leaf-spine-leaf topology, where leaf A transmits to leaf C via spine B. Switch A sends 9000-byte jumbo frames to B. Switch B's egress port to C becomes congested, and B sends a PFC XOFF to A for priority 3. A stops transmitting priority 3 traffic to B, but A's egress buffer to B is now filled with priority 3 packets that cannot be drained because B's pause has frozen A's transmission. If A is also receiving priority 3 traffic from C (via B) at the same time, A's backed-up buffers cause it to pause B in turn, B's buffer of packets destined for A fills, and B sends XOFF to C, whose egress buffer to B then fills as well. Now A is waiting on B, B is waiting on C, and C is waiting on A (through B): a classic circular wait that no switch can resolve independently.
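The circular wait described above is a cycle in a wait-for graph, and finding one is a standard depth-first search. The sketch below is illustrative only: the graph is entered by hand, the function name `find_cycle` is hypothetical, and a real fabric would have to derive the dependencies from per-port pause state.

```python
def find_cycle(waits_for):
    """Return one cycle in a wait-for graph as a list of nodes, or None.

    waits_for maps each node to the nodes it is waiting on.
    """
    visiting, done, path = set(), set(), []

    def visit(node):
        visiting.add(node)
        path.append(node)
        for nxt in waits_for.get(node, ()):
            if nxt in visiting:  # back edge: we have walked into our own path
                return path[path.index(nxt):] + [nxt]
            if nxt not in done:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        visiting.discard(node)
        done.add(node)
        path.pop()
        return None

    for start in waits_for:
        if start not in done:
            cycle = visit(start)
            if cycle:
                return cycle
    return None


# The circular wait from the example: A waits on B, B on C, C on A.
print(find_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}))  # ['A', 'B', 'C', 'A']
# Remove one dependency and the deadlock disappears.
print(find_cycle({"A": ["B"], "B": ["C"], "C": []}))     # None
```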
Modern data center switches implement **PFC Deadlock Detection (PDD)** using a watchdog timer per priority per port. If a port remains in the paused state (XOFF received, no XON received) for longer than the **PFC Deadlock Detection Timer** (default 250 ms on Spectrum-4), the switch assumes a deadlock and enters **Deadlock Recovery Mode**. The recovery mode forcibly flushes the paused port's receive buffer — dropping all packets in the buffer regardless of their priority or destination. After flushing, the switch sends an XON frame to the neighbor, breaking the circular dependency. The dropped packets are recovered by the RDMA transport layer (Go-Back-N retransmission), which is slower than PFC drop-free operation but far better than an infinite freeze.
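The detection logic amounts to a per-priority, per-port timer that starts on XOFF and is cleared on XON. The class below is a minimal software sketch of that idea, assuming the 250 ms figure quoted above as its default; `PfcWatchdog` and its methods are hypothetical names, and real switches implement this in hardware or firmware with vendor-specific recovery actions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PfcWatchdog:
    """Deadlock timer for one priority on one port (illustrative only)."""
    timeout_ms: float = 250.0
    paused_since_ms: Optional[float] = None
    recoveries: int = 0

    def on_xoff(self, now_ms: float) -> None:
        # Start timing only on the first XOFF; refreshes do not reset the clock.
        if self.paused_since_ms is None:
            self.paused_since_ms = now_ms

    def on_xon(self, now_ms: float) -> None:
        self.paused_since_ms = None

    def poll(self, now_ms: float) -> bool:
        """Return True when the pause has lasted longer than the timeout.

        A True result stands in for the recovery action: flush the
        buffer and release the pause. The timer is then cleared.
        """
        if self.paused_since_ms is None:
            return False
        if now_ms - self.paused_since_ms <= self.timeout_ms:
            return False
        self.paused_since_ms = None
        self.recoveries += 1
        return True


wd = PfcWatchdog()
wd.on_xoff(0)
wd.on_xon(30)          # a 30 ms pause ends normally
wd.on_xoff(100)
print(wd.poll(349))    # False: paused for 249 ms
print(wd.poll(351))    # True: paused for more than 250 ms
```

Because refreshed XOFF frames do not restart the timer, a neighbor that keeps re-sending pauses cannot hide a stuck port from the watchdog.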
The PDD timer value is a critical tuning parameter. If set too short (below 50 ms), normal congestion events during All-Reduce, where a port may legitimately remain paused for 20-30 ms, trigger false deadlock detection, causing unnecessary packet drops that degrade throughput by 5-10%. If set too long (above 1 second), a genuine deadlock freezes the fabric for a full second or more, stalling every training step that depends on the affected links. The recommended value for AI training fabrics is 500 ms: long enough to distinguish between normal congestion and deadlock, short enough to limit the detection delay to half a second. Coupled with randomized CNP jitter (which prevents the synchronous rate reductions that lead to deadlocks), the PDD mechanism reduces deadlock incidence to fewer than one event per 10,000 hours of cluster operation in well-tuned 800G fabrics.
