The Signal Before the Stall

In a high-speed AI fabric, congestion is inevitable. When thousands of GPUs synchronize gradients simultaneously (All-Reduce), switches experience massive bursts of traffic. **Explicit Congestion Notification (ECN)** is the protocol mechanism that tells senders to slow down before these bursts fill switch buffers entirely, a condition often called bufferbloat that causes large latency spikes.

WRED Detection

Switches use Weighted Random Early Detection (WRED) to identify buffer growth. Instead of dropping packets, they set the two ECN bits in the IP header of ECN-capable packets to binary '11' (CE, Congestion Experienced).
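The marking step can be sketched in a few lines. This is a minimal illustration, assuming the RFC 3168 layout in which the ECN field occupies the two least-significant bits of the IPv4 TOS (or IPv6 Traffic Class) byte. The helper names `mark_congestion` and `is_congestion_experienced` are hypothetical, not part of any switch API.

```python
# ECN codepoints (RFC 3168): low two bits of the TOS / Traffic Class byte.
NOT_ECT = 0b00  # sender is not ECN-capable
ECT_1 = 0b01    # ECN-capable transport, codepoint 1
ECT_0 = 0b10    # ECN-capable transport, codepoint 0
CE = 0b11       # Congestion Experienced

ECN_MASK = 0b11


def mark_congestion(tos_byte: int) -> int:
    """Return the TOS byte after the switch decides to mark this packet.

    Only ECN-capable packets are rewritten to CE. Not-ECT packets are
    returned unchanged (a real switch would fall back to dropping them).
    """
    ecn = tos_byte & ECN_MASK
    if ecn in (ECT_0, ECT_1):
        return (tos_byte & ~ECN_MASK & 0xFF) | CE
    return tos_byte


def is_congestion_experienced(tos_byte: int) -> bool:
    """Receiver-side check for the CE codepoint."""
    return (tos_byte & ECN_MASK) == CE
```

The DSCP bits in the upper six positions are left untouched, so the packet keeps its traffic class while carrying the congestion signal.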

CNP Notification

When a receiver sees a 'CE' marked packet, it sends a **Congestion Notification Packet (CNP)** back to the source NIC, requesting immediate rate reduction.
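A receiver that answered every CE-marked packet with its own CNP would flood the reverse path, so implementations typically pace CNPs per flow. The sketch below assumes a minimum interval of 50 µs between CNPs for the same flow. Both that value and the `NotificationPoint` class are illustrative assumptions rather than a NIC API.

```python
class NotificationPoint:
    """Receiver-side CNP generator (hypothetical helper for illustration)."""

    def __init__(self, min_interval_us: float = 50.0):
        # Assumed pacing interval; real NICs expose this as a tunable.
        self.min_interval_us = min_interval_us
        self._last_cnp_us = {}  # flow id -> time the last CNP was sent

    def on_packet(self, flow_id, ce_marked: bool, now_us: float) -> bool:
        """Return True if a CNP should be sent in response to this packet."""
        if not ce_marked:
            return False
        last = self._last_cnp_us.get(flow_id)
        if last is not None and now_us - last < self.min_interval_us:
            return False  # a CNP for this flow went out recently
        self._last_cnp_us[flow_id] = now_us
        return True
```

Pacing is tracked per flow, so congestion on one flow never suppresses the notification for another.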

The DCQCN Algorithm Cycle


For RoCE v2 fabrics, the industry standard is **DCQCN** (Data Center Quantized Congestion Notification). It operates in a four-step recurring cycle:

01

Marking Threshold (Kmin/Kmax)

The switch starts marking ECN bits when the buffer exceeds Kmin and increases probability until Kmax.

02

CNP Feedback

The Receiver (Notification Point) generates CNPs whenever it encounters CE bits in the RDMA stream.

03

Rate Reduction

The Source (Sender, or Reaction Point) processes CNPs and immediately throttles its transmission rate to prevent buffer overflow.

04

Rate Recovery

If no CNPs are received for a specific time window, the source incrementally ramps back up to full line-rate.
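The four steps can be tied together in a toy simulation. This is a deliberately simplified sketch, not the DCQCN state machine: it assumes one sender, one bottleneck, deterministic marking as soon as the queue exceeds Kmin, zero feedback delay, and a fixed cut factor. All parameter values and the function name `run_cycle` are hypothetical.

```python
def run_cycle(steps=200, line_rate=50.0, drain_rate=25.0,
              kmin_kb=100.0, cut_factor=0.5, recover_kb_per_us=0.5):
    """Simulate one sender behind one bottleneck in 1 µs steps.

    Rates are in KB/µs (50 KB/µs corresponds to 400 Gbps).
    Returns the per-step (queue_kb, rate) history.
    """
    rate = line_rate
    queue_kb = 0.0
    history = []
    for _ in range(steps):
        # Queue grows by arrivals minus departures and cannot go negative.
        queue_kb = max(0.0, queue_kb + rate - drain_rate)
        # Step 1: the switch marks once the queue exceeds Kmin.
        marked = queue_kb > kmin_kb
        # Step 2: the receiver turns a CE mark into a CNP (no delay modelled).
        cnp = marked
        if cnp:
            # Step 3: the sender cuts its rate.
            rate *= cut_factor
        else:
            # Step 4: no CNP, so the sender creeps back toward line rate.
            rate = min(line_rate, rate + recover_kb_per_us)
        history.append((queue_kb, rate))
    return history
```

Even in this crude model the queue overshoots Kmin before the rate cut takes effect, which is why real deployments leave headroom between Kmin and the PFC threshold.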

ECN Threshold Tuner

Tuning Kmin and Kmax is a balancing act. Set the thresholds too aggressively (too low) and you sacrifice throughput. Set them too laxly (too high) and you trigger PFC pause storms.

Managing 'Incast' Congestion

The most common cause of buffer buildup in AI clusters is **Incast**, where multiple senders transmit to a single receiver simultaneously. ECN handles Incast well because:

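  • Fine-Grained Throttling: Unlike PFC, which pauses an entire priority class on the link, ECN marks packets only in the congested queue, so only the flows passing through it slow down.
  • Near-Zero Packet Loss: By signaling at 50-80% buffer utilization, ECN gives senders time to slow down before the buffer reaches 100%, leaving PFC as a backstop that rarely needs to fire.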
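To see why early signaling matters, it helps to estimate how quickly an Incast burst fills a buffer. The sketch below assumes every sender transmits at line rate into one egress port draining at the same line rate, so the queue grows at (N - 1) times the line rate. The function name and the example numbers are hypothetical.

```python
def incast_fill_time_us(num_senders: int, link_gbps: float, buffer_kb: float) -> float:
    """Time in µs for N line-rate senders to fill `buffer_kb` at one egress port."""
    rate_kb_per_us = link_gbps / 8.0  # 1 Gbps = 0.125 KB/µs (decimal KB)
    net_fill_rate = (num_senders - 1) * rate_kb_per_us
    if net_fill_rate <= 0:
        return float("inf")  # a single sender never builds a queue
    return buffer_kb / net_fill_rate


# Eight 400 Gbps senders converging on one port, 350 KB of buffer:
print(incast_fill_time_us(8, 400, 350))
```

With these example numbers the buffer fills in about one microsecond, which is shorter than most fabric RTTs and explains why marking has to start well before the buffer is full.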

Kmin/Kmax Calibration Curves and Alpha Scaling in DCQCN Rate Recovery

The DCQCN algorithm's effectiveness hinges on three switch parameters: Kmin, Kmax, and Pmax. Kmin defines the buffer occupancy threshold where ECN marking begins, Kmax defines the occupancy above which every packet is marked, and Pmax specifies the marking probability reached at Kmax. Between Kmin and Kmax the marking probability follows a linear ramp: P(mark) = Pmax x (Qlen - Kmin) / (Kmax - Kmin). At 800 Gbps, a port receiving at line rate with no drain fills its buffer at 100 GB/s (100 KB per µs), meaning the ramp from Kmin to Kmax (typically set to Kmax = 3 x Kmin) takes only 2-5 µs for Kmin values of 100-250 KB.
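The ramp and its duration can be written down directly. This is a minimal sketch of the formula above, assuming the usual RED-style behaviour in which the probability is zero at or below Kmin and jumps to 1 above Kmax. The function names and the example values (Kmin = 100 KB, Kmax = 300 KB, Pmax = 0.1) are illustrative.

```python
def marking_probability(qlen_kb: float, kmin_kb: float, kmax_kb: float, pmax: float) -> float:
    """ECN marking probability for a given queue length."""
    if qlen_kb <= kmin_kb:
        return 0.0
    if qlen_kb > kmax_kb:
        return 1.0  # above Kmax every packet is marked
    return pmax * (qlen_kb - kmin_kb) / (kmax_kb - kmin_kb)


def ramp_time_us(kmin_kb: float, kmax_kb: float, link_gbps: float) -> float:
    """Time to fill from Kmin to Kmax at full line rate with no drain."""
    fill_rate_kb_per_us = link_gbps / 8.0  # 800 Gbps -> 100 KB/µs
    return (kmax_kb - kmin_kb) / fill_rate_kb_per_us


print(marking_probability(200, 100, 300, 0.1))  # halfway up the ramp
print(ramp_time_us(100, 300, 800))              # Kmax = 3 x Kmin at 800 Gbps
```

Because the ramp lasts only a few microseconds at these speeds, the width Kmax - Kmin matters as much as Kmin itself: a narrow ramp behaves almost like a hard threshold.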

The choice of Kmin directly determines the equilibrium buffer occupancy. If Kmin is set too low (below approximately 50 KB for a 400 Gbps link), the switch marks packets during normal micro-bursts that would not cause sustained congestion, unnecessarily reducing throughput. If Kmin is set too high (above approximately 200 KB on a short-RTT path), the buffer may fill before the ECN feedback loop can react, triggering PFC as the emergency brake. These absolute figures scale with path RTT, so a common rule of thumb sizes Kmin relative to the Bandwidth-Delay Product (BDP) of the link instead, typically at 1.5x BDP. For a 400 Gbps link with 5 µs RTT, the BDP is 50 GB/s x 5 µs = 250 KB, giving a Kmin of approximately 375 KB. That is above the 200 KB ceiling quoted for short paths, so on longer-RTT paths the BDP-based value applies only if the per-port buffer and PFC thresholds leave enough headroom above it.
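The BDP arithmetic is easy to reproduce for other link speeds and RTTs. The helper names below are hypothetical, and the 1.5x multiplier is the rule of thumb quoted above rather than a universal constant.

```python
def bdp_kb(link_gbps: float, rtt_us: float) -> float:
    """Bandwidth-Delay Product in KB (decimal) for a link and round-trip time."""
    rate_kb_per_us = link_gbps / 8.0  # 400 Gbps -> 50 KB/µs
    return rate_kb_per_us * rtt_us


def kmin_from_bdp(link_gbps: float, rtt_us: float, multiplier: float = 1.5) -> float:
    """Rule-of-thumb Kmin as a multiple of the BDP."""
    return multiplier * bdp_kb(link_gbps, rtt_us)


print(bdp_kb(400, 5))         # 250.0 KB
print(kmin_from_bdp(400, 5))  # 375.0 KB
```

Running the same calculation for a 2 µs same-rack RTT gives a BDP of 100 KB and a Kmin of 150 KB, which shows how strongly the result depends on the RTT assumed.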

The alpha scaling factor controls how aggressively the sender reduces its rate upon receiving CNP feedback. DCQCN defines alpha as a per-flow variable updated once per update window as: alpha = (1 - g) x alpha + g x (number_of_CNPs / total_packets). The gain parameter g (typically 1/16 or 1/8) determines the smoothing: a higher g converges faster but introduces rate oscillation. When a CNP is received, the sender immediately cuts its rate in proportion to alpha: new_rate = rate x (1 - alpha / 2). The cut is a full halving only when alpha = 1 and is smaller for lower alpha. The rate never drops below the minimum rate (a NIC configuration parameter). In the recovery phase, the rate increases linearly by approximately 5% per RTT for as long as no CNPs are received.
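The sender-side update rules above can be sketched as three small functions. This follows the simplified model described in this section, not a specific NIC implementation. The function names are hypothetical, and interpreting "5% per RTT" as an additive step of 5% of line rate is an assumption made here to keep the increase linear.

```python
def update_alpha(alpha: float, g: float, cnp_fraction: float) -> float:
    """Exponentially weighted congestion estimate, updated once per window.

    cnp_fraction is number_of_CNPs / total_packets for the window.
    """
    return (1.0 - g) * alpha + g * cnp_fraction


def on_cnp(rate: float, alpha: float, min_rate: float) -> float:
    """Rate cut applied when a CNP arrives, floored at the minimum rate."""
    return max(min_rate, rate * (1.0 - alpha / 2.0))


def recover(rate: float, line_rate: float, step_fraction: float = 0.05) -> float:
    """One recovery step (assumed: additive, 5% of line rate, capped at line rate)."""
    return min(line_rate, rate + step_fraction * line_rate)


# A 400 Gbps flow with alpha = 1 is halved; with alpha = 0.5 it loses a quarter.
print(on_cnp(400.0, 1.0, 1.0))
print(on_cnp(400.0, 0.5, 1.0))
# With no CNPs in a window, alpha decays by a factor of (1 - g).
print(update_alpha(1.0, 1 / 16, 0.0))
```

Because alpha decays while the path is quiet, a flow that sees an isolated CNP after a long calm period is cut far less than one that has been receiving CNPs continuously.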

Tuning the rate recovery pace is critical for AI training workloads. During All-Reduce, all flows burst simultaneously, triggering ECN marking across all ports. If recovery is too fast, the flows synchronize and oscillate between rate reduction and recovery, a phenomenon known as **flow synchronization collapse**. Randomizing the recovery start time by introducing a jitter of +/- one RTT breaks this synchronization and improves aggregate throughput by 15-20% in measured deployments.

ECN RTT Fairness and Flow Synchronization Collapse

ECN-based congestion control has a well-documented fairness problem: flows with shorter Round-Trip Times (RTTs) receive a disproportionately large share of the bandwidth because they detect and respond to congestion faster. In an AI fabric where GPU-to-GPU RTTs range from 2 microseconds (same rack) to 20 microseconds (cross-cluster), this RTT unfairness can cause severe bandwidth allocation imbalance. A same-rack flow can capture 3-5x more bandwidth than a cross-cluster flow under the same DCQCN configuration, leading to straggler effects in All-Reduce operations, which complete only when the slowest flow finishes.

The RTT fairness problem is rooted in the DCQCN rate adaptation formula. When a CNP is received, the sender rate is cut by up to half: new_rate = rate x (1 - alpha/2). The short-RTT flow receives CNP feedback faster and can begin recovery sooner. After the rate reduction, both flows enter the recovery phase where the rate is increased by approximately 5% per RTT. The short-RTT flow experiences more recovery cycles per unit time, allowing it to regain its share faster. Over a 10-millisecond window, a 2-microsecond RTT flow experiences 5,000 recovery cycles while a 20-microsecond RTT flow experiences only 500. As a result, the short flow's rate remains at 85% of peak while the long flow's rate oscillates between 40% and 60%.

The mitigation is **RTT-Independent Rate Recovery (RIR)**, where the recovery rate increase is tied to a global timer rather than the RTT. Instead of increasing the rate by 5% per RTT, the sender increases the rate by a fixed amount (e.g., 1 Gbps) every 100 microseconds, regardless of RTT. This decouples recovery speed from path latency so that all flows converge toward the same equilibrium rate. NVIDIA's Spectrum-4 switch supports RIR through the **DCQCN Extended** mode, which takes the complementary approach of including an RTT measurement field in the CNP packet so that the sender can normalize its recovery rate against the measured path RTT.
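The difference between the two recovery schemes shows up clearly when both are run over the same time window. The sketch below is an illustration under stated assumptions: per-RTT recovery adds a fixed 20 Gbps step (5% of a 400 Gbps line rate) every RTT, while timer-based recovery adds 1 Gbps every 100 µs. The function names and the starting rate are hypothetical.

```python
def per_rtt_recovery(start_gbps: float, line_gbps: float, window_us: float,
                     rtt_us: float, step_gbps: float = 20.0) -> float:
    """Rate after `window_us` when one recovery step is taken per RTT."""
    steps = int(window_us // rtt_us)
    return min(line_gbps, start_gbps + steps * step_gbps)


def timer_recovery(start_gbps: float, line_gbps: float, window_us: float,
                   step_gbps: float = 1.0, period_us: float = 100.0) -> float:
    """Rate after `window_us` when one recovery step is taken per timer tick."""
    steps = int(window_us // period_us)
    return min(line_gbps, start_gbps + steps * step_gbps)


# Both flows start at 200 Gbps after a cut and recover for 100 µs.
print(per_rtt_recovery(200, 400, 100, rtt_us=2))   # short-RTT flow
print(per_rtt_recovery(200, 400, 100, rtt_us=20))  # long-RTT flow
print(timer_recovery(200, 400, 100))               # identical for both flows
```

Under per-RTT recovery the short-RTT flow is back at line rate while the long-RTT flow is only halfway there. The timer-based variant takes no RTT argument at all, which is exactly the property that makes it fair, at the cost of a slower ramp with these example step sizes.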

Flow synchronization collapse is a separate but related phenomenon. When multiple flows share the same bottleneck and simultaneously detect congestion, they reduce their rates in phase. The bottleneck queue then drains completely before all flows ramp back up together, recreating the congestion in a repeating cycle. This synchronized oscillation reduces aggregate throughput by 20-30%. The standard fix is **Randomized CNP Jitter**: each sender waits for a random period (0 to 50 microseconds) before processing a CNP, effectively de-correlating the rate reduction timing across flows. With jitter enabled, the bottleneck utilization smooths from a 30% amplitude oscillation to under 5%, restoring near-theoretical throughput.
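The jitter itself is simple to express. This is a minimal sketch, assuming each sender draws an independent uniform delay between 0 and 50 µs before acting on a CNP. The function name is hypothetical, and the seed parameter exists only to make the example reproducible.

```python
import random


def cnp_processing_times(arrival_us: float, num_flows: int,
                         max_jitter_us: float = 50.0, seed=None) -> list:
    """Times at which each flow acts on a CNP that arrived at `arrival_us`.

    Each flow adds its own uniform random delay in [0, max_jitter_us].
    """
    rng = random.Random(seed)
    return [arrival_us + rng.uniform(0.0, max_jitter_us) for _ in range(num_flows)]


# Eight flows receive a CNP at t = 1000 µs; their rate cuts are now spread out.
times = cnp_processing_times(1000.0, 8, seed=1)
print(sorted(round(t, 1) for t in times))
```

Setting `max_jitter_us` to zero reproduces the synchronized case, where every flow cuts its rate at the same instant.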


Technical Standards & References

REF [rfc-3168]
RFC 3168 (2001)
The Addition of Explicit Congestion Notification (ECN) to IP
Published: IETF
REF [roce-dcqcn-2015]
Mellanox/Microsoft (2015)
Congestion Control for Large-Scale RDMA Deployments (DCQCN)
Published: SIGCOMM