TCP Congestion Control
The Balancing Act of Throughput
The Congestion Window (CWND)
Regardless of the algorithm, TCP uses a Congestion Window to limit how many packets can be 'in flight'—sent but not yet acknowledged.
TCP Window Scaling
Stop-and-Wait vs. Sliding Window Pipeline
Pipeline (Sliding Window): The server fills the "pipe" with packets. It doesn't wait for ACKs to send the next packet. As long as the window is open, data flows continuously.
The Receive Window (RWND) is the receiver's buffer advert. In the original 1981 TCP specification (RFC 793), this was limited to 16 bits (64 KB). In modern high-speed networks, this is a massive bottleneck.
1. Window Scaling Math ()
To overcome the 64 KB limit, RFC 1323 introduced Window Scaling. It uses a bit shift count in the TCP options during the 3-way handshake.
With a maximum Scale Factor of 14, the TCP window can grow up to 1 GB ( bytes). This is essential for filling the Bandwidth-Delay Product (BDP) on long-haul fiber links.
The Mathis Equation: The Theoretical Limit of TCP
In the presence of random packet loss (unrelated to congestion, such as bit errors on fiber or wireless interference), TCP throughput is mathematically bounded by the Mathis Equation. This model assumes a Reno-style sawtooth behavior but provides a critical forensic baseline for all loss-based algorithms.
- p:The packet loss probability (0.01 = 1% loss)
- MSS:Maximum Segment Size (typically 1460 bytes)
"Forensic Note: If your throughput is lower than the Mathis limit, you likely have a 'Window Limit' or an MTU black hole. If it matches, your bottleneck is the physical loss rate of the medium."
The difference between congestion algorithms lies in how they grow and shrink the Congestion Window (CWND) in response to the network's feedback loop.
2. CUBIC: Growing until Failure
CUBIC is a loss-based algorithm and the current default in Linux. It increases the window according to a cubic function of the time since the last congestion event ().
- : The window size before the last reduction.
- : The time period it takes to reach again.
- : A scaling constant (typically 0.4).
2.1. The Math of (Convergence Time)
CUBIC calculates the time it should take to reach the previous using the following derivation:
Where is the multiplicative decrease factor (typically 0.7 or 0.8 in modern kernels). This ensures that the time to recover from a drop is independent of the RTT, preventing the "RTT Fairness" problem where short-latency flows starve long-latency flows.
The 1986 Congestion Collapse
"In October 1986, the throughput from LBL to UC Berkeley (a 400-yard distance) dropped from 32 Kbps to 40 bps. This was the first documented Congestion Collapse. Van Jacobson's subsequent implementation of Slow Start and Congestion Avoidance in BSD Unix saved the early internet from a complete death spiral."
3. BBR: Bottleneck Bandwidth & RTT
Developed by Google, BBR is model-based. Unlike CUBIC, which reacts to loss, BBR attempts to find the Kleinrock optimal operating point where the delivery rate is maximized and the round-trip time is minimized.
The BDP Control Loop
BBR maintains a moving windowed max of delivery rate and a moving windowed min of RTT. It regulates the pacing rate and the congestion window to match the physical capacity of the pipe.
BBR iterates through four states to maintain its model:
- Startup: Exponentially increases pacing rate until delivery rate plateaus (similar to Slow Start).
- Drain: Lowers the rate to drain any queues built up during Startup.
- ProbeBW: Most steady-state time. It cycles its gain (e.g., 1.25, 0.75, 1.0) to probe for bandwidth while clearing potential queues.
- ProbeRTT: Every 10 seconds, it reduces the window to just 4 packets to measure the true physical .
4. Loss Recovery Forensics: Beyond Duplicate ACKs
Modern TCP stacks no longer rely solely on the simple "3 Duplicate ACKs" rule from the 1980s. Two critical technologies have revolutionized how TCP handles packet loss in high-stakes environments:
RACK-TLP (Recent ACK)
RACK uses time instead of sequence numbers to detect loss. It looks at the arrival time of ACKs to infer that a missing packet is lost if a later packet was acknowledged more than a "reordering window" ago. This makes TCP much more resilient to reordered packets in multipath (ECMP) networks.
Tail Loss Probe (TLP)
When a flow ends or drops packets at the very end of a window, there are no subsequent packets to trigger "Duplicate ACKs." This often leads to a multi-second RTO (Retransmission Timeout). TLP sends a "probe" packet to force an ACK, turning a 1-second timeout into a 10ms recovery.
5. Kernel-Level Tuning: The Data Plane
If you are engineering for 100G or 400G NICs, the default Linux kernel settings will throttle you. You must tune the Socket Buffer Memory limits to accommodate the massive BDP.
6. Case Study: The "LFN" Satellite Challenge
Consider a LEO satellite link (e.g., Starlink) with a bandwidth of 200 Mbps and an RTT of 40ms.
- BDP Calculation: bytes (1 MB).
- The Forensic Problem: Rain fade causes a burst loss of 1%.
- CUBIC Reaction: Detects loss, cuts window by 20%, takes seconds to recover. Throughput drops to Mbps.
- BBR Reaction: Detects that RTT is stable and delivery rate is unchanged by the burst. It maintains the 200 Mbps pacing rate.
In this forensic scenario, BBR provides a **5x throughput advantage** by correctly identifying that the loss was "stochastic" (noise) rather than "structural" (congestion).
Optimization is not a one-size-fits-all process. Understanding the physical constraints of your path—latency, jitter, and buffer depth—is the first step toward effective transport layer engineering.
Choosing the right congestion algorithm is a critical step in Reliability Engineering for long-distance data centers.