TCP Congestion Control
The Balancing Act of Throughput
The Congestion Window (CWND)
Regardless of the algorithm, TCP uses a Congestion Window to limit how many packets can be 'in flight'—sent but not yet acknowledged.
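As a sketch (hypothetical names, counting in whole packets rather than bytes), the in-flight limit works like this:

```python
# Minimal sketch of the in-flight rule: a sender may only transmit while
# the number of unacknowledged packets stays below the congestion window.
def can_send(next_seq: int, last_acked: int, cwnd_pkts: int) -> bool:
    in_flight = next_seq - last_acked  # packets sent but not yet ACKed
    return in_flight < cwnd_pkts

# With cwnd = 10, a sender at seq 100 whose last ACK covered seq 95
# has 5 packets in flight and may keep sending.
```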
TCP Window Scaling
Stop-and-Wait vs. Sliding Window Pipeline
Stop-and-Wait: The sender transmits one packet, then idles until its ACK returns before sending the next. On a long path, the link sits empty for most of each round trip.
Pipeline (Sliding Window): The server fills the "pipe" with packets. It doesn't wait for ACKs to send the next packet. As long as the window is open, data flows continuously.
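The throughput gap is easy to quantify: stop-and-wait delivers at most one segment per round trip, while a sliding window delivers a full window per round trip. A rough calculation (illustrative helper names):

```python
def stop_and_wait_bps(mss_bytes: float, rtt_s: float) -> float:
    # One segment per round trip.
    return mss_bytes * 8 / rtt_s

def sliding_window_bps(window_bytes: float, rtt_s: float) -> float:
    # A full window per round trip (ignoring the link-rate cap).
    return window_bytes * 8 / rtt_s

# A 1,460-byte MSS over a 100 ms path: stop-and-wait manages ~117 kbit/s,
# while a 64 KB window sustains ~5.2 Mbit/s over the same path.
```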
The Receive Window (RWND) is the receiver's advertised buffer space. In the original 1981 TCP specification (RFC 793), this was limited to 16 bits (64 KB). In modern high-speed networks, this is a massive bottleneck.
1. Window Scaling Math ($2^n$)
To overcome the 64 KB limit, RFC 1323 introduced Window Scaling. It uses a bit shift count in the TCP options during the 3-way handshake.
With a maximum Scale Factor of 14, the TCP window can grow up to 1 GB ($65,535 \times 2^{14} = 1,073,725,440$ bytes). This is essential for filling the Bandwidth-Delay Product (BDP) on long-haul fiber links.
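A quick sketch of the arithmetic (helper names are illustrative):

```python
def scaled_window(advertised: int, shift: int) -> int:
    # RFC 1323: effective window = advertised 16-bit window << shift count.
    return advertised << shift

def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    # Bandwidth-Delay Product: bytes that must be in flight to keep
    # the pipe full.
    return bandwidth_bps / 8 * rtt_s

# scaled_window(65535, 14) -> 1,073,725,440 bytes (~1 GB).
# A 10 Gbit/s link at 80 ms RTT needs ~100 MB in flight,
# far beyond the unscaled 64 KB limit.
```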
The difference between congestion algorithms lies in how they grow and shrink the Congestion Window (CWND).
2. CUBIC: Growing until Failure
CUBIC is a loss-based algorithm and the current default in Linux. It grows the window as a cubic function of the time since the last congestion event ($t$):

$W(t) = C(t - K)^3 + W_{max}$

- $W_{max}$: The window size before the last reduction.
- $K$: The time period it takes to reach $W_{max}$ again.
- $C$: A scaling constant (typically 0.4).
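The growth curve can be sketched directly from the RFC 8312 definitions; `beta` (the 0.7 multiplicative-decrease factor) and the derivation of $K$ follow the RFC, but the function below is an illustration, not a kernel implementation:

```python
def cubic_window(t: float, w_max: float, c: float = 0.4) -> float:
    # RFC 8312 growth function: W(t) = C*(t - K)^3 + W_max, where K is
    # the time needed to grow back to W_max after a loss-triggered
    # multiplicative decrease.
    beta = 0.7  # window reduction factor after loss (RFC 8312 default)
    k = ((w_max * (1 - beta)) / c) ** (1 / 3)
    return c * (t - k) ** 3 + w_max

# Immediately after a loss (t = 0) the window sits at beta * W_max; it
# flattens out near W_max, then probes aggressively beyond it.
```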
3. BBR: Bottleneck Bandwidth & RTT
Developed by Google, BBR is model-based. Unlike CUBIC, which reacts to loss, BBR attempts to find the Kleinrock optimal operating point where the delivery rate is maximized and the round-trip time is minimized.
The BDP Control Loop
BBR maintains a moving windowed max of delivery rate and a moving windowed min of RTT. It regulates the pacing rate and the congestion window to match the physical capacity of the pipe.
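A toy version of that model (the windowed estimation is simplified here to a plain max/min over recent samples; names and the `cwnd_gain` default are illustrative, following the published BBR v1 design):

```python
def bbr_cwnd_bytes(rate_samples_bps, rtt_samples_s, cwnd_gain=2.0):
    # Windowed-max of delivery rate and windowed-min of RTT; real BBR
    # uses sliding time windows (~10 RTTs for bandwidth, ~10 s for RTT).
    btlbw = max(rate_samples_bps)   # bottleneck bandwidth estimate
    rtprop = min(rtt_samples_s)     # round-trip propagation estimate
    bdp = btlbw / 8 * rtprop        # bytes needed to fill the pipe
    # cwnd_gain of 2 leaves headroom for delayed and aggregated ACKs.
    return cwnd_gain * bdp

# 100 Mbit/s bottleneck at a 40 ms floor: BDP = 500 kB, cwnd ~ 1 MB.
```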
BBR iterates through four states to maintain its model:
- Startup: Exponentially increases pacing rate until delivery rate plateaus (similar to Slow Start).
- Drain: Lowers the rate to drain any queues built up during Startup.
- ProbeBW: Most steady-state time. It cycles its gain (e.g., 1.25, 0.75, 1.0) to probe for bandwidth while clearing potential queues.
- ProbeRTT: Every 10 seconds, it reduces the window to just 4 packets to measure the true physical round-trip time ($RTT_{min}$).
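The ProbeBW gain cycle can be sketched as follows (the eight-phase cycle matches the published BBR v1 design; the names here are illustrative):

```python
# One phase probes above the estimated bandwidth (1.25), the next drains
# any queue that probing built (0.75), then six phases cruise at exactly
# the estimate.
GAIN_CYCLE = [1.25, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

def pacing_rate_bps(btlbw_bps: float, phase: int) -> float:
    return GAIN_CYCLE[phase % len(GAIN_CYCLE)] * btlbw_bps
```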
4. High-Speed Tuning: SACK, TS, and Autotuning
Beyond congestion algorithms, several TCP options are essential for modern high-bandwidth connections.
4.1. Selective Acknowledgments (SACK)
Without SACK (RFC 2018), if one packet in a window is lost, the sender might have to retransmit the entire window. SACK allows the receiver to say "I got packets 1, 2, 4, and 5, but I missed 3," enabling precise retransmission.
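A toy illustration of the idea (whole segment numbers instead of byte ranges, hypothetical names):

```python
def sack_retransmit(highest_sent: int, acked: set) -> list:
    # With SACK, the sender retransmits only the holes the receiver
    # reported, not the whole window. Segments numbered from 1.
    return [seq for seq in range(1, highest_sent + 1) if seq not in acked]

# Receiver reports 1, 2, 4, and 5 out of 5 sent: only segment 3 is resent.
```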
4.2. TCP Timestamps (PAWS)
At Gigabit speeds, TCP's 32-bit sequence numbers can wrap around in seconds. Timestamps (RFC 1323) provide Protection Against Wrapped Sequence numbers (PAWS), ensuring that stale segments from a previous wrap are not accepted as new data.
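The wrap time is simple to compute (illustrative helper name):

```python
def seq_wrap_time_s(rate_bps: float) -> float:
    # The 32-bit sequence space covers 2**32 bytes; at high rates it
    # recycles quickly, which is why PAWS timestamps are needed.
    return 2**32 / (rate_bps / 8)

# ~34 s at 1 Gbit/s, but only ~3.4 s at 10 Gbit/s.
```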
Linux Autotuning
Modern Linux kernels use TCP Autotuning. They dynamically adjust tcp_rmem and tcp_wmem based on available system memory and connection BDP. Manual tuning is rarely needed unless you are dealing with extreme high-performance compute (HPC) environments.
ECN Ready
Explicit Congestion Notification (ECN) allows routers to mark packets instead of dropping them. When both endpoints and the congestion controller honor the marks, senders can back off before queues overflow, reducing retransmission overhead.
Optimization is not a one-size-fits-all process. Understanding the physical constraints of your path—latency, jitter, and buffer depth—is the first step toward effective transport layer engineering.
Choosing the right congestion algorithm is a critical step in Reliability Engineering for long-distance data centers.