In a Nutshell

TCP is designed to be 'polite': it probes the network for available capacity and backs off when it encounters congestion. However, different algorithms use different triggers. This article compares traditional loss-based algorithms like CUBIC with Google's newer model-based BBR (Bottleneck Bandwidth and Round-trip propagation time), which estimates bandwidth and round-trip delay instead of waiting for loss.

The Congestion Window (CWND)

Regardless of the algorithm, TCP uses a Congestion Window to limit how many packets can be 'in flight'—sent but not yet acknowledged.

$$AllowedData = \min(RWND, CWND)$$

TCP Window Scaling

[Interactive figure: Stop-and-Wait vs. Sliding Window Pipeline between SERVER and CLIENT. Sample readout: throughput 940 Mbps, latency impact 120 ms RTT.]

Pipeline (Sliding Window): The server fills the "pipe" with packets. It doesn't wait for ACKs to send the next packet. As long as the window is open, data flows continuously.

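The Receive Window (RWND) is the buffer space the receiver advertises to the sender. In the original 1981 TCP specification (RFC 793), the window field was limited to 16 bits (64 KB). On modern high-speed networks this is a massive bottleneck, because a sender can have at most one window of data in flight per round trip.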
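To make the 64 KB limit concrete, here is a minimal sketch comparing stop-and-wait with a sliding window. The function names and the 1 Gbps link rate are illustrative assumptions; the 120 ms RTT is borrowed from the figure readout above.

```python
def stop_and_wait_bps(segment_bytes: int, rtt_s: float) -> float:
    """Stop-and-wait: one segment is delivered per round trip."""
    return segment_bytes * 8 / rtt_s


def sliding_window_bps(window_bytes: int, rtt_s: float, link_bps: float) -> float:
    """Sliding window: up to one full window per round trip, capped by the link rate."""
    return min(link_bps, window_bytes * 8 / rtt_s)


def allowed_in_flight(rwnd_bytes: int, cwnd_bytes: int) -> int:
    """The sender may have at most min(RWND, CWND) bytes unacknowledged."""
    return min(rwnd_bytes, cwnd_bytes)


if __name__ == "__main__":
    rtt = 0.120  # seconds
    print(f"stop-and-wait: {stop_and_wait_bps(1460, rtt) / 1e3:.1f} kbps")
    print(f"64 KB window : {sliding_window_bps(65535, rtt, 1e9) / 1e6:.3f} Mbps")
```

Under these assumptions an unscaled 64 KB window tops out at roughly 4.4 Mbps on a 120 ms path, whatever the link speed.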

1. Window Scaling Math ($2^n$)

To overcome the 64 KB limit, RFC 1323 introduced Window Scaling. It uses a bit shift count in the TCP options during the 3-way handshake.

$$EffectiveWindow = AdvertisedWindow \times 2^{ScaleFactor}$$

With a maximum Scale Factor of 14, the TCP window can grow up to about 1 GB ($65{,}535 \times 2^{14} = 1{,}073{,}725{,}440$ bytes). This is essential for filling the Bandwidth-Delay Product (BDP) on long-haul fiber links.
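As a rough illustration, the sketch below picks the smallest scale factor whose scaled window covers a given BDP. The helper names and the 10 Gbps / 100 ms example path are assumptions for illustration, not values from the RFC.

```python
MAX_UNSCALED_WINDOW = 65535  # largest value of the 16-bit window field
MAX_SCALE_FACTOR = 14        # maximum shift count noted above


def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> int:
    """Bandwidth-Delay Product in bytes."""
    return round(bandwidth_bps * rtt_s / 8)


def min_window_scale(target_window_bytes: int) -> int:
    """Smallest shift count whose scaled 16-bit window covers the target."""
    for shift in range(MAX_SCALE_FACTOR + 1):
        if (MAX_UNSCALED_WINDOW << shift) >= target_window_bytes:
            return shift
    raise ValueError("target exceeds the maximum scaled TCP window")


if __name__ == "__main__":
    bdp = bdp_bytes(10e9, 0.100)  # hypothetical 10 Gbps path with 100 ms RTT
    print(bdp, min_window_scale(bdp))
```

For this hypothetical path the BDP is 125,000,000 bytes, which needs a scale factor of 11.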

The Mathis Equation: The Theoretical Limit of TCP

In the presence of random packet loss (unrelated to congestion, such as bit errors on fiber or wireless interference), TCP throughput is mathematically bounded by the Mathis Equation. This model assumes a Reno-style sawtooth behavior but provides a critical forensic baseline for all loss-based algorithms.

$$Throughput \le \frac{MSS}{RTT \cdot \sqrt{p}} \cdot \sqrt{\frac{3}{2}}$$
  • p: The packet loss probability (0.01 = 1% loss)
  • MSS: Maximum Segment Size (typically 1460 bytes)
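The bound is easy to evaluate numerically. This is a minimal sketch of the formula above; the function name and the 100 ms example RTT are illustrative assumptions.

```python
import math


def mathis_limit_bps(mss_bytes: int, rtt_s: float, loss_prob: float) -> float:
    """Upper bound on Reno-style TCP throughput under random loss, in bits per second."""
    if not 0.0 < loss_prob <= 1.0:
        raise ValueError("loss_prob must be in (0, 1]")
    return (mss_bytes * 8) / (rtt_s * math.sqrt(loss_prob)) * math.sqrt(3 / 2)


if __name__ == "__main__":
    # Hypothetical path: 1460-byte MSS, 100 ms RTT, 1% random loss.
    print(f"{mathis_limit_bps(1460, 0.100, 0.01) / 1e6:.2f} Mbps")
```

Because the bound scales with $1/\sqrt{p}$, a fourfold increase in loss halves the achievable throughput.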

"Forensic Note: If your throughput is lower than the Mathis limit, you likely have a 'Window Limit' or an MTU black hole. If it matches, your bottleneck is the physical loss rate of the medium."

The difference between congestion algorithms lies in how they grow and shrink the Congestion Window (CWND) in response to the network's feedback loop.

2. CUBIC: Growing until Failure

CUBIC is a loss-based algorithm and the current default in Linux. It increases the window according to a cubic function of the time since the last congestion event ($t$).

$$W(t) = C \cdot (t - K)^3 + W_{max}$$
  • $W_{max}$: The window size before the last reduction.
  • $K$: The time period it takes to reach $W_{max}$ again.
  • $C$: A scaling constant (typically 0.4).

2.1. The Math of $K$ (Convergence Time)

CUBIC calculates the time $K$ it should take to reach the previous $W_{max}$ using the following derivation:

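$$K = \sqrt[3]{\frac{W_{max} \cdot (1 - \beta)}{C}}$$

Where $\beta$ is the multiplicative decrease factor, the fraction of $W_{max}$ the window keeps after a loss (typically 0.7 or 0.8 in modern kernels). The window therefore drops by $W_{max} \cdot (1 - \beta)$, and $K$ is the time the cubic curve needs to climb back by that amount. Because the window grows as a function of elapsed time rather than of ACK arrivals, the time to recover from a drop is independent of the RTT. This mitigates the "RTT Fairness" problem where short-latency flows starve long-latency flows.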
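The curve can be explored with a short sketch. It assumes the RFC 8312 convention, in which $\beta$ is the fraction of the window kept after a loss, so $K$ uses $(1 - \beta)$. Function names are illustrative, and the window is measured in segments.

```python
def cubic_k(w_max: float, beta: float = 0.7, c: float = 0.4) -> float:
    """Seconds needed to climb back to w_max after the window is cut to beta * w_max."""
    return (w_max * (1 - beta) / c) ** (1 / 3)


def cubic_window(t: float, w_max: float, beta: float = 0.7, c: float = 0.4) -> float:
    """Congestion window (segments) t seconds after the last congestion event."""
    k = cubic_k(w_max, beta, c)
    return c * (t - k) ** 3 + w_max


if __name__ == "__main__":
    w_max = 100.0
    k = cubic_k(w_max)
    for t in (0.0, k / 2, k, 1.5 * k):
        print(f"t={t:5.2f}s  W={cubic_window(t, w_max):7.2f}")
```

At $t = 0$ the window equals $\beta \cdot W_{max}$. It flattens as it approaches $W_{max}$ at $t = K$, then accelerates again to probe for new capacity.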

The 1986 Congestion Collapse

"In October 1986, the throughput from LBL to UC Berkeley (a 400-yard distance) dropped from 32 Kbps to 40 bps. This was the first documented Congestion Collapse. Van Jacobson's subsequent implementation of Slow Start and Congestion Avoidance in BSD Unix saved the early internet from a complete death spiral."

3. BBR: Bottleneck Bandwidth & RTT

Developed by Google, BBR is model-based. Unlike CUBIC, which reacts to loss, BBR attempts to find the Kleinrock optimal operating point where the delivery rate is maximized and the round-trip time is minimized.

The BDP Control Loop

$$TargetValue = BW_{max} \times RTT_{min} \times Gain$$

BBR maintains a moving windowed max of delivery rate and a moving windowed min of RTT. It regulates the pacing rate and the congestion window to match the physical capacity of the pipe.

BBR iterates through four states to maintain its model:

  • Startup: Exponentially increases pacing rate until delivery rate plateaus (similar to Slow Start).
  • Drain: Lowers the rate to drain any queues built up during Startup.
  • ProbeBW: Most steady-state time. It cycles its gain (e.g., 1.25, 0.75, 1.0) to probe for bandwidth while clearing potential queues.
  • ProbeRTT: If the $RTT_{min}$ estimate has not been refreshed for 10 seconds, it briefly reduces the window to just 4 packets to measure the true physical $RTT_{min}$.
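The model behind these states can be sketched in a few lines. This is a simplified illustration, not BBR's actual implementation: it assumes a purely time-based window for both filters, and the names WindowedFilter and bdp_target_bytes are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Tuple


class WindowedFilter:
    """Tracks the max (or min) of samples seen within a sliding time window."""

    def __init__(self, window_s: float, pick: Callable[[Iterable[float]], float]):
        self.window_s = window_s
        self.pick = pick  # pass the built-in max or min
        self.samples: Deque[Tuple[float, float]] = deque()

    def update(self, now_s: float, value: float) -> float:
        """Record a sample, expire old ones, and return the current estimate."""
        self.samples.append((now_s, value))
        while now_s - self.samples[0][0] > self.window_s:
            self.samples.popleft()
        return self.pick(v for _, v in self.samples)


def bdp_target_bytes(bw_max_bps: float, rtt_min_s: float, gain: float) -> float:
    """TargetValue = BW_max x RTT_min x Gain, converted from bits to bytes."""
    return bw_max_bps * rtt_min_s * gain / 8


if __name__ == "__main__":
    bw = WindowedFilter(window_s=1.0, pick=max)
    rtt = WindowedFilter(window_s=10.0, pick=min)
    bw_max = bw.update(0.0, 200e6)
    rtt_min = rtt.update(0.0, 0.040)
    for gain in (1.25, 0.75, 1.0):  # the ProbeBW gain cycle described above
        print(gain, bdp_target_bytes(bw_max, rtt_min, gain))
```

The max filter keeps a single slow sample from lowering the bandwidth estimate, and the min filter keeps queueing delay out of the propagation-delay estimate.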

4. Loss Recovery Forensics: Beyond Duplicate ACKs

Modern TCP stacks no longer rely solely on the simple "3 Duplicate ACKs" rule from the 1980s. Two critical technologies have revolutionized how TCP handles packet loss in high-stakes environments:

RACK (Recent ACKnowledgment)

RACK uses time instead of sequence numbers to detect loss. It infers that an unacknowledged packet is lost if a packet sent after it has already been acknowledged and more than a "reordering window" of time has elapsed beyond the expected round trip. This makes TCP much more resilient to reordered packets in multipath (ECMP) networks.
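A heavily simplified sketch of the time-based rule is shown below. It is loosely modeled on the RACK idea rather than on any kernel implementation; the Segment type, its field names, and the fixed reordering window are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    seq: int
    sent_at: float       # transmission time in seconds
    acked: bool = False


def rack_detect_losses(segments: List[Segment], now: float,
                       rtt: float, reo_wnd: float) -> List[int]:
    """Return sequence numbers that a RACK-style time rule would mark as lost."""
    acked_send_times = [s.sent_at for s in segments if s.acked]
    if not acked_send_times:
        return []  # nothing delivered yet, so there is no evidence of loss
    newest_delivered = max(acked_send_times)
    lost = []
    for s in segments:
        if s.acked:
            continue
        sent_before_delivered = s.sent_at < newest_delivered
        deadline_passed = now >= s.sent_at + rtt + reo_wnd
        if sent_before_delivered and deadline_passed:
            lost.append(s.seq)
    return lost


if __name__ == "__main__":
    flight = [Segment(1, 0.000), Segment(2, 0.001, acked=True), Segment(3, 0.002)]
    print(rack_detect_losses(flight, now=0.051, rtt=0.040, reo_wnd=0.010))
```

Segment 3 is not marked lost because nothing sent after it has been delivered. Only segments overtaken by a later delivery are candidates.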

Tail Loss Probe (TLP)

When a flow ends or drops packets at the very end of a window, there are no subsequent packets to trigger "Duplicate ACKs." Without help, the sender must wait for an RTO (Retransmission Timeout), which can take a second or more. TLP sends a "probe" packet after roughly two round-trip times to force an ACK, which can turn a 1-second timeout into a recovery on the order of 10 ms on a low-latency path.
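The probe timer can be approximated as follows. This sketch follows the general shape of the probe timeout (PTO) calculation in RFC 8985 as best recalled here. The function name, parameter names, and the 200 ms delayed-ACK allowance are assumptions, so treat it as illustrative only.

```python
from typing import Optional


def tlp_probe_timeout(srtt_s: Optional[float], segments_in_flight: int,
                      rto_remaining_s: float, max_ack_delay_s: float = 0.2) -> float:
    """Seconds to wait before sending a tail loss probe."""
    if srtt_s is None:
        pto = 1.0                    # no RTT sample yet: fall back to 1 second
    else:
        pto = 2 * srtt_s             # roughly two smoothed round trips
        if segments_in_flight == 1:
            pto += max_ack_delay_s   # allow for a delayed ACK on a lone segment
    return min(pto, rto_remaining_s)  # never fire later than the pending RTO


if __name__ == "__main__":
    # Hypothetical low-latency path with a 5 ms smoothed RTT.
    print(tlp_probe_timeout(0.005, segments_in_flight=3, rto_remaining_s=1.0))
```

With a 5 ms smoothed RTT and several segments in flight, the probe fires after about 10 ms instead of waiting for the full RTO.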

5. Kernel-Level Tuning: The Data Plane

If you are engineering for 100G or 400G NICs, the default Linux kernel settings will throttle you. You must tune the Socket Buffer Memory limits to accommodate the massive BDP.

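# /etc/sysctl.conf - Tuning for 100G LFN
# Comments are kept on their own lines, the only comment form sysctl.conf(5) documents.
# Maximum socket buffer sizes: 134217728 bytes = 128 MiB
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
# TCP autotuning limits: min, default, max (bytes)
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
net.ipv4.tcp_congestion_control = bbr
# fq provides packet pacing for BBR (required on kernels before 4.13)
net.core.default_qdisc = fq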
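A quick sanity check shows how far a given buffer ceiling stretches. This sketch (the function name is an assumption) computes the largest RTT at which a buffer still covers the BDP, using the 134217728-byte limit from the settings above.

```python
def max_rtt_for_buffer_s(buffer_bytes: int, rate_bps: float) -> float:
    """Largest RTT (seconds) at which buffer_bytes still holds a full BDP at rate_bps."""
    return buffer_bytes * 8 / rate_bps


if __name__ == "__main__":
    limit = 134217728  # bytes, the buffer ceiling configured above
    for rate in (10e9, 100e9, 400e9):
        rtt_ms = max_rtt_for_buffer_s(limit, rate) * 1e3
        print(f"{rate / 1e9:.0f} Gbps: buffer covers RTTs up to {rtt_ms:.2f} ms")
```

At 100 Gbps this ceiling covers a single flow only up to an RTT of about 10.7 ms. Longer paths need either larger limits or multiple parallel flows.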

6. Case Study: The "LFN" Satellite Challenge

Consider a LEO satellite link (e.g., Starlink) with a bandwidth of 200 Mbps and an RTT of 40ms.

  • BDP Calculation: $200 \times 10^6 \times 0.040 / 8 = 1{,}000{,}000$ bytes (1 MB).
  • The Forensic Problem: Rain fade causes a burst loss of 1%.
  • CUBIC Reaction: Detects loss, cuts window by 20%, takes $K$ seconds to recover. Throughput drops to $\approx 40$ Mbps.
  • BBR Reaction: Detects that RTT is stable and delivery rate is unchanged by the burst. It maintains the 200 Mbps pacing rate.

In this forensic scenario, BBR provides a 5x throughput advantage by correctly identifying that the loss was "stochastic" (noise) rather than "structural" (congestion).

Optimization is not a one-size-fits-all process. Understanding the physical constraints of your path—latency, jitter, and buffer depth—is the first step toward effective transport layer engineering.

Choosing the right congestion algorithm is a critical step in Reliability Engineering for long-distance data centers.

Kernel Bypass: DPDK, XDP, and the 10M PPS Frontier

The Linux kernel's networking stack, while robust, introduces 5-15 microseconds of latency per packet due to interrupt handling, socket buffer copies, and context switches. At 10 million packets per second (10 Mpps), this kernel tax becomes the dominant bottleneck, consuming multiple CPU cores just to process interrupts. Kernel bypass eliminates this overhead by allowing the application to read packets directly from the NIC's ring buffer without kernel intervention. The two dominant frameworks are DPDK (Data Plane Development Kit) and XDP (eXpress Data Path). DPDK polls the NIC in user space with zero-copy buffer management, with best-case figures of 80-100 Mpps on a single x86 core. The per-core packet rate is bounded by the cycle budget:

$$PPS_{max} = \frac{F_{core}}{C_{poll} + C_{process}}$$
  • $F_{core}$: CPU core frequency (Hz)
  • $C_{poll}$: Cycles per packet for DMA descriptor polling (40-80 cycles)
  • $C_{process}$: Cycles per packet for application processing
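The cycle budget is simple to work through. In this sketch the function names and the 3 GHz core with 60 polling cycles and 90 processing cycles are illustrative assumptions.

```python
def pps_max(core_hz: float, poll_cycles: float, process_cycles: float) -> float:
    """Packets per second one core can sustain for a given per-packet cycle cost."""
    return core_hz / (poll_cycles + process_cycles)


def cycle_budget(core_hz: float, target_pps: float) -> float:
    """Cycles available per packet when one core must sustain target_pps."""
    return core_hz / target_pps


if __name__ == "__main__":
    print(f"{pps_max(3e9, 60, 90) / 1e6:.0f} Mpps")        # hypothetical 3 GHz core
    print(f"{cycle_budget(3e9, 10e6):.0f} cycles/packet")  # budget at 10 Mpps
```

Read the other way, sustaining 10 Mpps on a 3 GHz core leaves about 300 cycles per packet for polling and application work combined.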

XDP, by contrast, attaches an eBPF (extended Berkeley Packet Filter) program directly to the NIC driver, running in the kernel's networking context before the SKB (Socket Buffer) allocation. This allows packet filtering, load balancing, and DDoS mitigation at line rate with minimal overhead. Cloudflare's XDP-based L4 load balancer processes 10 Mpps per core with 1-2 microsecond forwarding latency. The trade-off lies in resource use and programming model: DPDK requires dedicated CPU cores running at 100% poll loops, while XDP operates within the kernel's execution model.

Multipath TCP: Bandwidth Aggregation Across Disparate Paths

Multipath TCP (MPTCP, RFC 8684) allows a single TCP connection to use multiple network paths simultaneously, aggregating bandwidth and providing seamless failover. Unlike link aggregation (LAG), which operates at Layer 2 and requires the same physical media type, MPTCP works across heterogeneous interfaces: Wi-Fi + 5G, Ethernet + satellite, or any combination. The architecture adds a Path Manager sublayer below the TCP socket that creates and manages additional TCP subflows, each traversing a different network path. The key challenge is data sequencing: packets arriving out of order due to differing path delays must be reassembled in the correct application-level sequence, which MPTCP tracks with a connection-level Data Sequence Number (DSN) alongside each subflow's own sequence numbers. The resulting aggregate throughput can be modeled as:

$$R_{aggregate} = \frac{\sum_{i} R_i \cdot \Delta t_i}{\max_i(\Delta t_i)}$$
  • $R_i$: Throughput of subflow $i$
  • $\Delta t_i$: One-way delay of subflow $i$

If subflow A has 10 Mbps at 10 ms delay and subflow B has 10 Mbps at 100 ms delay, the effective aggregate throughput is approximately 11 Mbps, not 20 Mbps, because the receiver's reassembly buffer must hold packets from the fast path until the slow path catches up. The MPTCP congestion control algorithm (LIA, OLIA, or wVegas) couples the sending rates across subflows so that, on a shared bottleneck, the connection takes no more capacity than a single TCP flow would achieve on its best path, preventing MPTCP from being "unfair" to regular TCP flows. In the 2026 smartphone ecosystem, MPTCP is embedded in iOS and Android for seamless handover between cellular and Wi-Fi, reducing application-layer timeouts by 90% during mobility events.
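The aggregate model can be checked against the worked example with a short sketch. The function name and the tuple layout are illustrative assumptions.

```python
from typing import List, Tuple


def aggregate_throughput(subflows: List[Tuple[float, float]]) -> float:
    """Delay-weighted aggregate rate for (rate, one_way_delay_s) subflows."""
    slowest_delay = max(delay for _, delay in subflows)
    return sum(rate * delay for rate, delay in subflows) / slowest_delay


if __name__ == "__main__":
    # Subflow A: 10 Mbps at 10 ms; subflow B: 10 Mbps at 100 ms.
    print(aggregate_throughput([(10.0, 0.010), (10.0, 0.100)]))
    # Equal delays recover the plain sum of the rates.
    print(aggregate_throughput([(10.0, 0.050), (10.0, 0.050)]))
```

Under this model the penalty comes entirely from delay mismatch: two subflows with equal delays add up to the full 20 Mbps.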
