TCP Optimization: BBR vs. CUBIC

The Congestion Window (CWND)

Regardless of the algorithm, TCP uses a Congestion Window to limit how many packets can be 'in flight'—sent but not yet acknowledged.

AllowedData = \min(RWND, CWND)

TCP Window Scaling

[Figure: Stop-and-Wait vs. Sliding Window Pipeline, SERVER to CLIENT. Example readout: throughput 940 Mbps at 120 ms RTT.]

Pipeline (Sliding Window): The server fills the "pipe" with packets. It doesn't wait for ACKs to send the next packet. As long as the window is open, data flows continuously.

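The Receive Window (RWND) is the receiver's advertisement of how much buffer space it has available. In the original 1981 TCP specification (RFC 793), the window field is 16 bits wide, which caps the advertised window at 65,535 bytes (64 KB). On modern high-speed, high-latency paths this cap is a severe bottleneck, because the sender can have at most one window in flight per round trip.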
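To make the bottleneck concrete, the following is a minimal sketch of the window-limited ceiling (one window per round trip). The function name and the 120 ms RTT are illustrative assumptions, not values taken from a specification.

```python
def window_limited_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on throughput when at most one window is in flight per RTT."""
    return window_bytes * 8 / rtt_s


# Unscaled 16-bit window (65,535 bytes) over an assumed 120 ms round trip.
limit_bps = window_limited_throughput_bps(65_535, 0.120)
print(f"{limit_bps / 1e6:.2f} Mbps")  # prints "4.37 Mbps"
```

Under these assumptions the sender cannot exceed a few Mbps regardless of link speed, which is the gap that Window Scaling closes.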

1. Window Scaling Math ( $2^n$ )

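To overcome the 64 KB limit, RFC 1323 (since superseded by RFC 7323) introduced Window Scaling. Each side announces a bit-shift count in a TCP option carried on its SYN segment during the 3-way handshake, and the count stays fixed for the life of the connection.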

EffectiveWindow = AdvertisedWindow \times 2^{ScaleFactor}

With a maximum Scale Factor of 14, the TCP window can grow to about 1 GiB ( $65,535 \times 2^{14} = 1,073,725,440$ bytes). This is essential for filling the Bandwidth-Delay Product (BDP) on long-haul fiber links.
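As a rough illustration of the scaling math, the sketch below finds the smallest shift count whose scaled 16-bit window covers a given BDP. The helper names and the 10 Gbps / 100 ms example path are assumptions chosen for illustration.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> int:
    """Bandwidth-Delay Product in bytes."""
    return int(bandwidth_bps * rtt_s / 8)


def required_window_scale(target_bytes: int, max_shift: int = 14) -> int:
    """Smallest shift count such that 65,535 << shift covers target_bytes."""
    for shift in range(max_shift + 1):
        if 65_535 << shift >= target_bytes:
            return shift
    raise ValueError("target exceeds the largest scalable window (about 1 GiB)")


# Assumed example path: 10 Gbps with a 100 ms round trip.
bdp = bdp_bytes(10e9, 0.100)
print(bdp, required_window_scale(bdp))
```

For this example path the BDP is 125,000,000 bytes, so a shift of 11 is the first that fits.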

The Mathis Equation: The Theoretical Limit of TCP

In the presence of random packet loss (unrelated to congestion, such as bit errors on fiber or wireless interference), TCP throughput is mathematically bounded by the Mathis Equation. This model assumes a Reno-style sawtooth behavior but provides a critical forensic baseline for all loss-based algorithms.

Throughput \le \frac{MSS}{RTT \cdot \sqrt{p}} \cdot \sqrt{\frac{3}{2}}

$p$ : The packet loss probability (0.01 = 1% loss)
$MSS$ : Maximum Segment Size (typically 1460 bytes)
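The bound is easy to evaluate numerically. The sketch below is a direct transcription of the formula above; the function name and the example path (100 ms RTT, 0.01% loss) are illustrative assumptions.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_prob: float) -> float:
    """Mathis upper bound: (MSS / RTT) * sqrt(3/2) / sqrt(p), in bits per second."""
    return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5) / math.sqrt(loss_prob)


# Assumed example: 1460-byte MSS, 100 ms RTT, 0.01% random loss.
bound = mathis_throughput_bps(1460, 0.100, 0.0001)
print(f"{bound / 1e6:.1f} Mbps")  # about 14.3 Mbps
```

Comparing a measured rate against this number is a quick way to tell a loss-limited flow from a window-limited one.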

"Forensic Note: If your throughput is lower than the Mathis limit, you likely have a 'Window Limit' or an MTU black hole. If it matches, your bottleneck is the physical loss rate of the medium."

The difference between congestion algorithms lies in how they grow and shrink the Congestion Window (CWND) in response to the network's feedback loop.

2. CUBIC: Growing until Failure

CUBIC is a loss-based algorithm and the current default in Linux. It increases the window according to a cubic function of the time since the last congestion event ( $t$ ).

W(t) = C \cdot (t - K)^3 + W_{max}

$W_{max}$ : The window size before the last reduction.
$K$ : The time period it takes to reach $W_{max}$ again.
$C$ : A scaling constant (typically 0.4).

2.1. The Math of $K$ (Convergence Time)

CUBIC calculates the time $K$ it should take to reach the previous $W_{max}$ using the following derivation:

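K = \sqrt[3]{\frac{W_{max} \cdot (1 - \beta)}{C}}

Where $\beta$ is the multiplicative decrease factor, the fraction of $W_{max}$ that is kept after a loss (0.7 in RFC 8312 and current Linux kernels; early CUBIC versions kept 0.8). Setting $t = 0$ then gives $W(0) = \beta \cdot W_{max}$, the window immediately after the reduction. Because $K$ depends only on $W_{max}$, $\beta$, and $C$, the time to recover from a drop is independent of the RTT, which mitigates the "RTT Fairness" problem where short-latency flows starve long-latency flows.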
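The window curve can be sketched in a few lines. This example assumes the RFC 8312 convention, in which $\beta$ is the fraction of $W_{max}$ kept after a loss, so that $K = \sqrt[3]{W_{max}(1 - \beta)/C}$; windows are in segments and time is in seconds. The function names are hypothetical.

```python
def cubic_k(w_max: float, beta: float = 0.7, c: float = 0.4) -> float:
    """Seconds needed to climb back to w_max after a reduction to beta * w_max."""
    return (w_max * (1.0 - beta) / c) ** (1.0 / 3.0)


def cubic_window(t: float, w_max: float, beta: float = 0.7, c: float = 0.4) -> float:
    """CUBIC window W(t) = C * (t - K)^3 + W_max, t seconds after the last loss."""
    k = cubic_k(w_max, beta, c)
    return c * (t - k) ** 3 + w_max


w_max = 100.0  # segments, an arbitrary example value
k = cubic_k(w_max)
for t in (0.0, k / 2, k, 1.5 * k):
    print(f"t={t:5.2f}s  W={cubic_window(t, w_max):7.2f}")
```

The printed curve starts at $\beta \cdot W_{max}$, flattens as it approaches $W_{max}$ at $t = K$, and then accelerates again as it probes beyond the old maximum.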

The 1986 Congestion Collapse

"In October 1986, the throughput from LBL to UC Berkeley (a 400-yard distance) dropped from 32 Kbps to 40 bps. This was the first documented Congestion Collapse. Van Jacobson's subsequent implementation of Slow Start and Congestion Avoidance in BSD Unix saved the early internet from a complete death spiral."

3. BBR: Bottleneck Bandwidth & RTT

Developed by Google, BBR is model-based. Unlike CUBIC, which reacts to loss, BBR attempts to find the Kleinrock optimal operating point where the delivery rate is maximized and the round-trip time is minimized.

The BDP Control Loop

TargetValue = BW_{max} \times RTT_{min} \times Gain

BBR maintains a moving windowed max of delivery rate and a moving windowed min of RTT. It regulates the pacing rate and the congestion window to match the physical capacity of the pipe.

BBR iterates through four states to maintain its model:

Startup: Exponentially increases pacing rate until delivery rate plateaus (similar to Slow Start).
Drain: Lowers the rate to drain any queues built up during Startup.
ProbeBW: Most steady-state time. It cycles its gain (e.g., 1.25, 0.75, 1.0) to probe for bandwidth while clearing potential queues.
ProbeRTT: If the $RTT_{min}$ estimate has not been refreshed for 10 seconds, BBR reduces the window to just 4 packets for at least 200 ms to measure the true physical $RTT_{min}$.
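The control loop above can be approximated with a toy model. This is only a sketch of the idea: real BBR expires its samples by time (round trips for bandwidth, seconds for RTT), whereas this version keeps a fixed number of bandwidth samples and never ages out the RTT minimum. The class and method names are hypothetical.

```python
from collections import deque


class BbrModelSketch:
    """Toy path model: windowed-max delivery rate and running-min RTT."""

    def __init__(self, bw_window_samples: int = 10):
        # Simplifying assumption: window by sample count, not by time.
        self.bw_samples = deque(maxlen=bw_window_samples)
        self.rtt_min_s = float("inf")

    def on_ack(self, delivery_rate_bytes_per_s: float, rtt_s: float) -> None:
        """Feed one delivery-rate sample and one RTT sample into the model."""
        self.bw_samples.append(delivery_rate_bytes_per_s)
        self.rtt_min_s = min(self.rtt_min_s, rtt_s)

    def bdp_bytes(self) -> float:
        """Estimated BDP: BW_max * RTT_min."""
        return max(self.bw_samples) * self.rtt_min_s

    def target_bytes(self, gain: float) -> float:
        """TargetValue = BW_max * RTT_min * Gain."""
        return self.bdp_bytes() * gain


model = BbrModelSketch()
for rate, rtt in [(20e6, 0.045), (25e6, 0.040), (24e6, 0.052)]:
    model.on_ack(rate, rtt)

print(model.bdp_bytes())         # 25e6 B/s * 0.040 s, about 1 MB
print(model.target_bytes(1.25))  # probing phase of the ProbeBW gain cycle
```

Note that the queueing visible in the 52 ms sample does not move the estimate: the model keeps the best rate and the lowest RTT it has seen.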

4. Loss Recovery Forensics: Beyond Duplicate ACKs

Modern TCP stacks no longer rely solely on the simple "3 Duplicate ACKs" rule from the 1980s. Two critical technologies have revolutionized how TCP handles packet loss in high-stakes environments:

RACK-TLP (Recent ACK)

RACK uses time instead of sequence numbers to detect loss. It records when each packet was sent and, as ACKs arrive, infers that an unacknowledged packet is lost if a packet sent after it has already been acknowledged and more than a "reordering window" has passed beyond the expected round trip. This makes TCP much more resilient to reordered packets in multipath (ECMP) networks.
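The time-based rule can be sketched as follows. This is a simplified illustration loosely modeled on the detection step in RFC 8985, not a full implementation: it ignores sequence-number tie-breaking, retransmissions, and the adaptive sizing of the reordering window. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seq: int
    xmit_ts: float       # send time in seconds
    acked: bool = False


def rack_detect_loss(segments, rack_xmit_ts, rack_rtt, reo_wnd, now):
    """Return seqs of unacked segments that RACK-style timing marks as lost.

    rack_xmit_ts: send time of the most recently sent segment that was acked.
    rack_rtt:     RTT measured on that segment.
    reo_wnd:      reordering window (extra time allowed for late arrivals).
    """
    lost = []
    for seg in segments:
        if seg.acked:
            continue
        if seg.xmit_ts >= rack_xmit_ts:
            continue  # nothing sent later has been acked yet, so no evidence
        if seg.xmit_ts + rack_rtt + reo_wnd <= now:
            lost.append(seg.seq)
    return lost


segments = [
    Segment(seq=1, xmit_ts=0.000),
    Segment(seq=2, xmit_ts=0.001, acked=True),
    Segment(seq=3, xmit_ts=0.002, acked=True),
    Segment(seq=4, xmit_ts=0.050),
]
print(rack_detect_loss(segments, rack_xmit_ts=0.002, rack_rtt=0.040,
                       reo_wnd=0.010, now=0.055))
```

Segment 1 is flagged because later segments were acknowledged and its deadline has passed; segment 4 is not, because nothing sent after it has been acknowledged.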

Tail Loss Probe (TLP)

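When a flow ends or drops packets at the very end of a window, there are no subsequent packets to trigger "Duplicate ACKs." Without TLP, recovery has to wait for the RTO (Retransmission Timeout), which is commonly hundreds of milliseconds and can reach a second or more. TLP instead sends a "probe" packet after roughly two round-trip times to force an ACK, so on a low-latency path a timeout-scale stall can shrink to a recovery on the order of 10 ms.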
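The probe timer itself is simple arithmetic. The sketch below follows the general shape of the probe timeout (PTO) calculation described in RFC 8985, under the assumption of a 200 ms worst-case delayed-ACK allowance; the function and parameter names are hypothetical.

```python
def tlp_probe_timeout_s(srtt_s, flight_size_segments, rto_remaining_s,
                        worst_case_delayed_ack_s=0.200):
    """Time to wait before sending a tail loss probe."""
    if srtt_s is None:
        pto = 1.0  # no RTT sample yet: fall back to a conservative value
    else:
        pto = 2 * srtt_s
        if flight_size_segments == 1:
            # A lone segment may be held back by the receiver's delayed ACK.
            pto += worst_case_delayed_ack_s
    # Never schedule the probe later than the pending RTO.
    return min(pto, rto_remaining_s)


# Assumed example: 5 ms smoothed RTT, 10 segments in flight, RTO due in 1 s.
print(tlp_probe_timeout_s(0.005, 10, 1.0))
```

With a 5 ms smoothed RTT the probe goes out after about 10 ms, which is where the large gap between probe-based and timeout-based recovery comes from.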

5. Kernel-Level Tuning: The Data Plane

If you are engineering for 100G or 400G NICs, the default Linux kernel settings will throttle you. You must tune the Socket Buffer Memory limits to accommodate the massive BDP.

# /etc/sysctl.conf - Tuning for 100G LFN

# Maximum socket buffer sizes: 128 MB
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
# TCP buffers as min, default, max (bytes)
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
net.ipv4.tcp_congestion_control = bbr
# fq qdisc handles the packet pacing that BBR relies on
net.core.default_qdisc = fq
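A quick sanity check on these values is to ask how long a round trip a given buffer can cover at a given line rate. The sketch below is plain BDP arithmetic with a hypothetical function name; it treats the whole buffer as usable window, which is optimistic because the kernel reserves part of each socket buffer for bookkeeping overhead.

```python
def max_rtt_for_buffer_s(buffer_bytes: int, bandwidth_bps: float) -> float:
    """Longest RTT that buffer_bytes can keep full at bandwidth_bps."""
    return buffer_bytes * 8 / bandwidth_bps


buffer_bytes = 134_217_728  # the 128 MB maximum configured above
for gbps in (10, 100, 400):
    rtt_ms = max_rtt_for_buffer_s(buffer_bytes, gbps * 1e9) * 1e3
    print(f"{gbps:>3} Gbps: buffer covers up to {rtt_ms:.1f} ms RTT")
```

At 100 Gbps a 128 MB buffer covers only about 10.7 ms of round-trip time, so longer paths at that rate need either larger limits or several parallel flows.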

6. Case Study: The "LFN" Satellite Challenge

Consider a LEO satellite link (e.g., Starlink) with a bandwidth of 200 Mbps and an RTT of 40ms.

BDP Calculation: $200 \times 10^6 \times 0.040 / 8 = 1,000,000$ bytes (1 MB).
The Forensic Problem: Rain fade causes a burst loss of 1%.
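CUBIC Reaction: Treats each loss as congestion, cuts the window by 20% (or 30% with $\beta = 0.7$), and then needs roughly $K$ seconds to climb back to $W_{max}$. While losses keep arriving it never gets there, and in this scenario throughput drops to $\approx 40$ Mbps.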
BBR Reaction: Detects that RTT is stable and delivery rate is unchanged by the burst. It maintains the 200 Mbps pacing rate.

In this forensic scenario, BBR provides a **5x throughput advantage** by correctly identifying that the loss was "stochastic" (noise) rather than "structural" (congestion).
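To put a number on CUBIC's recovery time in this scenario, the sketch below converts the link's BDP into segments and applies the $K$ formula. It assumes a 1460-byte MSS, $C = 0.4$, a 20% cut (so 0.8 of the window is kept, matching the case study), and the convention $K = \sqrt[3]{W_{max}(1 - \beta)/C}$ with $\beta$ as the fraction kept. The function name is hypothetical.

```python
def cubic_recovery_time_s(bandwidth_bps: float, rtt_s: float,
                          mss_bytes: int = 1460,
                          beta: float = 0.8, c: float = 0.4) -> float:
    """Seconds CUBIC needs to regrow to a BDP-sized window after one loss."""
    bdp_bytes = bandwidth_bps * rtt_s / 8
    w_max_segments = bdp_bytes / mss_bytes
    return (w_max_segments * (1.0 - beta) / c) ** (1.0 / 3.0)


k = cubic_recovery_time_s(200e6, 0.040)
print(f"K = {k:.1f} s, or about {k / 0.040:.0f} round trips")
```

Under these assumptions a single loss costs roughly seven seconds of reduced throughput, so losses that arrive more often than that keep the window permanently below the BDP.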

Optimization is not a one-size-fits-all process. Understanding the physical constraints of your path—latency, jitter, and buffer depth—is the first step toward effective transport layer engineering.

Choosing the right congestion algorithm is a critical step in Reliability Engineering for long-distance data centers.

Kernel Bypass: DPDK, XDP, and the 10M PPS Frontier

The Linux kernel's networking stack, while robust, introduces 5-15 microseconds of latency per packet due to interrupt handling, socket buffer copies, and context switches. At 10 million packets per second (10 Mpps), this kernel tax becomes the dominant bottleneck, consuming multiple CPU cores just to process interrupts. Kernel bypass removes most of this overhead by letting the application read packets directly from the NIC's ring buffers without kernel intervention. The two dominant frameworks are DPDK (Data Plane Development Kit) and XDP (eXpress Data Path). DPDK polls the NIC from user space with zero-copy buffer management, and its per-core packet rate is bounded by the cycle budget per packet, which puts simple forwarding in the tens of Mpps on a single x86 core:

PPS_{max} = \frac{F_{core}}{C_{poll} + C_{process}}

$F_{core}$ : CPU core frequency (Hz)
$C_{poll}$ : Cycles per packet for DMA descriptor polling (40-80 cycles)
$C_{process}$ : Cycles per packet for application processing
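Plugging numbers into the formula shows how quickly application work eats the cycle budget. The 3 GHz clock and the cycle counts below are assumed example values, and the function name is hypothetical.

```python
def max_pps(core_hz: float, poll_cycles: float, process_cycles: float) -> float:
    """PPS_max = F_core / (C_poll + C_process)."""
    return core_hz / (poll_cycles + process_cycles)


core_hz = 3.0e9  # assumed 3 GHz core
for process_cycles in (0, 60, 140, 540):
    pps = max_pps(core_hz, poll_cycles=60, process_cycles=process_cycles)
    print(f"{process_cycles:>4} processing cycles -> {pps / 1e6:5.1f} Mpps")
```

Even with no application work at all, 60 polling cycles per packet cap this core at 50 Mpps; a few hundred cycles of processing bring it down to single-digit Mpps.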

XDP, by contrast, attaches a BPF (Berkeley Packet Filter) program directly to the NIC driver, where it runs in the kernel's networking context before the SKB (Socket Buffer) is allocated. This allows packet filtering, load balancing, and DDoS mitigation at close to line rate with minimal overhead. Cloudflare's XDP-based L4 load balancer reportedly processes 10 Mpps per core with 1-2 microsecond forwarding latency. The trade-off is operational as much as it is about programming complexity: DPDK requires dedicated CPU cores running 100% poll loops, while XDP operates within the kernel's execution model and its programs must pass the BPF verifier.

The Cache Coherency Tax

In DPDK deployments, the NIC writes incoming packets into host-memory buffers by DMA across the PCIe bus, and the polling cores read descriptors and packet data from those buffers rather than from memory on the NIC itself. If a buffer has to be fetched from DRAM, each cache miss costs on the order of 100 nanoseconds, and a packet that spans multiple cache lines (64 bytes each) multiplies the misses. Intel's DDIO (Data Direct I/O) lets the NIC place packet data directly in the CPU's L3 cache, so the polling core usually finds the packet already cached. However, DDIO competes for L3 capacity: at 100G line rate (148 Mpps for 64-byte packets), buffers that are written faster than they are consumed either spill to DRAM or evict the application's working set. The optimal configuration balances the cache space given to DDIO and the receive ring size against the application's memory footprint, often dedicating the NUMA node local to the NIC to network processing.
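The 148 Mpps figure follows from Ethernet framing: each frame carries 20 extra bytes on the wire (preamble, start-of-frame delimiter, and inter-frame gap). A minimal sketch of that arithmetic, with a hypothetical function name:

```python
def line_rate_pps(link_bps: float, frame_bytes: int, wire_overhead_bytes: int = 20) -> float:
    """Packets per second at line rate, counting preamble and inter-frame gap."""
    return link_bps / ((frame_bytes + wire_overhead_bytes) * 8)


pps = line_rate_pps(100e9, 64)
print(f"{pps / 1e6:.1f} Mpps, {1e9 / pps:.2f} ns per packet")
```

At 100G with minimum-size frames the budget is under 7 nanoseconds per packet across all cores, far less than a single DRAM access, which is why cache placement dominates the design.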

Multipath TCP: Bandwidth Aggregation Across Disparate Paths

Multipath TCP (MPTCP, RFC 8684) allows a single TCP connection to use multiple network paths simultaneously, aggregating bandwidth and providing seamless failover. Unlike link aggregation (LAG), which operates at Layer 2 and requires the same physical media type, MPTCP works across heterogeneous interfaces: Wi-Fi + 5G, Ethernet + satellite, or any combination. The architecture adds a Path Manager sublayer below the TCP socket that creates and manages additional TCP subflows, each traversing a different network path. The key challenge is data sequencing: each subflow has its own sequence space, so MPTCP adds a connection-level Data Sequence Number that lets the receiver reassemble, in the correct application-level order, packets that arrive out of order due to differing path delays. A simple model of the resulting aggregate throughput is:

R_{aggregate} = \frac{\sum_{i} R_i \cdot \Delta t_i}{\max_i(\Delta t_i)}

$R_i$ : Throughput of subflow $i$
$\Delta t_i$ : One-way delay of subflow $i$

If subflow A has 10 Mbps at 10 ms delay and subflow B has 10 Mbps at 100 ms delay, the model gives $(10 \cdot 10 + 10 \cdot 100) / 100 = 11$ Mbps of effective aggregate throughput, not 20 Mbps, because the receiver's reassembly buffer must hold packets from the fast path until the slow path catches up. The MPTCP congestion control algorithm (LIA, OLIA, or wVegas) couples the sending rates across subflows so that the total throughput at a shared bottleneck does not exceed what a single TCP flow would achieve on the best path, preventing MPTCP from being "unfair" to regular TCP flows. On smartphones, MPTCP is used for seamless handover between cellular and Wi-Fi (it ships in iOS; Android support depends on the vendor and kernel), with a reported 90% reduction in application-layer timeouts during mobility events.
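The aggregate-throughput model is straightforward to evaluate for other path mixes. The sketch below implements the formula as written in this section; the function name and the tuple layout are assumptions made for illustration.

```python
def mptcp_aggregate_rate(subflows):
    """R_aggregate = sum(R_i * dt_i) / max(dt_i).

    subflows: list of (rate, one_way_delay_s) tuples; the result is in the
    same unit as the rates.
    """
    weighted = sum(rate * delay for rate, delay in subflows)
    slowest = max(delay for _, delay in subflows)
    return weighted / slowest


# The example from the text: two 10 Mbps subflows at 10 ms and 100 ms.
print(mptcp_aggregate_rate([(10, 0.010), (10, 0.100)]))

# Matched delays recover the full sum in this model.
print(mptcp_aggregate_rate([(10, 0.050), (10, 0.050)]))
```

The first case evaluates to about 11 Mbps and the second to 20 Mbps, which illustrates why schedulers try to pair paths with similar delay.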
