In a Nutshell

In everyday language, 'Bandwidth' is often used synonymously with 'Speed.' In network engineering, however, bandwidth is merely the potential, while throughput is the reality. This article deconstructs the physical and protocol-level barriers that prevent theoretical capacity from reaching the end-user, covering the Shannon-Hartley theorem, Bandwidth-Delay Product, protocol overhead, Bufferbloat, and measurement methodology.

The Capacity Gap: Theoretical Potential vs. Goodput Reality

In consumer marketing, "Bandwidth" is often sold as a synonym for "Speed." In high-performance networking, this is a dangerous oversimplification. Bandwidth is merely the **Spectral Capacity** of the medium—the width of the pipe—while **Throughput** is the actual volume of data that successfully traverses that medium over time.

The gap between the two is defined by the "Header Tax," signal-to-noise dynamics, and the physics of the transport protocols. To understand why a 10Gbps link rarely delivers 10Gbps of application data, we must dissect the transmission into its constituent electromagnetic and protocol-level parts.

The Physical Limit: Shannon-Hartley Deep Dive

Every communication channel is bounded by the **Shannon-Hartley Theorem**, which defines the maximum amount of error-free information that can be transmitted over a bandwidth $B$ in the presence of noise $N$.

$$C = B \log_2\left(1 + \frac{S}{N}\right)$$

This equation tells us that capacity $C$ is a function of both the **Spectral Width** (Hertz) and the **Signal-to-Noise Ratio (SNR)**. If you widen the bandwidth but the noise power rises with it (common in copper cabling), the SNR falls and the capacity gain flattens out instead of scaling linearly. This is the fundamental constraint of **Physical Layer (L1) Mechanics**.
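As a rough illustration of the theorem, the following Python sketch evaluates $C$ for a given bandwidth and SNR. The function names and the example figures (a 20 MHz channel at 30 dB SNR) are assumptions chosen for illustration, not values from a specific system. The second helper assumes flat (white) noise with a fixed density, so that noise power grows with bandwidth.

```python
import math


def shannon_capacity_bps(bandwidth_hz: float, snr_linear: float) -> float:
    """Shannon-Hartley capacity C = B * log2(1 + S/N), in bits per second."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)


def db_to_linear(snr_db: float) -> float:
    """Convert an SNR in decibels to a linear power ratio."""
    return 10.0 ** (snr_db / 10.0)


def capacity_white_noise_bps(bandwidth_hz: float, signal_w: float,
                             noise_density_w_per_hz: float) -> float:
    """Capacity when noise power scales with bandwidth (N = N0 * B).

    Assumes flat (white) noise; real cabling is more complicated.
    """
    noise_w = noise_density_w_per_hz * bandwidth_hz
    return shannon_capacity_bps(bandwidth_hz, signal_w / noise_w)


if __name__ == "__main__":
    # Hypothetical 20 MHz channel at 30 dB SNR.
    c = shannon_capacity_bps(20e6, db_to_linear(30.0))
    print(f"20 MHz @ 30 dB: {c / 1e6:.1f} Mbps")

    # Doubling bandwidth at fixed signal power halves the SNR,
    # so capacity grows by less than 2x.
    n0 = 1.0 / (1000 * 20e6)  # noise density giving SNR = 1000 at 20 MHz
    c1 = capacity_white_noise_bps(20e6, 1.0, n0)
    c2 = capacity_white_noise_bps(40e6, 1.0, n0)
    print(f"capacity ratio after doubling bandwidth: {c2 / c1:.2f}")
```

Under these assumptions, doubling the spectral width yields noticeably less than double the capacity, which is the diminishing-returns effect the paragraph describes.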

The Header Tax: Dissecting Protocol Efficiency

To move application data, it must be encapsulated. Every layer of the OSI model adds its own "tax." On a standard Gigabit Ethernet link, the efficiency is strictly capped by the architecture of the **Ethernet Frame**.

The Anatomy of a 1500B MTU Tax

| Layer | Overhead Component | Bytes |
|---|---|---|
| Layer 1 (L1) | Preamble (7B) + SFD (1B) + Inter-Frame Gap (12B) | 20 |
| Layer 2 (L2) | MAC Header (14B) + FCS/CRC (4B) | 18 |
| Layer 3 (L3) | IPv4 Header (No options) | 20 |
| Layer 4 (L4) | TCP Header (No options) | 20 |

Total overhead is **78 bytes** per packet. For a 1500-byte MTU, the IP and TCP headers leave 1460 bytes of payload, so we can calculate the **Maximum Theoretical Efficiency**:

$$\eta_{efficiency} = \frac{Payload}{Payload + Overhead} = \frac{1460}{1460 + 78} \approx 94.9\%$$

This means a 1Gbps link can **never** deliver more than **949Mbps** of application data, even with zero latency and zero interference. If you add VLAN tagging (802.1Q), you lose another 4 bytes per frame. If you use MPLS labels, you lose another 4 bytes per label.
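The arithmetic above can be reproduced with a short Python sketch. The constants mirror the overhead table; the function name and the `vlan_tags` / `mpls_labels` parameters are hypothetical, and the sketch assumes the IP MTU stays at 1500 bytes while each tag or label adds 4 bytes on the wire.

```python
L1_OVERHEAD = 20   # preamble (7) + SFD (1) + inter-frame gap (12)
L2_OVERHEAD = 18   # MAC header (14) + FCS (4)
IPV4_HEADER = 20   # no options
TCP_HEADER = 20    # no options


def tcp_payload_efficiency(mtu: int = 1500, vlan_tags: int = 0,
                           mpls_labels: int = 0) -> float:
    """Fraction of wire bytes that carry TCP payload.

    Assumes each 802.1Q tag and each MPLS label adds 4 bytes on the wire
    without shrinking the IP MTU.
    """
    payload = mtu - IPV4_HEADER - TCP_HEADER
    wire_bytes = (mtu + L1_OVERHEAD + L2_OVERHEAD
                  + 4 * vlan_tags + 4 * mpls_labels)
    return payload / wire_bytes


if __name__ == "__main__":
    link_bps = 1_000_000_000
    for label, kwargs in [("plain", {}),
                          ("one VLAN tag", {"vlan_tags": 1}),
                          ("VLAN + 2 MPLS labels", {"vlan_tags": 1, "mpls_labels": 2})]:
        eff = tcp_payload_efficiency(**kwargs)
        print(f"{label:22s} {eff:.4f}  -> {eff * link_bps / 1e6:.1f} Mbps")
```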

The Long Fat Pipe: Bandwidth-Delay Product (BDP)

In a noise-free environment with zero header overhead, your throughput can still collapse due to the **Bandwidth-Delay Product (BDP)**. The BDP defines the "volume of the pipe"—the total amount of data that can be in flight between the sender and receiver at any given time.

$$BDP = Bandwidth \times RTT$$

The Satellite Link Trough

Consider a 10Gbps satellite link with a 600ms RTT. The BDP is:

$$10{,}000{,}000{,}000 \times 0.6 = 6{,}000{,}000{,}000 \text{ bits } (750 \text{ MB})$$

If the receiver's advertised **TCP Receive Window (RWIN)** is limited to the legacy maximum of 64KB, the sender will transmit 64KB and then stop, waiting 600ms for an ACK before sending the next 64KB. In this scenario, the effective throughput is a pathetic 64KB per 600ms, or roughly 874Kbps, leaving your 10Gbps link more than 99.99% idle.

To solve this, modern systems use **TCP Window Scaling (RFC 1323, since superseded by RFC 7323)**, allowing windows up to 1GB. However, even with scaling, a single packet loss on a "Long Fat Pipe" causes a loss-based TCP congestion window to be cut in half, leading to a long recovery time that drains throughput.
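The window-limited arithmetic in the satellite example can be checked with a minimal Python sketch. The function names are hypothetical. The exact throughput figure depends on whether "64KB" is read as 64,000 bytes or as the 65,535-byte maximum of the unscaled 16-bit window field, so both are evaluated.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-Delay Product in bytes."""
    return bandwidth_bps * rtt_s / 8


def window_limited_throughput_bps(window_bytes: float, rtt_s: float) -> float:
    """Upper bound on throughput when one window is sent per round trip."""
    return window_bytes * 8 / rtt_s


if __name__ == "__main__":
    link_bps, rtt_s = 10e9, 0.6  # the satellite link from the text

    print(f"BDP: {bdp_bytes(link_bps, rtt_s) / 1e6:.0f} MB")
    for window in (64_000, 65_535):
        rate = window_limited_throughput_bps(window, rtt_s)
        print(f"window {window} B -> {rate / 1e3:.0f} Kbps "
              f"({100 * rate / link_bps:.4f}% of the link)")
```

To fill the pipe, the window has to be at least as large as the BDP, which is far beyond what an unscaled 16-bit window field can express.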

Goodput: The Application's Perspective

**Goodput** is the metric that actually matters to the CEO and the end-user. It is the rate at which useful, non-duplicate application data is delivered. It is always less than throughput because it excludes retransmissions and protocol overhead.

$$Goodput = \frac{Original\ Data\ Size}{Total\ Time\ to\ Transfer}$$

The relationship between packet loss ($p$) and TCP throughput ($T$) is non-linear and brutal. According to the **Mathis Equation**, the maximum throughput of a TCP connection is inversely proportional to the square root of the loss rate:

$$T_{max} \le \frac{MSS}{RTT \sqrt{p}} \times C$$

Where $C$ is a constant (~1.22). This equation proves that on a high-latency link (large $RTT$), even a tiny loss rate ($p=0.001$) can cap your throughput at a fraction of your bandwidth. This is why "Bandwidth" upgrades are useless for fixing throughput issues caused by L1/L2 instability.
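To get a feel for how harsh the Mathis bound is, the following Python sketch evaluates it for a few loss rates. The function name and the example path (1460-byte MSS, 100ms RTT) are assumptions for illustration; the constant defaults to the 1.22 quoted above.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float,
                          c: float = 1.22) -> float:
    """Mathis upper bound: T <= (MSS / (RTT * sqrt(p))) * C, in bits/s."""
    if loss_rate <= 0:
        raise ValueError("loss_rate must be positive for the Mathis bound")
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss_rate)


if __name__ == "__main__":
    # Hypothetical path: 1460-byte MSS, 100 ms RTT.
    for p in (0.0001, 0.001, 0.01):
        bound = mathis_throughput_bps(1460, 0.1, p)
        print(f"loss {p:.4%}: at most {bound / 1e6:.2f} Mbps")
```

With these inputs, a 0.1% loss rate caps a single flow at only a few megabits per second, regardless of how much bandwidth the link offers.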

The MTU Leverage: Reducing the Interrupt Storm

The **Maximum Transmission Unit (MTU)** is the largest packet or frame size, specified in octets, that can be sent in a single network transaction. Standard Ethernet uses an MTU of 1500 bytes. In high-throughput environments like SANs (Storage Area Networks) or AI Compute Clusters, this is often too small.

The "Jumbo" Advantage

By increasing the MTU to **9000 bytes (Jumbo Frames)**, you reduce the number of packets required to move the same amount of data by 6x. This reduces the "Header Tax" and, more importantly, reduces the number of **CPU Interrupts** processed by the network interface card (NIC).
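A quick Python sketch makes the packet-count reduction concrete. It assumes IPv4 and TCP headers without options and the same 38 bytes of L1/L2 overhead per frame used in the overhead table earlier; the function names and the 1 GB transfer size are illustrative.

```python
import math

HEADERS = 40        # IPv4 (20) + TCP (20), no options
WIRE_OVERHEAD = 38  # L1 (20) + L2 (18) per frame


def packets_needed(data_bytes: int, mtu: int) -> int:
    """Number of full-size packets needed to carry data_bytes of payload."""
    return math.ceil(data_bytes / (mtu - HEADERS))


def payload_efficiency(mtu: int) -> float:
    """Fraction of wire bytes that carry TCP payload at this MTU."""
    return (mtu - HEADERS) / (mtu + WIRE_OVERHEAD)


if __name__ == "__main__":
    data = 10**9  # 1 GB transfer, chosen for illustration
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: {packets_needed(data, mtu):,} packets, "
              f"efficiency {payload_efficiency(mtu):.2%}")
```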

The "Fragmentation" Trap

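If an MTU mismatch occurs (e.g., a 9000B packet hits a 1500B router interface), an IPv4 router must fragment the packet. This consumes CPU cycles and increases latency. If the "Don't Fragment" (DF) bit is set, the packet is dropped and the router returns an "ICMP Destination Unreachable (Fragmentation Needed)" message; if that message is filtered along the path, the sender never learns the smaller MTU and the connection stalls.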

Engineering Encyclopedia

BDP (Bandwidth-Delay Product)

The total volume of data that can be "in flight" on a link, calculated as bandwidth multiplied by RTT.

CIR (Committed Information Rate)

The average rate of traffic that a provider guarantees will be delivered across their network.

Goodput

The quantity of useful information delivered per unit of time to a specific application, excluding protocol overhead and retransmissions.

IFG (Inter-Frame Gap)

The idle time between Ethernet frames (standard 96 bit times) required for receiver synchronization and L1 stability.

MSS (Maximum Segment Size)

The largest amount of data that a device can receive in a single TCP segment, usually MTU minus IP and TCP headers.

MTU (Maximum Transmission Unit)

The size of the largest protocol data unit (PDU) that can be communicated in a single network layer transaction.

Preamble

A sequence of bits (usually 56 bits) used to synchronize the clock of the receiver before the actual frame data arrives.

RWIN (Receive Window)

The amount of data a receiver is willing to buffer for a connection; acts as a flow control mechanism.

Shannon Capacity

The theoretical maximum bit rate of a communication channel for a given noise level.

TCP Window Scaling

An option that scales the 16-bit window size field by a negotiated shift factor (up to 14 bits), raising the maximum window from 64KB to 1GB.

Throughput

The amount of data moved successfully from one place to another in a given time period.

Utilization

The percentage of the available bandwidth currently being used by traffic.

The Engineering Standard: RFC 6349 Methodology

Standard "Speed Tests" are virtually useless for infrastructure troubleshooting because they conflate application performance with link capacity. **RFC 6349** provides a rigorous framework for TCP throughput testing:

  • Step 1: Path MTU Discovery. Ensure the test is using the actual MTU of the path to avoid fragmentation overhead.
  • Step 2: Baseline RTT. Measure the round-trip time under zero load to calculate the ideal BDP.
  • Step 3: TCP Window Optimization. Force the host to use a window size $\ge BDP$.
  • Step 4: Concurrent Flows. Use enough parallel streams to saturate the ASIC pathways without causing congestion collapse.

Engineering Conclusion

Bandwidth is the road; throughput is the traffic that actually moves. Every physical characteristic of the network—noise, distance, cable quality—reduces your headroom from Shannon's theoretical limit. Every protocol layer adds an additional tax.

A master performance engineer does not "upgrade" until they have measured the **Goodput Efficiency**. If your efficiency is below 90%, you don't have a bandwidth problem; you have a protocol, windowing, or stability problem. Solving those is the difference between a technician and an engineer.

Advanced Queue Disciplines: The Router's Role in Throughput Collapse

Even with optimal physical-layer configuration and proper TCP window scaling, the router's queue discipline (qdisc) can single-handedly decimate throughput. The qdisc is the packet scheduling algorithm that determines which packet gets transmitted next when the output interface is congested. The traditional default Linux qdisc, `pfifo_fast`, is a simple First-In-First-Out (FIFO) queue with three priority bands that lacks any active queue management (AQM). In a FIFO queue, when the buffer fills, newly arriving packets are simply dropped at the tail (a behavior known as Tail Drop), which causes TCP's congestion control algorithm to discover loss only after the buffer has already bloated to its maximum capacity. This fundamental mismatch between buffer sizing and TCP's window dynamics is the root cause of the Bufferbloat phenomenon.

The relationship between buffer size and throughput under Tail Drop is governed by the interaction between TCP's additive-increase-multiplicative-decrease (AIMD) algorithm and the buffer depth. The average queue occupancy in a Tail Drop system can be expressed as:

$$q_{avg} = B_{size} - \frac{BDP}{2}$$

where:

  • $B_{size}$: Buffer capacity in bytes
  • $BDP$: Bandwidth-Delay Product in bytes
  • $q_{avg}$: Average queue occupancy in bytes

When $B_{size} \gg BDP$, the buffer dominates the end-to-end latency, and throughput oscillates between the link rate during window growth and half the link rate after a loss event. The mean throughput under Tail Drop is approximately:

$$T_{mean} = C_{link} \times \left(1 - \frac{BDP}{2 \cdot B_{size}}\right)$$

This equation reveals a counterintuitive truth: increasing buffer size without bound does not increase throughput—it merely increases latency. Once the buffer exceeds the BDP, throughput asymptotically approaches the link rate, but at the cost of latency that grows linearly with buffer size. This is the Bufferbloat tradeoff that AQM algorithms aim to resolve.
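The tradeoff can be tabulated with a small Python sketch that takes the approximation above at face value for buffers at least as large as the BDP. The function names and the example link (100Mbps, 50ms RTT) are assumptions for illustration, and the delay figure is simply the time needed to drain a completely full buffer at line rate.

```python
def mean_throughput_bps(link_bps: float, bdp_bytes: float,
                        buffer_bytes: float) -> float:
    """Tail Drop approximation from the text: C * (1 - BDP / (2 * B))."""
    return link_bps * (1 - bdp_bytes / (2 * buffer_bytes))


def full_buffer_delay_s(buffer_bytes: float, link_bps: float) -> float:
    """Queueing delay added when the buffer is completely full."""
    return buffer_bytes * 8 / link_bps


if __name__ == "__main__":
    link_bps, rtt_s = 100e6, 0.05          # hypothetical 100 Mbps, 50 ms path
    bdp = link_bps * rtt_s / 8             # 625,000 bytes

    for multiple in (1, 2, 10, 100):
        buf = multiple * bdp
        tput = mean_throughput_bps(link_bps, bdp, buf)
        delay = full_buffer_delay_s(buf, link_bps)
        print(f"buffer = {multiple:3d} x BDP: "
              f"{tput / 1e6:6.2f} Mbps, up to {delay * 1e3:7.0f} ms added delay")
```

In this model the throughput gain flattens quickly while the worst-case added delay keeps growing in proportion to the buffer size.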

Active Queue Management: From RED to CoDel

Active Queue Management (AQM) algorithms address Tail Drop pathology by dropping packets proactively to signal TCP before the buffer is completely full. The classic Random Early Detection (RED) algorithm computes a drop probability based on the exponentially weighted moving average of the queue depth:

$$p_{drop} = \begin{cases} 0 & q_{avg} \leq t_{min} \\ \frac{q_{avg} - t_{min}}{t_{max} - t_{min}} \cdot p_{max} & t_{min} < q_{avg} < t_{max} \\ 1 & q_{avg} \geq t_{max} \end{cases}$$
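The piecewise rule above translates directly into code. The following Python sketch is a simplified illustration rather than a full RED implementation: the thresholds in the example are arbitrary, and the EWMA weight of 0.002 is an assumed illustrative value.

```python
def ewma_queue_depth(prev_avg: float, sample: float,
                     weight: float = 0.002) -> float:
    """Exponentially weighted moving average of the queue depth.

    The weight is an assumed illustrative value, not a recommendation.
    """
    return (1 - weight) * prev_avg + weight * sample


def red_drop_probability(q_avg: float, t_min: float, t_max: float,
                         p_max: float) -> float:
    """RED drop probability as a function of the averaged queue depth."""
    if q_avg <= t_min:
        return 0.0
    if q_avg >= t_max:
        return 1.0
    return (q_avg - t_min) / (t_max - t_min) * p_max


if __name__ == "__main__":
    # Hypothetical thresholds, in packets.
    t_min, t_max, p_max = 20, 60, 0.1
    for q in (10, 20, 40, 59, 60):
        p = red_drop_probability(q, t_min, t_max, p_max)
        print(f"q_avg = {q:2d} -> p_drop = {p:.3f}")
```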

RED smooths the congestion signal across flows, preventing the global TCP synchronization problem where all flows simultaneously detect loss and halve their windows, causing a throughput collapse. However, RED requires careful tuning of $t_{min}$, $t_{max}$, $p_{max}$, and the queue weight factor, making it fragile in heterogeneous environments. The CoDel (Controlled Delay) algorithm, introduced by Nichols and Jacobson in 2012, eliminates parameter tuning by measuring packet sojourn time rather than queue depth. CoDel maintains a target delay of 5ms and drops packets according to a square-root control law when the minimum sojourn time exceeds this target:

$$T_{next\_drop} = T_{last\_drop} + \frac{100\,\text{ms}}{\sqrt{count}}$$

The square-root control law is mathematically elegant: it responds aggressively to persistent congestion (where count grows and the inter-drop interval shrinks) while remaining transparent to brief microbursts that complete within the 100ms control interval. FQ-CoDel extends CoDel with per-flow queuing, ensuring that a single aggressive flow cannot starve others. Measurements from real-world deployments show that FQ-CoDel reduces the 99th percentile flow completion time for short flows by up to 80% compared to Tail Drop, while reducing bulk throughput by less than 5%—a critical improvement for mixed-traffic environments.
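The shrinking inter-drop interval is easy to see numerically. This Python sketch only iterates the control law above; it is not a CoDel implementation (it omits sojourn-time tracking and the logic for entering and leaving the dropping state), and the function name is hypothetical.

```python
import math


def codel_drop_schedule(first_drop_s: float, n_drops: int,
                        interval_s: float = 0.1) -> list[float]:
    """Drop times produced by T_next = T_last + interval / sqrt(count).

    count is the number of drops made so far in the current dropping state.
    """
    times = [first_drop_s]
    for count in range(1, n_drops):
        times.append(times[-1] + interval_s / math.sqrt(count))
    return times


if __name__ == "__main__":
    schedule = codel_drop_schedule(0.0, 6)
    gaps = [b - a for a, b in zip(schedule, schedule[1:])]
    for t, gap in zip(schedule[1:], gaps):
        print(f"drop at {t * 1e3:6.1f} ms (gap {gap * 1e3:5.1f} ms)")
```

Each successive gap is shorter than the last, so the drop rate ramps up gradually for as long as the congestion persists.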

Hardware Offload and Qdisc Bypass

An increasingly critical concern in high-speed networking is that NIC hardware offloads (TSO/GRO) can effectively bypass the software qdisc. When TCP Segmentation Offload (TSO) is active, the kernel passes super-sized segments (up to 64KB) to the NIC, which splits them into MTU-sized packets after the qdisc has made its scheduling decision. The AQM algorithm is effectively blind to individual packets, operating on TSO segments that are up to 44x larger than the actual wire packets. Recent kernel work has introduced "segmentation-aware" qdiscs that peek inside TSO segments, but these remain experimental. The throughput engineer must verify qdisc visibility by checking `tc -s qdisc show dev eth0`. If the dropped counter remains at zero under sustained load, the qdisc is likely being bypassed, and your throughput is being shaped solely by the NIC's internal ring buffer operating in Tail Drop mode.

Precision Throughput Measurement: From Wire to Application

Measuring throughput is deceptively simple: send data, measure time, divide. In practice, the methodology chosen determines whether the result reflects link capacity, protocol efficiency, or application performance—and conflating these three is the most common error in network troubleshooting. The International Telecommunication Union's Y.1564 standard defines a multi-layer testing framework that separates these concerns, beginning with Layer 2 throughput (the raw bit-carrying capacity of the medium) and progressing through Layer 3 (IP forwarding rate), Layer 4 (TCP goodput), and Layer 7 (application-level throughput).

Each layer introduces its own measurement artifacts. At Layer 2, the throughput calculation must account for the Inter-Frame Gap (IFG) of 96 bit times (12 byte times, or 96ns at 1Gbps) and the preamble plus SFD (8 bytes), which are invisible to higher-layer tests. The maximum achievable L2 throughput on Ethernet is:

$$T_{L2} = C_{line} \times \frac{L_{frame}}{L_{frame} + 20}$$

where:

  • $20$: Preamble (8) + IFG (12) in bytes
  • $C_{line}$: Line rate (e.g., 1,000,000,000 bps)
  • $L_{frame}$: Frame size including FCS (64-1518 bytes)

This means that even for maximum-sized frames (1518 bytes), the L2 efficiency is only 98.7%, and for minimum-sized frames (64 bytes), it collapses to 76.2%. This is not a performance problem—it is a fundamental constraint of the Ethernet protocol. The iPerf3 tool, the de facto standard for network throughput testing, defaults to Layer 4 (TCP) measurement, which introduces additional overhead from headers, congestion control, and the three-way handshake. Running iPerf3 in UDP mode removes the congestion control variable but introduces packet loss as a measurement artifact, since UDP has no retransmission mechanism.
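The two efficiency figures quoted above follow directly from the L2 formula, as this Python sketch shows. The function names are hypothetical; the 20-byte gap is the preamble plus IFG from the formula's definition.

```python
PREAMBLE_AND_IFG = 20  # preamble/SFD (8) + inter-frame gap (12), in bytes


def l2_throughput_bps(line_bps: float, frame_bytes: int) -> float:
    """Maximum Layer 2 throughput for back-to-back frames of one size."""
    return line_bps * frame_bytes / (frame_bytes + PREAMBLE_AND_IFG)


def frames_per_second(line_bps: float, frame_bytes: int) -> float:
    """Maximum frame rate for back-to-back frames of one size."""
    return line_bps / ((frame_bytes + PREAMBLE_AND_IFG) * 8)


if __name__ == "__main__":
    line_bps = 1_000_000_000
    for size in (64, 512, 1518):
        eff = l2_throughput_bps(line_bps, size) / line_bps
        fps = frames_per_second(line_bps, size)
        print(f"{size:4d}-byte frames: {eff:.1%} efficient, {fps:,.0f} frames/s")
```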

RFC 6349: The Standard for TCP Throughput Testing

Standard "speed tests" are virtually useless for infrastructure troubleshooting because they conflate application performance with link capacity. RFC 6349 provides a rigorous framework that isolates these variables through a four-step methodology. Step 1 performs Path MTU Discovery to ensure the test uses the actual path MTU, avoiding fragmentation overhead. Step 2 establishes a baseline RTT under zero load, which is used to compute the ideal BDP. Step 3 configures the TCP buffer (SO_RCVBUF/SO_SNDBUF) to be at least as large as the BDP. Step 4 runs multiple concurrent streams to saturate the link without causing congestion collapse—typically four to eight streams for a 10Gbps link, depending on the NIC's RSS (Receive Side Scaling) configuration.

The output of an RFC 6349 test produces a "Throughput Efficiency Ratio" and a "Buffer Delay Metric" that quantify how close the connection comes to theoretical performance:

$$\eta_{RFC6349} = \frac{T_{measured}}{BDP / RTT}$$

A ratio below 0.9 indicates either buffer misconfiguration, excessive packet loss, or suboptimal window scaling. When the measured throughput deviates from the BDP-derived ideal, the root cause is almost never "not enough bandwidth" and almost always a protocol or configuration issue at L3 or L4.
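As a minimal sketch of how the ratio might be evaluated from test results, the following Python snippet applies the formula above and the 0.9 threshold. The function names and the example measurements are hypothetical; the snippet does not run a test itself.

```python
def ideal_throughput_bps(bdp_bits: float, rtt_s: float) -> float:
    """Ideal throughput implied by the BDP and baseline RTT (BDP / RTT)."""
    return bdp_bits / rtt_s


def throughput_efficiency(measured_bps: float, bdp_bits: float,
                          rtt_s: float) -> float:
    """Ratio of measured throughput to the BDP-derived ideal."""
    return measured_bps / ideal_throughput_bps(bdp_bits, rtt_s)


if __name__ == "__main__":
    # Hypothetical results: 1 Gbps bottleneck, 40 ms baseline RTT.
    rtt_s = 0.040
    bdp_bits = 1e9 * rtt_s
    for measured in (950e6, 850e6, 400e6):
        ratio = throughput_efficiency(measured, bdp_bits, rtt_s)
        verdict = "ok" if ratio >= 0.9 else "investigate buffers, loss, window scaling"
        print(f"{measured / 1e6:.0f} Mbps -> ratio {ratio:.2f} ({verdict})")
```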

Practical Tooling: iPerf3, ntttcp, and Application-Aware Monitoring

iPerf3 is the workhorse of TCP throughput testing, but its default parameters produce misleading results. The default test duration (10 seconds) is insufficient for high-BDP links, where TCP slow start may not reach steady state before the test ends. For a link with 200ms RTT, the TCP congestion window grows by one MSS per RTT during the congestion avoidance phase, requiring:

$$t_{steady} = \frac{BDP}{MSS} \times RTT$$

For a 10Gbps link with 200ms RTT and 1460-byte MSS, the BDP is 250MB, or about 171,000 segments, so $t_{steady}$ under purely linear growth is approximately 34,000 seconds (more than nine hours). Slow start and modern congestion control algorithms shorten this ramp considerably, but a 10-second iPerf3 test still captures mostly the ramp-up phase, and its result does not reflect sustainable throughput. The `-t 300` flag (5-minute test) is the minimum recommended duration for high-latency links.

On Windows platforms, Microsoft's ntttcp (NT TCP/IP Test) provides more granular control over buffer sizes, concurrent threads, and CPU affinity. It also reports per-connection CPU utilization, enabling the calculation of throughput-per-CPU-cycle efficiency—a critical metric in virtualized environments where CPU oversubscription constrains throughput independently of link capacity. For application-aware monitoring, NetFlow/IPFIX and sFlow provide sampled packet analysis that reveals throughput by application, source, and destination. The key metric from flow data is not peak throughput but the "95th percentile sustained rate," which determines whether the observed throughput pattern matches the application's expected profile or indicates congestion-induced throttling.

The final and most often overlooked measurement is the "Application Goodput Ratio"—the ratio of application-layer data to total bytes transmitted. This can be derived from flow data:

$$G_{app} = \frac{Bytes_{payload}}{Bytes_{total}} = \eta_{L2} \times \eta_{L3} \times \eta_{L4} \times \eta_{retransmit}$$

Where $\eta_{retransmit}$ accounts for retransmitted bytes. A ratio below 0.85 suggests either excessive overhead (small packets, many connections) or a high retransmission rate, both of which are actionable engineering signals. Throughput is not a single number; it is a stack of nested measurements, and the engineer's art lies in knowing which layer to measure and how to interpret the result.
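One way to see how the per-layer factors multiply is the Python sketch below. How the overhead is split between $\eta_{L2}$, $\eta_{L3}$, and $\eta_{L4}$ is an assumption made here for illustration (the text does not define each factor), as are the function names and the 2% retransmission figure.

```python
import math


def layer_efficiencies(mtu: int = 1500,
                       retransmit_fraction: float = 0.0) -> dict[str, float]:
    """Per-layer efficiency factors for full-size IPv4/TCP packets.

    Assumed split: L2 = frame vs. frame + preamble/IFG, L3 = IP packet vs.
    Ethernet frame, L4 = TCP payload vs. IP packet (headers without options).
    """
    frame = mtu + 18  # MAC header (14) + FCS (4)
    return {
        "l2": frame / (frame + 20),
        "l3": mtu / frame,
        "l4": (mtu - 40) / mtu,
        "retransmit": 1.0 - retransmit_fraction,
    }


def app_goodput_ratio(factors: dict[str, float]) -> float:
    """Product of the per-layer efficiency factors."""
    return math.prod(factors.values())


if __name__ == "__main__":
    for retx in (0.0, 0.02, 0.10):
        ratio = app_goodput_ratio(layer_efficiencies(retransmit_fraction=retx))
        flag = "" if ratio >= 0.85 else "  <- below 0.85"
        print(f"retransmit {retx:.0%}: goodput ratio {ratio:.3f}{flag}")
```

With no retransmissions the product collapses to 1460/1538, the same 94.9% ceiling derived from the header table, which is a useful sanity check on the chosen split.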


Technical Standards & References

[ref-1] C. E. Shannon (1949). Communication in the presence of noise. Proceedings of the IRE.
[ref-2] B. Constantine et al. (2011). RFC 6349: Framework for TCP Throughput Testing. IETF.
[ref-3] Ilya Grigorik (2013). High Performance Browser Networking. O'Reilly Media.