In a Nutshell

In the hyperscale era of distributed AI, the ability to predict "Effective Throughput" (Goodput) is the difference between a successful training run and a multi-million dollar idle-GPU catastrophe. RDMA (Remote Direct Memory Access) promises near-line-rate performance, but its actual output is governed by the rigid physics of **Bandwidth-Delay Product (BDP)**, the serialization overhead of **RETH/BTH headers**, and the internal bandwidth limits of the **PCIe bus**. This article provides a rigorous mathematical framework for predicting RDMA throughput across various fabric topologies, from intra-rack NVLink to global routed RoCE v2.


RDMA Throughput & Efficiency Predictor

A high-fidelity modeler for simulating RDMA goodput based on link attributes, delay, and protocol overhead.

Example output (default configuration, RDMA Write):

  • Effective Throughput: 387.97 Gbps (one-way, fire-and-forget)
  • Operation Latency: 9.81 µs (round-trip time to completion)
  • Wire Efficiency: 99.0%
  • Packets Required: 17
  • Goodput: 48,496.8 MB/s
  • CPU Cycles Saved: 3,342,336

"RDMA eliminates the CPU from the data path: no memcpy, no context switches, no kernel crossings."


1. The Zero-Copy Mandate: CPU-Bypass Physics

To understand RDMA throughput, one must first understand the bottleneck it removes: the OS kernel. In standard TCP, a single flow rarely approaches 400 Gbps because the CPU must copy every buffer and service per-packet interrupts. RDMA moves the entire transport logic (segmentation, retransmission, and flow control) into the **NIC ASIC**.

Hardware-Offload Metrics

Because the CPU is bypassed, throughput is limited only by the **HBM/DRAM** bandwidth of the server and the **PCIe link** speed. On an NVIDIA H100 node, the RDMA NIC (ConnectX-7) can pull data directly from GPU memory at 400 Gbps with negligible impact on the training workload.

  • CPU load: 50%+ (TCP) vs. <1% (RDMA)
  • Latency: ~50 µs (TCP) vs. ~1 µs (RDMA)
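As a rough illustration, the CPU cost of the two paths can be modeled in a few lines of Python. The per-byte copy cost, per-packet interrupt cost, and doorbell cost below are illustrative assumptions, not measured values:

```python
# Hypothetical model of CPU cycles spent moving one message via kernel
# TCP versus RDMA. All cycle constants are illustrative assumptions.

def tcp_cpu_cycles(msg_bytes, mtu=1500, cycles_per_byte_copy=2,
                   cycles_per_packet=5000):
    """Kernel TCP: two copies (user->kernel, kernel->NIC ring) plus
    per-packet interrupt/softirq processing."""
    packets = -(-msg_bytes // mtu)          # ceiling division
    return 2 * msg_bytes * cycles_per_byte_copy + packets * cycles_per_packet

def rdma_cpu_cycles(doorbells=1, cycles_per_doorbell=500):
    """RDMA: the CPU only rings a doorbell; the NIC ASIC handles
    segmentation, DMA, and completion."""
    return doorbells * cycles_per_doorbell

msg = 1 << 20                               # 1 MiB message
saved = tcp_cpu_cycles(msg) - rdma_cpu_cycles()
print(f"Estimated CPU cycles saved per 1 MiB message: {saved:,}")
# → Estimated CPU cycles saved per 1 MiB message: 7,693,804
```

The exact constants vary wildly by platform; the point is the structure of the model: TCP cost scales with message size and packet count, RDMA cost is a near-constant doorbell write.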

2. The BDP Constraint: Data in the Pipe

The Bandwidth-Delay Product (BDP) defines the amount of data required to fill the "pipe." If the sender exhausts its window or credits before the first ACK/credit arrives, the link goes idle.

BDP Equation

BDP = Bandwidth × RTT

Example: 400 Gbps × 10 µs = 500 KB

In RDMA, the "Window Size" is controlled by **Credits**. If the receiving NIC doesn't have at least 500KB of buffer space dedicated to this flow, the throughput will drop. In large-scale AI fabrics, managing these credits per-QP (Queue Pair) is the "Art of Fabric Optimization."

As distance increases (e.g., between data center zones), the RTT grows. For a 1 km fiber link, the BDP of an 800 Gbps fabric exceeds 2 MB once switch and NIC latency are added to fiber propagation. Most standard NICs default to smaller credit pools, causing a sharp performance collapse over "long" distances unless tuned.
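The BDP arithmetic is simple enough to sketch directly. In the second example below, the 20 µs RTT (fiber propagation plus switch/NIC latency) is an assumption:

```python
def bdp_bytes(bandwidth_gbps, rtt_us):
    """Bytes that must be in flight to keep the pipe full:
    Gbps x us -> kilobits, then / 8 -> bytes."""
    return bandwidth_gbps * rtt_us * 1e3 / 8

# Worked example from the text: 400 Gbps x 10 us RTT -> 500 KB.
print(bdp_bytes(400, 10))   # → 500000.0

# 800 Gbps over ~1 km: assumed ~20 us RTT pushes the BDP to 2 MB.
print(bdp_bytes(800, 20))   # → 2000000.0
```

The receiver must advertise at least this many bytes of credit per flow, or the sender stalls before the first credit returns.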

3. Credit-Based Logic: The Silence of the Wire

Unlike TCP, which uses a "dropping" signal to manage speed (Congestion Avoidance), InfiniBand and RoCE use a deterministic **Credit-Based** system.

Credit Advertisement

The receiver tells the sender: "I have space for 100 packets." The sender decrements this counter for every packet sent. When it hits zero, it stops—immediately. No dropped packets, no retransmissions.

Throughput Saw-Tooth

If credits are returned slowly (due to CPU stall at the receiver), the sender pulses. This creates a "Saw-Tooth" throughput pattern that is common in fragmented RDMA workloads.
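A toy Python simulation makes the pulsing visible. The tick granularity, credit batch size, and send rate are all illustrative assumptions:

```python
# Toy credit-based flow control: the sender transmits only while it
# holds credits and stops immediately at zero. A receiver that returns
# credits in slow batches produces the pulsing "saw-tooth" send pattern.

def simulate(ticks, initial_credits, send_per_tick, credit_batch,
             batch_every):
    credits, history = initial_credits, []
    for t in range(1, ticks + 1):
        sent = min(credits, send_per_tick)   # stop immediately at zero
        credits -= sent
        if t % batch_every == 0:
            credits += credit_batch          # receiver returns a batch
        history.append(sent)
    return history

# Sender can emit 40 units/tick; receiver returns 80 credits every 4 ticks.
history = simulate(ticks=12, initial_credits=100, send_per_tick=40,
                   credit_batch=80, batch_every=4)
print(history)
# → [40, 40, 20, 0, 40, 40, 0, 0, 40, 40, 0, 0]
```

The bursts of 40 followed by forced silence are the saw-tooth: average throughput collapses to the credit-return rate, not the link rate.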

4. Hardware Limits: The PCIe Gen5 Bottleneck

The network link is rarely the true bound. Modern 400G and 800G NICs are constrained by the server's internal architecture.

  • PCIe Bandwidth Stealing

    A Gen5 x16 slot provides ~63 GB/s (unidirectional). A 400 Gbps NIC consumes 50 GB/s of that. If the system also pushes NVMe storage or other accelerator traffic through the same PCIe root complex, contention can cut RDMA throughput by 10-15% as the bus arbitrates between devices.

  • Memory Latency (CAS)

    RDMA Write is only as fast as the receiver's memory can sink it. If multiple flows hit the same memory bank simultaneously (Incast), the memory controller becomes a bottleneck, forcing the NIC to withhold credits and slowing the entire fabric.
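Both limits above reduce to a min() over the stages the data must traverse. A minimal sketch, with the contention fraction and memory-sink rate as assumed values:

```python
# Sketch: achievable RDMA rate is the minimum of the wire rate, the
# PCIe slot rate after contention, and the receiver's memory sink rate.
# Contention fraction and sink rate below are illustrative assumptions.

def effective_gbs(wire_gbs, pcie_slot_gbs, pcie_contention_frac,
                  mem_sink_gbs):
    pcie_available = pcie_slot_gbs * (1.0 - pcie_contention_frac)
    return min(wire_gbs, pcie_available, mem_sink_gbs)

# 400 Gbps NIC (50 GB/s) in a Gen5 x16 slot (63 GB/s), with an assumed
# 25% of the slot consumed by NVMe traffic and ample memory bandwidth:
print(effective_gbs(50, 63, 0.25, 200))   # → 47.25
```

Under these assumptions the bus, not the wire, sets the ceiling: the NIC can only move what the most contended stage admits.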

5. Industrial Forensics: Auditing the Goodput

To measure actual goodput, compare the application-layer payload delivered per second against the raw line rate; the gap is protocol headers, credit stalls, and PCIe contention.
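A hedged sketch of that audit for a RoCE v2 RDMA Write, using the standard header sizes (Ethernet 14 B, IPv4 20 B, UDP 8 B, BTH 12 B, RETH 16 B on the first packet of the message, ICRC 4 B, FCS 4 B, plus 20 B of preamble and inter-frame gap). Exact figures depend on MTU and on which overheads a given tool counts:

```python
import math

def wire_efficiency(msg_bytes, mtu_payload=4096):
    """Fraction of wire bytes that carry application payload for one
    RoCE v2 RDMA Write message (assumed IPv4, no VLAN tag)."""
    per_packet = 14 + 20 + 8 + 12 + 4 + 4 + 20   # headers + CRCs + IFG
    packets = math.ceil(msg_bytes / mtu_payload)
    wire_bytes = msg_bytes + packets * per_packet + 16   # + RETH once
    return packets, msg_bytes / wire_bytes

packets, eff = wire_efficiency(64 * 1024)
print(packets, f"{eff:.1%}")   # → 16 98.0%
```

Multiplying that efficiency by the line rate gives an upper bound on goodput; anything below it in practice is credit stalls, incast, or PCIe contention.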


