In a Nutshell

In the hyperscale era of distributed AI, the ability to predict "Effective Throughput" (Goodput) is the difference between a successful training run and a multi-million dollar idle-GPU catastrophe. RDMA (Remote Direct Memory Access) promises near-line-rate performance, but its actual output is governed by the rigid physics of **Bandwidth-Delay Product (BDP)**, the serialization overhead of **RETH/BTH headers**, and the internal bandwidth limits of the **PCIe bus**. This article provides a rigorous mathematical framework for predicting RDMA throughput across various fabric topologies, from intra-rack NVLink to global routed RoCE v2.


RDMA Throughput & Efficiency Predictor

A high-fidelity modeler for simulating RDMA goodput based on link attributes, delay, and protocol overhead.

Example output (default configuration, RDMA Write):

  • Effective Throughput: 387.97 Gbps (one-way, fire-and-forget)
  • Operation Latency: 9.81 µs (round-trip time to completion)
  • Wire Efficiency: 99.0%
  • Packets Required: 17
  • Goodput: 48,496.8 MB/s
  • CPU Cycles Saved: 3,342,336

"RDMA eliminates the CPU from the data path: no memcpy, no context switches, no kernel crossings."


1. The Zero-Copy Mandate: CPU-Bypass Physics

To understand RDMA throughput, one must first understand the bottleneck it removes: the OS kernel. In standard TCP, a single flow rarely approaches 400 Gbps because the CPU must copy every buffer and service per-packet interrupts. RDMA moves the entire transport logic (segmentation, retransmission, and flow control) into the **NIC ASIC**.

Hardware-Offload Metrics

Because the CPU is bypassed, throughput is limited only by the **HBM/DRAM** bandwidth of the server and the **PCIe link** speed. On an NVIDIA H100 node, the RDMA NIC (ConnectX-7) can pull data directly from GPU memory at 400 Gbps with negligible impact on the training workload.

  • CPU load: 50%+ (TCP) vs. <1% (RDMA)
  • Latency: ~50 µs (TCP) vs. ~1 µs (RDMA)
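As a rough illustration, the CPU cost of the two paths can be modeled in a few lines of Python. The per-byte copy cost, per-packet interrupt cost, and doorbell cost below are illustrative assumptions, not measured values:

```python
# Hypothetical model of CPU cycles spent moving one message via kernel
# TCP versus RDMA. All cycle constants are illustrative assumptions.

def tcp_cpu_cycles(msg_bytes, mtu=1500, cycles_per_byte_copy=2,
                   cycles_per_packet=5000):
    """Kernel TCP: two copies (user->kernel, kernel->NIC ring) plus
    per-packet interrupt/softirq processing."""
    packets = -(-msg_bytes // mtu)          # ceiling division
    return 2 * msg_bytes * cycles_per_byte_copy + packets * cycles_per_packet

def rdma_cpu_cycles(doorbells=1, cycles_per_doorbell=500):
    """RDMA: the CPU only rings a doorbell; the NIC ASIC handles
    segmentation, DMA, and completion."""
    return doorbells * cycles_per_doorbell

msg = 1 << 20                               # 1 MiB message
saved = tcp_cpu_cycles(msg) - rdma_cpu_cycles()
print(f"Estimated CPU cycles saved per 1 MiB message: {saved:,}")
# → Estimated CPU cycles saved per 1 MiB message: 7,693,804
```

The exact constants vary wildly by platform; the point is the structure of the model: TCP cost scales with message size and packet count, RDMA cost is a near-constant doorbell write.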

2. The BDP Constraint: Data in the Pipe

The Bandwidth-Delay Product (BDP) defines the amount of data required to fill the "pipe." If the sender exhausts its window or credits before the first ACK/credit arrives, the link goes idle.

BDP Equation

BDP = Bandwidth × RTT

Example: 400 Gbps × 10 µs = 500 KB

In RDMA, the "Window Size" is controlled by **Credits**. If the receiving NIC doesn't have at least 500KB of buffer space dedicated to this flow, the throughput will drop. In large-scale AI fabrics, managing these credits per-QP (Queue Pair) is the "Art of Fabric Optimization."

As distance increases (e.g., between data center zones), the RTT grows. For a 1 km fiber link, the BDP of an 800 Gbps fabric exceeds 2 MB once switch and NIC latency are added to fiber propagation. Most standard NICs default to smaller credit pools, causing a sharp performance collapse over "long" distances unless tuned.
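The BDP arithmetic is simple enough to sketch directly. In the second example below, the 20 µs RTT (fiber propagation plus switch/NIC latency) is an assumption:

```python
def bdp_bytes(bandwidth_gbps, rtt_us):
    """Bytes that must be in flight to keep the pipe full:
    Gbps x us -> kilobits, then / 8 -> bytes."""
    return bandwidth_gbps * rtt_us * 1e3 / 8

# Worked example from the text: 400 Gbps x 10 us RTT -> 500 KB.
print(bdp_bytes(400, 10))   # → 500000.0

# 800 Gbps over ~1 km: assumed ~20 us RTT pushes the BDP to 2 MB.
print(bdp_bytes(800, 20))   # → 2000000.0
```

The receiver must advertise at least this many bytes of credit per flow, or the sender stalls before the first credit returns.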

3. Credit-Based Logic: The Silence of the Wire

Unlike TCP, which uses a "dropping" signal to manage speed (Congestion Avoidance), InfiniBand and RoCE use a deterministic **Credit-Based** system.

Credit Advertisement

The receiver tells the sender: "I have space for 100 packets." The sender decrements this counter for every packet sent. When it hits zero, it stops—immediately. No dropped packets, no retransmissions.

Throughput Saw-Tooth

If credits are returned slowly (due to CPU stall at the receiver), the sender pulses. This creates a "Saw-Tooth" throughput pattern that is common in fragmented RDMA workloads.
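A toy Python simulation makes the pulsing visible. The tick granularity, credit batch size, and send rate are all illustrative assumptions:

```python
# Toy credit-based flow control: the sender transmits only while it
# holds credits and stops immediately at zero. A receiver that returns
# credits in slow batches produces the pulsing "saw-tooth" send pattern.

def simulate(ticks, initial_credits, send_per_tick, credit_batch,
             batch_every):
    credits, history = initial_credits, []
    for t in range(1, ticks + 1):
        sent = min(credits, send_per_tick)   # stop immediately at zero
        credits -= sent
        if t % batch_every == 0:
            credits += credit_batch          # receiver returns a batch
        history.append(sent)
    return history

# Sender can emit 40 units/tick; receiver returns 80 credits every 4 ticks.
history = simulate(ticks=12, initial_credits=100, send_per_tick=40,
                   credit_batch=80, batch_every=4)
print(history)
# → [40, 40, 20, 0, 40, 40, 0, 0, 40, 40, 0, 0]
```

The bursts of 40 followed by forced silence are the saw-tooth: average throughput collapses to the credit-return rate, not the link rate.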

4. Hardware Limits: The PCIe Gen5 Bottleneck

The network link is rarely the true bound. Modern 400G and 800G NICs are constrained by the server's internal architecture.

  • PCIe Bandwidth Stealing

    A Gen5 x16 slot provides ~63 GB/s (unidirectional). A 400 Gbps NIC consumes 50 GB/s of that. If the system also pushes NVMe storage or other accelerator traffic through the same PCIe root complex, contention can cut RDMA throughput by 10-15% as the bus arbitrates between devices.

  • Memory Latency (CAS)

    RDMA Write is only as fast as the receiver's memory can sink it. If multiple flows hit the same memory bank simultaneously (Incast), the memory controller becomes a bottleneck, forcing the NIC to withhold credits and slowing the entire fabric.
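Both limits above reduce to a min() over the stages the data must traverse. A minimal sketch, with the contention fraction and memory-sink rate as assumed values:

```python
# Sketch: achievable RDMA rate is the minimum of the wire rate, the
# PCIe slot rate after contention, and the receiver's memory sink rate.
# Contention fraction and sink rate below are illustrative assumptions.

def effective_gbs(wire_gbs, pcie_slot_gbs, pcie_contention_frac,
                  mem_sink_gbs):
    pcie_available = pcie_slot_gbs * (1.0 - pcie_contention_frac)
    return min(wire_gbs, pcie_available, mem_sink_gbs)

# 400 Gbps NIC (50 GB/s) in a Gen5 x16 slot (63 GB/s), with an assumed
# 25% of the slot consumed by NVMe traffic and ample memory bandwidth:
print(effective_gbs(50, 63, 0.25, 200))   # → 47.25
```

Under these assumptions the bus, not the wire, sets the ceiling: the NIC can only move what the most contended stage admits.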

5. Industrial Forensics: Auditing the Goodput

To measure actual goodput, compare the application-layer payload delivered per second against the raw line rate; the gap is protocol headers, credit stalls, and PCIe contention.
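A hedged sketch of that audit for a RoCE v2 RDMA Write, using the standard header sizes (Ethernet 14 B, IPv4 20 B, UDP 8 B, BTH 12 B, RETH 16 B on the first packet of the message, ICRC 4 B, FCS 4 B, plus 20 B of preamble and inter-frame gap). Exact figures depend on MTU and on which overheads a given tool counts:

```python
import math

def wire_efficiency(msg_bytes, mtu_payload=4096):
    """Fraction of wire bytes that carry application payload for one
    RoCE v2 RDMA Write message (assumed IPv4, no VLAN tag)."""
    per_packet = 14 + 20 + 8 + 12 + 4 + 4 + 20   # headers + CRCs + IFG
    packets = math.ceil(msg_bytes / mtu_payload)
    wire_bytes = msg_bytes + packets * per_packet + 16   # + RETH once
    return packets, msg_bytes / wire_bytes

packets, eff = wire_efficiency(64 * 1024)
print(packets, f"{eff:.1%}")   # → 16 98.0%
```

Multiplying that efficiency by the line rate gives an upper bound on goodput; anything below it in practice is credit stalls, incast, or PCIe contention.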


