RDMA Throughput & Efficiency Predictor
A high-fidelity modeler for simulating RDMA goodput based on link attributes, delay, and protocol overhead.
RDMA Efficiency Breakdown
Wire efficiency and CPU savings for RDMA Write operations (sample calculator output):

| Metric | Value |
| --- | --- |
| Packets Required | 17 |
| Goodput (MB/s) | 48,496.8 |
| CPU Cycles Saved | 3,342,336 |
"RDMA eliminates CPU from the data path. No memcpy, no context switches, no kernel crossings."
1. The Zero-Copy Mandate: CPU-Bypass Physics
To understand RDMA throughput, one must first understand the bottleneck it removes: the OS kernel. With standard TCP, a single flow has effectively no chance of reaching 400Gbps because the CPU must process every interrupt and copy every buffer. RDMA moves the entire transport logic (segmentation, retransmission, and flow control) into the **NIC ASIC**.
Hardware-Offload Metrics
Because the CPU is bypassed, the throughput is limited only by the **HBM/DRAM** bandwidth of the server and the **PCIe Link** speed. On an NVIDIA H100 node, the RDMA NIC (ConnectX-7) can pull data from GPU memory at 400Gbps with zero impact on the training workload.
2. The BDP Constraint: Data in the Pipe
The Bandwidth-Delay Product (BDP) defines the amount of data required to fill the "pipe." If the sender stops before the first ACK/Credit arrives, the link goes idle.
BDP Equation
BDP (bytes) = Link Bandwidth (bits/s) × RTT (s) / 8
In RDMA, the "Window Size" is controlled by **Credits**. If the receiving NIC doesn't dedicate at least a BDP's worth of buffer space to this flow (roughly 500KB for 400Gbps at a 10µs RTT), throughput drops. In large-scale AI fabrics, managing these credits per-QP (Queue Pair) is the "Art of Fabric Optimization."
As distance increases (e.g., between data center zones), the RTT grows. For a 1km fiber link, the BDP of an 800Gbps fabric exceeds 2MB once switch and queueing latency are included. Most standard NICs default to smaller credit pools, causing a major performance collapse over "long" distances unless tuned.
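The pipe-filling math above can be sketched numerically. The 15µs of switch/queueing latency added on top of fiber propagation is an illustrative assumption, not a measured figure:

```python
# Hedged sketch: BDP for an 800 Gbps link over ~1 km of fiber.
# Assumption: light travels ~2e8 m/s in silica fiber; the 15 us of
# switch/queueing latency is illustrative, not measured.

def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-Delay Product: bits in flight, converted to bytes."""
    return bandwidth_bps * rtt_s / 8

FIBER_SPEED_MPS = 2e8                                 # ~2/3 of c in fiber
distance_m = 1_000
propagation_rtt = 2 * distance_m / FIBER_SPEED_MPS    # 10 us round trip
switch_latency = 15e-6                                # assumed fabric overhead

bdp = bdp_bytes(800e9, propagation_rtt + switch_latency)
print(f"BDP: {bdp / 1e6:.1f} MB")  # prints "BDP: 2.5 MB"
```

The credit pool per QP must cover this full amount, or the sender idles waiting for credit returns before the pipe is ever full.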
3. Credit-Based Logic: The Silence of the Wire
Unlike TCP, which uses packet loss as its congestion signal (Congestion Avoidance), InfiniBand and RoCE use a deterministic **Credit-Based** system.
Credit Advertisement
The receiver tells the sender: "I have space for 100 packets." The sender decrements this counter for every packet sent. When it hits zero, it stops—immediately. No dropped packets, no retransmissions.
Throughput Saw-Tooth
If credits are returned slowly (due to a CPU stall at the receiver), the sender pulses: bursting while credits last, then stalling. This creates a "Saw-Tooth" throughput pattern common in fragmented RDMA workloads.
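The decrement-and-stop behavior described above can be sketched as follows; the class and method names are hypothetical, not an InfiniBand or verbs API:

```python
# Minimal sketch of credit-based flow control (illustrative names only).
# The sender stops the instant credits hit zero; nothing is dropped,
# nothing is retransmitted.

class CreditedSender:
    def __init__(self, initial_credits: int):
        self.credits = initial_credits
        self.sent = 0

    def try_send(self, n_packets: int) -> int:
        """Send up to n_packets, bounded by available credits."""
        granted = min(n_packets, self.credits)
        self.credits -= granted
        self.sent += granted
        return granted

    def return_credits(self, n: int):
        """Receiver advertises freed buffer space."""
        self.credits += n

qp = CreditedSender(initial_credits=100)
assert qp.try_send(120) == 100   # only 100 credits: sender stalls at zero
assert qp.try_send(10) == 0      # wire goes silent; no drops
qp.return_credits(25)            # receiver drains buffers, credits return
assert qp.try_send(10) == 10     # bursting resumes: the saw-tooth
```

Slow credit returns make the `try_send(...) == 0` state dominate, which is exactly the saw-tooth the section describes.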
4. Hardware Limits: The PCIe Gen5 Bottleneck
The network link is rarely the true bound. Modern 400G and 800G NICs are constrained by the server's internal architecture.
- PCIe Bandwidth Stealing
A Gen5 x16 slot provides 63GB/s (unidirectional). A 400Gbps NIC eats 50GB/s of that. If your system is also using PCIe for NVMe storage or other accelerators, the "contention" on the PCIe root complex will drop RDMA throughput by 10-15% as the bus negotiates priorities.
- Memory Latency (CAS)
RDMA Write is only as fast as the receiver's memory can sink it. If multiple flows hit the same memory bank simultaneously (Incast), the memory controller becomes a bottleneck, forcing the NIC to withhold credits and slowing the entire fabric.
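The contention argument can be checked with a back-of-envelope budget. The 63GB/s slot figure comes from the text above; the NVMe consumer and its burst rate are hypothetical:

```python
# Back-of-envelope PCIe budget check. Slot figure (Gen5 x16 ~63 GB/s
# usable, unidirectional) is from the text; the NVMe burst is a
# hypothetical co-tenant on the same root complex.

def pcie_headroom_gbs(slot_gbs: float, consumers_gbs: list) -> float:
    """Remaining slot bandwidth after all consumers; negative => contention."""
    return slot_gbs - sum(consumers_gbs)

slot = 63.0                 # Gen5 x16, after encoding overhead
nic = 400e9 / 8 / 1e9       # 400 Gbps NIC -> 50 GB/s
nvme = 14.0                 # assumed Gen5 x4 NVMe burst

headroom = pcie_headroom_gbs(slot, [nic, nvme])
print(f"Headroom: {headroom:.1f} GB/s")  # prints "Headroom: -1.0 GB/s"
```

A negative headroom means the root complex must arbitrate, which is where the 10-15% RDMA throughput drop comes from.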
5. Industrial Forensics: Auditing the Goodput
To measure actual goodput, compare the application-layer bytes actually delivered against the raw line-rate bits on the wire; the gap is header, preamble, and inter-packet-gap overhead.
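As a sketch of this audit, the per-packet overhead of RoCEv2 (Ethernet framing plus IPv4/UDP/BTH/ICRC) bounds wire efficiency; the 4096-byte payload per packet is an assumption for illustration:

```python
# Hedged sketch: wire efficiency of RoCEv2 with an assumed 4096 B payload.
# Overhead figures are standard per-packet costs, not measured values.

ETH_OVERHEAD = 8 + 12 + 14 + 4    # preamble+SFD, inter-packet gap, header, FCS
ROCE_OVERHEAD = 20 + 8 + 12 + 4   # IPv4, UDP, BTH (Base Transport Header), ICRC
payload = 4096

wire_bytes = payload + ETH_OVERHEAD + ROCE_OVERHEAD
efficiency = payload / wire_bytes
goodput_gbps = 400 * efficiency

print(f"Wire efficiency: {efficiency:.1%}")        # ~98% of line rate
print(f"Goodput: {goodput_gbps:.1f} Gbps at 400G") # the rest is overhead
```

Anything measured below this ceiling points at the other bottlenecks in this article: credit starvation, PCIe contention, or memory-bank incast at the receiver.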
