Remote Direct Memory Access (RDMA) is the foundation of modern high-performance AI networking. By allowing one GPU to read or write directly to another GPU's memory across the network without involving either node's CPU, RDMA enables the near-line-rate performance required for large-scale GPU clusters. But "enabling" RDMA is only the first step. To get the last 15-20% of performance, you must tune the low-level hardware parameters of the HCA (Host Channel Adapter).

Key Tuning: Queue Pairs (QP)

RDMA communication is managed through **Queue Pairs (QPs)**: every connection consists of a Send Queue and a Receive Queue. In large-scale AI clusters, controlling the total number of QPs is critical. Too many QPs consume scarce HCA memory and overflow the NIC's on-chip QP context cache, turning each I/O operation into a performance-killing cache miss.
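To see how quickly QP counts explode, consider a full-mesh job using reliable-connection (RC) QPs. The sketch below is a back-of-envelope estimate only; actual QP usage depends on the communication library (e.g. NCCL or UCX) and its connection strategy, and the per-peer QP count is an assumed example value.

```python
# Back-of-envelope QP count for a full-mesh RC job.
# Illustrative only: real libraries may lazily connect or share QPs.

def total_qps_per_nic(num_peers: int, qps_per_peer: int) -> int:
    """QPs one NIC must hold to maintain a full mesh to all peers."""
    return num_peers * qps_per_peer

# A 1024-rank job where each rank talks to every other rank over 4 QPs:
qps = total_qps_per_nic(num_peers=1023, qps_per_peer=4)
print(qps)  # 4092 -- easily past a typical on-chip QP cache
```

At thousands of QPs per NIC, every operation risks a context-cache miss, which is exactly the failure mode described above.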

Performance Impact Matrix

| Parameter | Typical Value | Effect |
|---|---|---|
| MTU (payload) | 4096 (RoCE) / 2048 (IB) | Larger MTU reduces interrupt overhead at high packet rates. |
| PCIe Max Read Request | 1024 / 4096 bytes | Crucial for keeping the RDMA pipeline fed from the PCIe bus. |
| Adaptive Routing | Enabled | Prevents "hot-spot" congestion by re-routing traffic. |
| Relaxed Ordering | On | Improves PCIe throughput on modern multi-core hosts. |
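The MTU row is worth quantifying: at a fixed line rate, packet rate (and with it the per-packet work of interrupts, completions, and header processing) scales inversely with payload size. The numbers below are illustrative, assuming a 400 Gb/s link.

```python
# Packet rate at line rate for a given payload size.
# Illustrative arithmetic; ignores header overhead for simplicity.

def packets_per_second(line_rate_gbps: float, mtu_bytes: int) -> float:
    return (line_rate_gbps * 1e9 / 8) / mtu_bytes

pps_1k = packets_per_second(400, 1024)
pps_4k = packets_per_second(400, 4096)
print(f"{pps_1k/1e6:.1f} Mpps vs {pps_4k/1e6:.1f} Mpps")  # 4x fewer packets at 4 KiB
```

Quadrupling the MTU cuts the per-packet event rate by 4x, which is where the interrupt-overhead savings in the table come from.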

Memory Registration & Pinning

Before RDMA can occur, the memory must be **pinned** (registered with the HCA). This ensures the OS doesn't swap that memory out to disk while a DMA transfer is in progress. For large AI models, pinned memory can easily reach 80-90% of total VRAM/DRAM.

Tools like **GPUDirect RDMA** solve this for GPU memory specifically, but host-side memory must be carefully managed to avoid kernel panics when registering terabyte-scale address ranges.
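One common way to manage huge host-side registrations is to split the buffer into fixed-size chunks so that no single registration call covers a terabyte-scale range. The sketch below only computes the chunk boundaries; the chunk size is an assumed value, and in real code each `(addr, len)` pair would be passed to the actual verbs registration call (`ibv_reg_mr`).

```python
# Hedged sketch: chunking a large buffer for piecewise registration.
# CHUNK is an assumed granularity, not a hardware requirement.

CHUNK = 1 << 30  # register 1 GiB at a time

def chunk_ranges(base: int, length: int, chunk: int = CHUNK):
    """Yield (addr, len) pairs covering [base, base + length)."""
    off = 0
    while off < length:
        step = min(chunk, length - off)
        yield base + off, step
        off += step

# A 2 GiB + 512 B buffer splits into two full chunks plus a tail:
ranges = list(chunk_ranges(base=0, length=(1 << 31) + 512))
print(len(ranges))  # 3
```

Chunking also lets registration overlap with compute and keeps any single failure (or kernel misbehavior) confined to one bounded range.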

Strategic Recommendation

For RoCE-based clusters, **PFC (Priority Flow Control)** tuning is the first item on your checklist. RoCE assumes a lossless fabric: without PFC, the retransmission of dropped RDMA packets will kill your performance.
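A first-order model shows why drops are so costly: many RoCE NICs recover with go-back-N, so one lost packet forces re-sending the entire in-flight window. The approximation below is illustrative, not a precise NIC model, and the loss rate and window size are assumed example values.

```python
# First-order go-back-N goodput model: each drop wastes roughly one
# window of packets. Valid only for small loss_rate * window_pkts.

def goodput_fraction(loss_rate: float, window_pkts: int) -> float:
    """Approximate useful fraction of line rate under go-back-N recovery."""
    return max(0.0, 1.0 - loss_rate * window_pkts)

# Even 0.1% loss with 256 packets in flight costs ~25% of throughput:
print(goodput_fraction(0.001, 256))  # ~0.744
```

This is why a "merely" 0.1% drop rate, harmless for TCP, is catastrophic for RoCE, and why PFC (making the fabric lossless) sits at the top of the checklist.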
