GPUDirect RDMA: The Zero-Copy Protocol
The Traditional Bottleneck.
Ordinarily, data arriving at a **Network Interface Card (NIC)** must first be copied into a **system RAM buffer** managed by the CPU. The CPU must then spend cycles copying that same data from system RAM into the **GPU memory (HBM3)**.
This "Double Copy" is a disaster for performance. It consumes PCIe bandwidth twice, stresses the memory controller, and adds precious microseconds of latency. **GPUDirect RDMA** solves this by letting the NIC write directly to the GPU memory over the PCIe bus.
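The bandwidth cost of the double copy can be roughed out with simple arithmetic. The numbers below are illustrative assumptions (a PCIe 4.0 x16 link with ~32 GB/s usable bandwidth and a 1 GiB tensor), not measurements — the point is that the payload crosses the PCIe link twice, so the two-copy path takes exactly twice as long:

```python
# Back-of-the-envelope cost of the "Double Copy" path vs. GPUDirect RDMA.
# Assumed numbers: PCIe 4.0 x16 (~32 GB/s usable) and a 1 GiB tensor block.

PCIE_BW = 32e9        # bytes/sec, assumed usable PCIe bandwidth
PAYLOAD = 1 << 30     # 1 GiB tensor block

# Traditional path: NIC -> system RAM, then system RAM -> GPU HBM.
double_copy_time = 2 * PAYLOAD / PCIE_BW

# GPUDirect path: NIC -> GPU HBM in a single DMA transfer.
direct_time = PAYLOAD / PCIE_BW

print(f"double copy: {double_copy_time * 1e3:.1f} ms")
print(f"direct:      {direct_time * 1e3:.1f} ms")
```

On these assumed figures the direct path moves the block in roughly 34 ms instead of 67 ms — and that is before counting the CPU context switches the traditional path also pays.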
BAR1 Visibility
GPUDirect exposes a region of the GPU's memory to the PCIe bus through the GPU's BAR1 aperture, allowing Peer-to-Peer (P2P) DMA mapping with other devices such as NICs or NVMe drives.
Latency Reduction
Bypassing the OS and CPU context switches cuts intra-node latency by 3-5 microseconds—a lifetime in high-frequency trading or AI training.
Full Payloads
Maximizes the efficiency of RDMA 'Zero-Copy' by transferring full, unfragmented tensor blocks directly into the GPU's device memory (HBM).
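From application code, the mechanism above surfaces through ordinary `libibverbs` memory registration: with the peer-memory module loaded, `ibv_reg_mr` accepts a CUDA device pointer directly. The sketch below illustrates that flow; it assumes an RDMA-capable NIC, CUDA, and a loaded `nvidia-peermem`/`nv_peer_mem` module, trims error handling, and is not runnable without that hardware:

```c
/* Sketch: registering GPU memory for GPUDirect RDMA with libibverbs.
 * Assumes CUDA + an RDMA NIC with the nvidia-peermem module loaded;
 * error handling omitted for brevity. */
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t bytes)
{
    void *gpu_buf = NULL;
    cudaMalloc(&gpu_buf, bytes);              /* device (HBM) allocation */

    /* With the peer-memory module loaded, ibv_reg_mr accepts the device
     * pointer directly: the driver pins the GPU pages and exposes them
     * via BAR1 so the NIC can DMA into HBM without touching system RAM. */
    return ibv_reg_mr(pd, gpu_buf, bytes,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}
```

The returned `ibv_mr` is then used in RDMA work requests exactly like a host-memory region — the zero-copy path is transparent to the rest of the verbs code.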
Technical Requirements.
OS Compatibility
Requires kernel support for peer-to-peer DMA (the PeerDirect capability shipped with MLNX_OFED) and the NVIDIA peer-memory module (`nv_peer_mem`, superseded by `nvidia-peermem` in recent driver releases).
Hardware Affinity
The NIC and GPU must share the same PCIe root complex. Routing P2P traffic through an IOMMU or across a CPU socket interconnect can erase the performance gain entirely.
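Both requirements can be sanity-checked from a Linux shell, assuming the NVIDIA driver and `nvidia-smi` are installed:

```shell
# Is the peer-memory module loaded? (nv_peer_mem on older stacks,
# nvidia_peermem on current drivers)
lsmod | grep -E 'nv_peer_mem|nvidia_peermem'

# Do the NIC and GPU share a PCIe root complex? In the topology matrix,
# look for PIX/PXB (same switch or bridge) between the NIC and GPU rather
# than NODE or SYS (crossing a host bridge or a CPU socket).
nvidia-smi topo -m
```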
Related Protocols
How RDMA integrates with the global fabric.
