GPUDirect Storage (GDS): The Physics of Zero-Copy I/O
Eliminating the CPU Bottleneck in Post-Blackwell AI Infrastructure
In the era of Large Language Models (LLMs) and massive-scale scientific simulations, network and compute often outpace the storage subsystem's ability to feed them. Traditional I/O involves a "bounce-buffer" paradox: data must be copied from NVMe to system memory (CPU DRAM) before it can move to the GPU memory.
**GPUDirect Storage (GDS)** is the hardware-accelerated answer to this inefficiency. By establishing a direct DMA (Direct Memory Access) path between storage controllers and GPU memory, GDS offloads the CPU, slashes latency, and allows for near-line-rate throughput on PCIe Gen4/Gen5 links.
Legacy I/O Path
- **Double Copy Latency**: Data is moved twice, doubling the energy and time spent in the PCIe fabric.
- **CPU Interrupt Storms**: Every I/O block requires a context switch and intensive CPU cycles to manage the transfer.
- **DRAM Bottleneck**: Sustained throughput is capped by the system's memory bandwidth and I/O concurrency.
GDS Direct-Path
- **Zero-Copy Logic**: Data travels directly from the NVMe controller to the GPU BAR space.
- **Minimal CPU Cycles**: The CPU acts only as a control-plane orchestrator, not a data-plane conduit.
- **Deterministic Latency**: Removing the bounce-buffer eliminates jitter caused by DRAM contention.
The Bounce Buffer Burden.
To understand why GDS is necessary, one must understand the traditional Linux I/O stack. When an application needs data from an NVMe drive for a GPU kernel, the standard `read()` system call follows a circuitous route. Data must first be moved from the storage device into the **Host Page Cache** (part of system RAM).
Because the GPU cannot directly access CPU-private page cache memory due to memory management constraints (I/O Virtual Address space differences), the data is then "bounced" to a pinned buffer in DRAM before being pulled across the PCIe root complex into GPU memory via `cudaMemcpy`.
The "CPU Tax" Breakdown
Moving data through the host stack triggers frequent interrupts and kernel/user space transitions, significantly loading the CPU cores.
The "bounce" consumes double the DRAM bandwidth, competing with the CPU's own compute operations and slowing down the system.
Data travels the same PCIe links twice—once to get to DRAM and once to get to the GPU—halving the effective efficiency of the fabric.
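The accounting behind this breakdown can be sketched as a toy model. It is a simplification under stated assumptions: every hop moves the full payload exactly once, and the extra page-cache-to-pinned-buffer copy is ignored. The names `PathCost`, `legacy_path_cost`, and `gds_path_cost` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathCost:
    pcie_bytes: int  # bytes moved across PCIe links
    dram_bytes: int  # bytes written to or read from host DRAM


def legacy_path_cost(payload_bytes: int) -> PathCost:
    """Bounce-buffer path: NVMe -> host DRAM -> GPU (simplified)."""
    # PCIe is crossed twice: NVMe to DRAM, then DRAM to GPU.
    pcie = 2 * payload_bytes
    # DRAM sees one write when data lands and one read when it leaves.
    # Assumption: the page-cache-to-pinned-buffer copy is not counted.
    dram = 2 * payload_bytes
    return PathCost(pcie_bytes=pcie, dram_bytes=dram)


def gds_path_cost(payload_bytes: int) -> PathCost:
    """Direct path: NVMe -> GPU memory with no host DRAM staging."""
    return PathCost(pcie_bytes=payload_bytes, dram_bytes=0)


if __name__ == "__main__":
    one_gb = 10**9
    print("legacy:", legacy_path_cost(one_gb))
    print("gds:   ", gds_path_cost(one_gb))
```

Even in this simplified form, the legacy path moves twice the payload over PCIe and touches DRAM twice, while the direct path touches DRAM not at all.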
CUFILE: The Software Bridge.
GPUDirect Storage is enabled by the `libcufile.so` user-space library and the `nvidia-fs` kernel driver. Instead of standard POSIX `read()`, developers use the `cuFileRead()` API. This triggers a different behavior in the kernel:
- **1. Memory Pinning**: GDS pins the GPU memory buffer directly so that the I/O subsystem knows exactly where the physical memory address (Bus Address) resides in the fabric.
- **2. Direct Peer-to-Peer (P2P)**: If the storage and GPU are on the same PCIe switch hierarchy, the NVMe controller performs a DMA write directly into the GPU's memory BAR space. The CPU stays in the control plane, purely for metadata and completion signals.
- **3. Remote GDS (RNIC)**: For scaled-out clusters, GDS extends over the network. Using **NVMe-oF (NVMe over Fabrics)**, a remote storage node can DMA data directly into a GPU in a different rack via RDMA (RoCE or InfiniBand), bypassing both the local and remote CPUs.
Impact on AI Workflows
Faster Model Checkpointing
LLM training involves frequent "checkpoints" to save weights. With GDS, checkpointing time can be reduced by 3-5x, increasing the overall TFLOPS utilized for actual training by minimizing idle time.
Massive Data Loading
For vision-based AI or complex dataset preprocessing, loading billions of small files is a CPU-bound task. GDS allows the GPU and its high-bandwidth memory (HBM) to handle the data ingest directly.
NFS over RDMA vs. GDS.
A common point of confusion is the relationship between **NFS over RDMA** and GPUDirect Storage. While both leverage RDMA to bypass the network stack, they serve different layers of the I/O problem. NFS over RDMA optimizes the *transfer* of data between the storage server and the host client's memory. However, once the data arrives at the host, it still lands in the CPU DRAM (the page cache).
GDS completes the "last mile." It consumes the RDMA-delivered data and directs it into the GPU memory without the CPU ever "touching" the payload.
This combination is what enables the massive "Rail-Optimized" storage networks found in Blackwell and Hopper clusters, where storage throughput is treated with the same priority as the compute interconnect.
Checkpointing Physics.
In distributed LLM training (e.g., GPT-4 or Llama-3 clusters), **Checkpointing** is the most frequent storage operation. Every few hours, the entire state of the model (weights, optimizer states, gradients) is pushed to global storage to prevent data loss from a single node failure.
For a 70B parameter model in 16-bit precision, a single checkpoint can exceed **200 GB per node**. In a 1,024-node cluster, this is a **200 TB** write burst.
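The arithmetic behind these figures can be checked with a few lines of Python. This is a rough sketch: the helper names are hypothetical, decimal units are assumed (1 GB = 10^9 bytes), and the 200 GB per-node figure is taken from the text as given. Weights alone come to 140 GB, so the per-node figure presumably includes optimizer state and gradients as well.

```python
GB = 10**9
TB = 10**12


def weights_bytes(params: int, bytes_per_param: int = 2) -> int:
    """Size of the raw weights; 2 bytes per parameter for 16-bit precision."""
    return params * bytes_per_param


def burst_bytes(per_node_bytes: int, nodes: int) -> int:
    """Total bytes written if every node writes its checkpoint at once."""
    return per_node_bytes * nodes


if __name__ == "__main__":
    w = weights_bytes(70 * 10**9)
    print(f"70B weights at 16-bit: {w / GB:.0f} GB")        # 140 GB
    burst = burst_bytes(200 * GB, 1024)
    print(f"1,024 nodes x 200 GB:  {burst / TB:.1f} TB")    # 204.8 TB
```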
GDS Efficiency in Checkpointing
- Operation Mode: Direct DMA Write (O_DIRECT)
- Host Interrupt Latency: Reduced by 85%
- Total Checkpoint Time: 4x Faster (on Parallel FS)
- CPU Savings: ~50% cycle reclamation
By shortening each checkpoint, GDS increases the **Effective TFLOPS** of the cluster. Every second spent in a blocking checkpoint is a second in which the GPUs (H100s, for example) sit idle, and at cluster scale that idle time is expensive.
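The link between checkpoint duration and useful compute can be illustrated with a small model. It assumes blocking (synchronous) checkpoints, where training stops entirely while state is written. The one-hour interval and the 240-second baseline are made-up illustration values, not measurements, and `training_fraction` is a hypothetical helper.

```python
def training_fraction(interval_s: float, checkpoint_s: float) -> float:
    """Fraction of wall-clock time spent training when each training
    interval is followed by a blocking checkpoint of the given length."""
    if interval_s <= 0 or checkpoint_s < 0:
        raise ValueError("interval must be positive, checkpoint non-negative")
    return interval_s / (interval_s + checkpoint_s)


if __name__ == "__main__":
    interval = 3600.0           # assumed: one checkpoint per hour
    baseline = 240.0            # assumed: legacy checkpoint duration
    with_gds = baseline / 4.0   # the 4x speedup quoted in the list above

    print(f"legacy:   {training_fraction(interval, baseline):.4f}")  # 0.9375
    print(f"with GDS: {training_fraction(interval, with_gds):.4f}")
```

The gain grows as checkpoints become more frequent or larger relative to the training interval.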
Fabric Geometry: The PCIe Switch.
GDS performance is fundamentally limited by the physical topology of the PCIe fabric. In a modern AI server (like a DGX H100), the GPUs and NVMe drives are connected via a dedicated High-Speed **PCIe Switch** (e.g., PLX or Broadcom PEX).
Optimal GDS Paths
- **Same PCIe switch**: Peak GDS performance. Data moves from the NVMe port to the GPU port without ever crossing the CPU Root Complex. Latency is sub-microsecond.
- **Across PCIe switches**: Moderate performance. Data must traverse the upstream port of one switch and come down through another. Latency increases by ~400-600ns per hop.
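The per-hop figure quoted above can be turned into a simple bound on added fabric latency. This sketch assumes hops add linearly and ignores congestion; the function name and the default range are illustrative only.

```python
from typing import Tuple


def added_fabric_latency_ns(
    switch_hops: int, per_hop_ns: Tuple[int, int] = (400, 600)
) -> Tuple[int, int]:
    """Return the (low, high) added latency in nanoseconds for a path that
    crosses the given number of extra switch hops."""
    if switch_hops < 0:
        raise ValueError("switch_hops must be non-negative")
    low, high = per_hop_ns
    return switch_hops * low, switch_hops * high


if __name__ == "__main__":
    for hops in (0, 1, 2):
        low, high = added_fabric_latency_ns(hops)
        print(f"{hops} extra hop(s): +{low} to +{high} ns")
```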
"If the storage is physically distant from the GPU in the PCIe tree, the Root Complex becomes a bottleneck, and the 'bounce' effect is mitigated but the fabric congestion remains."
Storage Ecosystem Compatibility.
Not all filesystems are GDS-aware. To trigger a `cuFile` direct DMA, the underlying storage driver must implement specific NVIDIA-defined hooks.
Parallel File Systems
**Lustre, BeeGFS, and IBM Spectrum Scale (GPFS)**. These are the gold standards for GDS. They take advantage of distributed data striping to feed multiple GPUs simultaneously at PB/s aggregate rates.
Software-Defined Storage
**WEKA.io and VAST Data**. These modern stacks are built with GDS as a native citizen. WEKA, in particular, leverages its zero-copy architecture to outperform traditional Lustre in high-file-count AI workloads.
The Math of I/O Determinism.
Performance in GDS is not just about throughput; it is about **Predictability**. In synchronous training, a single "tail latency" event on one GPU's storage read can stall the entire 32K GPU cluster.
We model the total I/O latency ($L_{total}$) as: $L_{total} = L_{storage} + L_{fabric} + L_{stack} + L_{copy}$.
In traditional I/O, $L_{stack}$ (kernel overhead) and $L_{copy}$ (CPU memory move) are high and stochastic. GDS reduces $L_{stack}$ to almost zero and eliminates $L_{copy}$ entirely. At 100 GB/s, a 1 GB read takes **10 ms**. If the CPU "bounce" adds even **5 ms** of jitter, that read takes 50% longer.
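The latency model and the worked numbers can be reproduced in a short sketch. The function names are hypothetical; the inputs are the 1 GB read, 100 GB/s bandwidth, and 5 ms of jitter from the paragraph above.

```python
def transfer_time_ms(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time in milliseconds at the given bandwidth."""
    return size_gb * 1000.0 / bandwidth_gb_per_s


def total_latency_ms(
    l_storage: float, l_fabric: float, l_stack: float, l_copy: float
) -> float:
    """L_total = L_storage + L_fabric + L_stack + L_copy (all in ms)."""
    return l_storage + l_fabric + l_stack + l_copy


def relative_penalty(base_ms: float, extra_ms: float) -> float:
    """Extra latency expressed as a fraction of the base latency."""
    return extra_ms / base_ms


if __name__ == "__main__":
    base = transfer_time_ms(1.0, 100.0)
    print(f"1 GB at 100 GB/s: {base:.1f} ms")                              # 10.0 ms
    print(f"penalty from 5 ms jitter: {relative_penalty(base, 5.0):.0%}")  # 50%
```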
PCIe Gen5 Saturation
A x16 PCIe Gen5 slot has a theoretical peak of about **63 GB/s** per direction (32 GT/s per lane across 16 lanes with 128b/130b encoding). For a server with 8 GPUs and 8 corresponding NVMe storage paths, the aggregate ingest can exceed **500 GB/s per server**. Managing this without GDS would tie up essentially all of a dual-socket Sapphire Rapids CPU just to shuffle bytes.
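The per-link figure can be derived from the PCIe 5.0 signalling rate of 32 GT/s per lane and its 128b/130b line encoding. The sketch below computes only the encoding-limited ceiling, which comes out at roughly 63 GB/s per direction. Packet headers and flow control reduce what an application actually sees. The function name is hypothetical.

```python
def pcie_bandwidth_gb_s(
    gt_per_s: float, lanes: int, enc_payload: int = 128, enc_total: int = 130
) -> float:
    """Encoding-limited one-direction bandwidth in GB/s.

    gt_per_s * lanes gives raw Gbit/s; the encoding ratio removes line-code
    overhead; dividing by 8 converts bits to bytes.
    """
    return gt_per_s * lanes * (enc_payload / enc_total) / 8.0


if __name__ == "__main__":
    per_link = pcie_bandwidth_gb_s(32.0, 16)
    print(f"Gen5 x16 per link: {per_link:.3f} GB/s")      # 63.015 GB/s
    print(f"8 links per server: {8 * per_link:.1f} GB/s")
```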
Throughput Benchmarks
"Benchmarks simulated on Gen5 x16 fabric with CUFILE v1.9+ and GDS-enabled NVMe-oF target."
Storage & GDS Encyclopedia.
CUFILE (libcufile.so)
The NVIDIA user-space library that provides the API for GDS. It handles buffer registration, path optimization, and fallback to legacy I/O if hardware constraints aren't met.
O_DIRECT
A Linux file flag that bypasses the host page cache. GDS requires O_DIRECT to ensure the host CPU doesn't try to intercept the data.
DMA (Direct Memory Access)
The ability of a hardware component (like an NVMe controller) to access system memory independently of the CPU.
NVMe-oF
A protocol that extends NVMe commands over networking fabrics like InfiniBand or Ethernet using RDMA, enabling remote GDS.
Pinned Memory
Memory that is locked into physical RAM and cannot be swapped to disk by the OS. Necessary for DMA operations to be safe.
PCIe BAR
Base Address Registers used by a device to map internal memory into host physical address space. GDS writes directly into the GPU's BAR.
IOVA
I/O Virtual Address. GDS handles the translation between Storage device IOVA and GPU memory space to ensure security and isolation.
Interrupt Coalescing
Reducing the rate of I/O interrupts sent to the CPU, allowing for higher efficiency in the storage driver during heavy throughput.
Zero-Copy
The elimination of redundant data copies within system memory. GDS enables "True Zero-Copy" from disk to high-bandwidth memory (HBM).
Parallel FS
Lustre or BeeGFS. Filesystems that stripe data across multiple nodes to maximize aggregate throughput, essential for feeding 100GB/s GDS links.
Burst Buffer
A high-speed intermediate storage layer used to absorb massive checkpointing writes before trickling them down to slower permanent storage.
WEKA
A software-defined storage platform that utilizes a proprietary protocol to deliver unified GDS performance across cloud and edge.
Root Complex
The hub of the CPU's PCIe connectivity. GDS aims to keep data lower in the switch fabric, avoiding the Root Complex whenever possible.
Peer-to-Peer
Direct device-to-device communication on the PCIe bus. GDS is technically a specialized form of Storage-to-GPU Peer-to-Peer DMA.
Scatter-Gather
The ability of an I/O controller to read/write multiple non-contiguous memory segments in a single transactional burst.
IOMMU
Memory management unit for I/O. GDS requires a modern IOMMU with high-throughput translation buffers to avoid bottlenecking the address resolution.
Aggregate Ingest (Gen5 x16)
112.4 GB/s Peak.
Technical metrics extracted from CUFILE v1.9 benchmarks on H100 systems. Real-world performance subject to filesystem caching, PCIe switch oversubscription, and storage target latency.
© 2026 Pingdo Labs. Technical Reference Series No. 22.
