GPUDirect Storage (GDS): The Physics of Zero-Copy I/O
Eliminating the CPU Bottleneck in Post-Blackwell AI Infrastructure
In the era of Large Language Models (LLMs) and massive-scale scientific simulations, the network and compute units are often faster than the storage subsystem's ability to feed them. Traditional I/O involves a "bounce-buffer" paradox: data must be copied from NVMe to system memory (CPU DRAM) before it can move to the GPU memory.
**GPUDirect Storage (GDS)** is the hardware-accelerated answer to this inefficiency. By establishing a direct DMA (Direct Memory Access) path between storage controllers and GPU memory, GDS offloads the CPU, slashes latency, and allows for near-line-rate throughput on PCIe Gen4/Gen5 links.
Legacy I/O Path
- **Double Copy Latency**: Data is moved twice, doubling the energy and time spent in the PCIe fabric.
- **CPU Interrupt Storms**: Every I/O block requires a context switch and intensive CPU cycles to manage the transfer.
- **DRAM Bottleneck**: Sustained throughput is capped by the system's memory bandwidth and I/O concurrency.
GDS Direct-Path
- **Zero-Copy Logic**: Data travels directly from the NVMe controller to the GPU's BAR space.
- **Minimal CPU Cycles**: The CPU acts only as a control-plane orchestrator, not a data-plane conduit.
- **Deterministic Latency**: Removing the bounce buffer eliminates jitter caused by DRAM contention.
The Bounce Buffer Burden.
To understand why GDS is necessary, one must understand the traditional Linux I/O stack. When an application needs data from an NVMe drive to end up in GPU memory, the standard `read()` system call follows a circuitous route. Data must first be moved from the storage device into the **Host Page Cache** (part of system RAM).
Because the GPU cannot directly access CPU-private page cache memory due to memory management constraints (I/O Virtual Address space differences), the data is then "bounced" to a pinned buffer in DRAM before being pulled across the PCIe root complex into GPU memory via `cudaMemcpy`.
The "CPU Tax" Breakdown
Moving data through the host stack triggers frequent interrupts and kernel/user space transitions, significantly loading the CPU cores.
The "bounce" consumes double the DRAM bandwidth, competing with the CPU's own compute operations and slowing down the system.
Data crosses the PCIe fabric twice, once to reach DRAM and once to reach the GPU, so every payload byte consumes twice the fabric capacity it needs.
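The accounting behind these costs can be sketched in a few lines. This is a deliberately simplified model, not a measurement: the function name `bytes_moved` is hypothetical, and it assumes the bounce path writes the payload into DRAM once and reads it back once, ignoring any extra page-cache-to-pinned-buffer copy.

```python
def bytes_moved(payload_bytes: int, direct: bool) -> dict:
    """Bytes crossing PCIe and touching host DRAM for one storage-to-GPU transfer.

    Assumption: the bounce path lands the payload in DRAM (one write) and then
    reads it back out for the GPU (one read). A direct path skips DRAM.
    """
    if direct:
        return {"pcie": payload_bytes, "dram": 0}
    return {"pcie": 2 * payload_bytes, "dram": 2 * payload_bytes}


one_gb = 10**9
bounce = bytes_moved(one_gb, direct=False)
gds = bytes_moved(one_gb, direct=True)

print("bounce path:", bounce)
print("direct path:", gds)
```

Under these assumptions the direct path halves PCIe traffic and removes the DRAM traffic for the payload entirely.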
CUFILE: The Software Bridge.
GPUDirect Storage is enabled by the `libcufile.so` user-space library and the `nvidia-fs` kernel driver. Instead of standard POSIX `read()`, developers use the `cuFileRead()` API. This triggers a different behavior in the kernel:
- **1. Memory Pinning**: GDS pins the GPU memory buffer so that the I/O subsystem knows exactly where its physical (bus) address resides in the fabric.
- **2. Direct Peer-to-Peer (P2P)**: If the storage and GPU are on the same PCIe switch hierarchy, the NVMe controller performs a DMA write directly into the GPU's memory BAR space. The CPU stays in the control plane, purely for metadata and completion signals.
- **3. Remote GDS (RNIC)**: For scaled-out clusters, GDS extends over the network. Using **NVMe-oF (NVMe over Fabrics)**, a remote storage node can DMA data directly into a GPU in a different rack via RDMA (RoCE or InfiniBand), bypassing both the local and remote CPUs.
Impact on AI Workflows
Faster Model Checkpointing
LLM training involves frequent "checkpoints" to save weights. With GDS, checkpointing time can be reduced by 3-5x, increasing the overall TFLOPS utilized for actual training by minimizing idle time.
Massive Data Loading
For vision-based AI or complex dataset preprocessing, loading billions of small files is a CPU-bound task. GDS allows the GPU and its high-bandwidth memory (HBM) to handle the data ingest directly.
NFS over RDMA vs. GDS.
A common point of confusion is the relationship between **NFS over RDMA** and GPUDirect Storage. While both leverage RDMA to bypass the network stack, they serve different layers of the I/O problem. NFS over RDMA optimizes the *transfer* of data between the storage server and the host client's memory. However, once the data arrives at the host, it still lands in the CPU DRAM (the page cache).
GDS completes the "last mile." It consumes the RDMA-delivered data and directs it into the GPU memory without the CPU ever "touching" the payload.
This combination is what enables the massive "Rail-Optimized" storage networks found in Blackwell and Hopper clusters, where storage throughput is treated with the same priority as the compute interconnect.
Checkpointing Physics.
In distributed LLM training (e.g., GPT-4 or Llama-3 clusters), **Checkpointing** is the dominant write-side storage operation. Every few hours, the entire state of the model (weights, optimizer states, gradients) is pushed to global storage to prevent data loss from a single node failure.
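For a 70B-parameter model in 16-bit precision, the weights alone occupy about 140 GB (70 × 10^9 parameters × 2 bytes), and optimizer state and gradients push a full checkpoint well beyond **200 GB**. If each node in a 1,024-node cluster writes 200 GB, the result is a write burst of roughly **200 TB**.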
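The sizing above can be checked with a few lines of arithmetic. This is a minimal sketch: the function names are hypothetical, decimal units (1 GB = 10^9 bytes) are assumed, and the 200 GB per-node figure is taken from the text rather than derived.

```python
# Back-of-the-envelope checkpoint sizing (decimal units: 1 GB = 1e9 bytes).
GB = 1e9
TB = 1e12


def weight_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Size of the raw weights alone, without optimizer state or gradients."""
    return num_params * bytes_per_param


def cluster_burst_bytes(per_node_bytes: float, num_nodes: int) -> float:
    """Total bytes written if every node flushes its share at the same time."""
    return per_node_bytes * num_nodes


weights = weight_bytes(70e9)                 # 70B parameters at 16-bit
burst = cluster_burst_bytes(200 * GB, 1024)  # 200 GB per node, from the text

print(f"weights only: {weights / GB:.0f} GB")
print(f"cluster burst: {burst / TB:.1f} TB")
```

The weights come to 140 GB, so the 200 GB figure only makes sense once optimizer state is included or when it describes each node's share of a larger job.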
GDS Efficiency in Checkpointing
- **Operation Mode**: Direct DMA Write (O_DIRECT)
- **Host Interrupt Latency**: Reduced by 85%
- **Total Checkpoint Time**: 4x Faster (on Parallel FS)
- **CPU Savings**: ~50% cycle reclamation
By collapsing the checkpoint time from minutes to seconds, GDS increases the **Effective TFLOPS** of the cluster. Every second spent checkpointing is a second in which H100 cores sit idle, and at cluster scale that idle time represents a substantial loss of compute value.
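How much a faster checkpoint helps depends on how often checkpoints happen. The sketch below assumes each checkpoint blocks training completely; the two-hour interval and four-minute baseline are hypothetical values chosen only for illustration, and `training_fraction` is not part of any library.

```python
def training_fraction(interval_s: float, checkpoint_s: float) -> float:
    """Fraction of wall-clock time spent training when every interval of
    useful work is followed by one blocking checkpoint."""
    return interval_s / (interval_s + checkpoint_s)


interval_s = 2 * 3600    # hypothetical: checkpoint every two hours
baseline_s = 240.0       # hypothetical: four-minute blocking checkpoint
gds_s = baseline_s / 4   # the 4x speedup quoted above

before = training_fraction(interval_s, baseline_s)
after = training_fraction(interval_s, gds_s)

print(f"baseline: {before:.4f}")
print(f"4x faster checkpoint: {after:.4f}")
```

With a long interval the gain is a couple of percentage points. It grows quickly as the interval shrinks, which is why faster checkpoints make more frequent checkpointing practical.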
Fabric Geometry: The PCIe Switch.
GDS performance is fundamentally limited by the physical topology of the PCIe fabric. In a modern AI server (like a DGX H100), the GPUs and NVMe drives are connected via a dedicated High-Speed **PCIe Switch** (e.g., PLX or Broadcom PEX).
Optimal GDS Paths
- **Same PCIe switch**: Peak GDS performance. Data moves from the NVMe port to the GPU port without ever passing through the CPU Root Complex. Latency is sub-microsecond.
- **Across PCIe switches**: Moderate performance. Data must traverse the upstream port of one switch and travel down through another. Latency increases by ~400-600 ns per hop.
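The per-hop figure quoted above gives a quick way to bound the cost of a poor topology. This is a rough sketch that assumes hops add latency linearly; `added_latency_ns` is a hypothetical helper.

```python
HOP_LATENCY_NS = (400, 600)  # per-hop range quoted above, in nanoseconds


def added_latency_ns(extra_hops: int) -> tuple:
    """Return (low, high) bounds on latency added by extra switch hops."""
    low, high = HOP_LATENCY_NS
    return (extra_hops * low, extra_hops * high)


for hops in (0, 1, 2):
    low, high = added_latency_ns(hops)
    print(f"{hops} extra hop(s): +{low} to +{high} ns")
```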
"If the storage is physically distant from the GPU in the PCIe tree, the Root Complex becomes a bottleneck, and the 'bounce' effect is mitigated but the fabric congestion remains."
Storage Ecosystem Compatibility.
Not all filesystems are GDS-aware. To trigger a `cuFile` direct DMA, the underlying storage driver must implement specific NVIDIA-defined hooks.
Parallel File Systems
**Lustre, BeeGFS, and IBM Spectrum Scale (GPFS)**. These are the gold standards for GDS. They use distributed data striping to feed multiple GPUs simultaneously at TB/s aggregate rates.
Software-Defined Storage
**WEKA.io and VAST Data**. These modern stacks are built with GDS as a native citizen. WEKA, in particular, leverages its zero-copy architecture to outperform traditional Lustre in high-file-count AI workloads.
The Math of I/O Determinism.
Performance in GDS is not just about throughput; it is about **Predictability**. In synchronous training, a single "tail latency" event on one GPU's storage read can stall the entire 32K GPU cluster.
We model the total I/O latency ($L_{total}$) as: $L_{total} = L_{storage} + L_{fabric} + L_{stack} + L_{copy}$.
In traditional I/O, $L_{stack}$ (kernel overhead) and $L_{copy}$ (CPU memory move) are high and stochastic. GDS reduces $L_{stack}$ to almost zero and removes $L_{copy}$ entirely. At 100 GB/s, a 1 GB read takes **10 ms**. If the CPU "bounce" adds even **5 ms** of jitter, the read takes 15 ms: a 50% increase in latency, or roughly a one-third drop in effective throughput.
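The worked numbers can be reproduced directly from the latency model. In this sketch the function names are hypothetical, the whole transfer time is attributed to the storage term, and all 5 ms of jitter to the copy term; both are assumptions made only to keep the example small.

```python
def total_latency_ms(storage_ms: float, fabric_ms: float,
                     stack_ms: float, copy_ms: float) -> float:
    """L_total = L_storage + L_fabric + L_stack + L_copy, in milliseconds."""
    return storage_ms + fabric_ms + stack_ms + copy_ms


def transfer_ms(size_gb: float, bandwidth_gb_s: float) -> float:
    """Time to move size_gb at a sustained bandwidth, in milliseconds."""
    return size_gb / bandwidth_gb_s * 1000.0


base_ms = transfer_ms(1.0, 100.0)  # 1 GB at 100 GB/s
jittered_ms = total_latency_ms(base_ms, 0.0, 0.0, 5.0)

latency_increase = jittered_ms / base_ms - 1.0
effective_gb_s = 1.0 / (jittered_ms / 1000.0)
throughput_loss = 1.0 - effective_gb_s / 100.0

print(f"base: {base_ms:.1f} ms, with jitter: {jittered_ms:.1f} ms")
print(f"latency increase: {latency_increase:.0%}")
print(f"effective throughput: {effective_gb_s:.1f} GB/s "
      f"({throughput_loss:.0%} lower)")
```

The same 5 ms reads as a 50% penalty when measured as latency and as about 33% when measured as throughput, so it helps to state which metric is meant.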
PCIe Gen5 Saturation
A x16 PCIe Gen5 slot has a theoretical peak of about **63 GB/s** in each direction (32 GT/s per lane × 16 lanes with 128b/130b encoding). For a server with 8 GPUs and 8 corresponding NVMe storage paths, the aggregate ingest can exceed **500 GB/s per server**. Sustaining this without GDS would consume a large share of a dual-socket Sapphire Rapids system's cores and memory bandwidth just to shuffle bytes.
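The per-slot and per-server figures follow from the PCIe Gen5 signaling rate. The sketch below assumes 32 GT/s per lane and 128b/130b line encoding, and ignores packet and protocol overhead, so real payload throughput is lower; `pcie_gb_s` is a hypothetical helper.

```python
def pcie_gb_s(gt_per_s: float, lanes: int, encoding: float = 128 / 130) -> float:
    """Unidirectional bandwidth in GB/s after line encoding,
    before packet and protocol overhead."""
    return gt_per_s * lanes * encoding / 8  # 8 bits per byte


gen5_x16 = pcie_gb_s(32, 16)  # one Gen5 x16 link
server = 8 * gen5_x16         # eight such links, one per GPU

print(f"Gen5 x16: {gen5_x16:.2f} GB/s per direction")
print(f"8 links:  {server:.1f} GB/s aggregate")
```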
Throughput Benchmarks
"Benchmarks simulated on Gen5 x16 fabric with CUFILE v1.9+ and GDS-enabled NVMe-oF target."
Storage & GDS Encyclopedia.
CUFILE (libcufile.so)
The NVIDIA user-space library that provides the API for GDS. It handles buffer registration, path optimization, and fallback to legacy I/O if hardware constraints aren't met.
O_DIRECT
A Linux file flag that bypasses the host page cache. GDS requires O_DIRECT so that the kernel does not stage the data in host memory on its way to the GPU.
DMA (Direct Memory Access)
The ability of a hardware component (like an NVMe controller) to access system memory independently of the CPU.
NVMe-oF
A protocol that extends NVMe commands over networking fabrics like InfiniBand or Ethernet using RDMA, enabling remote GDS.
Pinned Memory
Memory that is locked into physical RAM and cannot be swapped to disk by the OS. Necessary for DMA operations to be safe.
PCIe BAR
Base Address Registers used by a device to map internal memory into host physical address space. With GDS, the storage device's DMA targets GPU memory exposed through the GPU's BAR.
IOVA
I/O Virtual Address. GDS handles the translation between Storage device IOVA and GPU memory space to ensure security and isolation.
Interrupt Coalescing
Reducing the rate of I/O interrupts sent to the CPU, allowing for higher efficiency in the storage driver during heavy throughput.
Zero-Copy
The elimination of redundant data copies within system memory. GDS enables "True Zero-Copy" from disk to high-bandwidth memory (HBM).
Parallel FS
Lustre or BeeGFS. Filesystems that stripe data across multiple nodes to maximize aggregate throughput, essential for feeding 100 GB/s GDS links.
Burst Buffer
A high-speed intermediate storage layer used to absorb massive checkpointing writes before trickling them down to slower permanent storage.
WEKA
A software-defined storage platform that utilizes a proprietary protocol to deliver unified GDS performance across cloud and edge.
Root Complex
The hub of the CPU's PCIe connectivity. GDS aims to keep data lower in the switch fabric, avoiding the Root Complex whenever possible.
Peer-to-Peer
Direct device-to-device communication on the PCIe bus. GDS is technically a specialized form of Storage-to-GPU Peer-to-Peer DMA.
Scatter-Gather
The ability of an I/O controller to read/write multiple non-contiguous memory segments in a single transactional burst.
IOMMU
Memory management unit for I/O. GDS requires a modern IOMMU with high-throughput translation buffers to avoid bottlenecking the address resolution.
Aggregate Ingest (Gen5 x16): 112.4 GB/s peak.
Technical metrics extracted from CUFILE v1.9 benchmarks on H100 systems. Real-world performance is subject to filesystem caching, PCIe switch oversubscription, and storage target latency.
© 2026 Pingdo Labs. Technical Reference Series No. 22.
Barrierless I/O: The GDS Synchronization Model
One of the most misunderstood aspects of GPUDirect Storage is its synchronization model. Traditional I/O requires the application to block on a system call or poll for completion before it can launch the compute that consumes the data. GDS can avoid that host-side stall by ordering I/O against compute on a CUDA stream.
The basic **cuFileRead** and **cuFileWrite** calls are synchronous: each takes a file handle, a file offset, a GPU buffer address, and a size, and returns only after the transfer has completed. To overlap I/O with compute, the cuFile API also provides batch submission (**cuFileBatchIOSubmit**) and, in recent CUDA releases, stream-ordered variants (**cuFileReadAsync** and **cuFileWriteAsync**). A stream-ordered call enqueues the transfer on a CUDA stream and returns immediately; the transfer then executes in stream order alongside kernels and memory copies.
This removes the need for **cuCtxSynchronize** barriers between I/O and compute. A kernel enqueued on the same stream after a **cuFileReadAsync** does not start until the read has completed, so the host thread never has to wait for the data itself. Ordering is enforced at the granularity of stream operations, not individual cache lines: a kernel must not touch the destination buffer before the read that fills it has finished in stream order.
The practical benefit is significant. In a traditional pipeline, checkpointing a 175B-parameter model stalls every GPU while the state is drained through host memory. With GDS, the HBM-to-storage transfer can overlap with the next training step, because the data is moved by the storage controller's or NIC's DMA engine while the Tensor Cores continue their matmul operations, provided the buffers being written are not modified until the write completes. End-to-end checkpoint time can drop from minutes to seconds, which lets the Mean Time Between Checkpoints (MTBC) be shortened and allows more aggressive fault-tolerance strategies.
CUFILE Read-Ahead Policies and Prefetch Window Tuning
The cuFile API that underpins GPUDirect Storage is described here as exposing fine-grained control over read-ahead and prefetch behavior through a configuration parameter referred to as **cuFileReadAhead**. Understanding and tuning this parameter is essential for approaching the theoretical throughput of GDS, because sequential read performance is dominated by how aggressively the driver prefetches data from storage into GPU HBM before the application requests it.
The read-ahead window is specified as a number of **Prefetch Chunks** — contiguous 16 MB regions that the driver speculatively submits to the storage subsystem. When an application issues a cuFileRead for byte range [X, X+N), the driver immediately submits speculative reads for chunks [X+N, X+N+W) where W is the prefetch window size (default 4 chunks = 64 MB). These speculative reads are issued asynchronously via the GDS channel while the application begins processing the requested data. If the application's next read falls within the prefetched range, the data is already resident in HBM and the read completes in under 1 microsecond instead of the full storage latency.
The prefetch window must be tuned based on the workload's access pattern. For sequential checkpoint reads (reading the entire model weights in order), a large prefetch window (32 chunks = 512 MB) achieves 97% prefetch hit rate, hiding the full storage latency behind computation. For random access patterns (loading individual expert weights in an MoE model), the prefetch hit rate drops below 30% because the access pattern is unpredictable. In this case, a smaller prefetch window (2 chunks = 32 MB) reduces wasted bandwidth from speculatively loaded data that is never used, improving effective throughput by 15-20% compared to the default 64 MB window.
The **cuFileReadAheadThrottle** parameter controls the maximum number of concurrent prefetch DMA operations. At the default value of 4, the driver can have up to 4 outstanding 16 MB reads in flight, for 64 MB of data in the prefetch pipeline. Increasing the throttle to 8 doubles the in-flight prefetch data to 128 MB and improves sequential read throughput by 22% on 400 Gbps fabrics. The tradeoff is 128 MB of HBM consumed by the prefetch buffer, which is a small fraction of an H100's 80 GB of HBM and smaller still, in relative terms, on an H200 with 141 GB. The optimal configuration for large-batch checkpoint workloads is a window of 16 chunks (256 MB) with a throttle of 16, providing 1.5 GB/s of sustained GDS read throughput per GPU. At that rate a 175B-parameter checkpoint (about 350 GB of 16-bit weights) loads in under 2 minutes only when it is sharded across GPUs: spread over 8 GPUs, each reads about 44 GB in roughly 30 seconds.
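The window, throttle, and load-time figures in this section are simple multiples of the 16 MB chunk size, and can be checked as follows. This is a sketch of the arithmetic only: the helper names are hypothetical, it does not call or configure cuFile, and it assumes a 175B-parameter checkpoint consists of 16-bit weights (350 GB) read at the quoted 1.5 GB/s per GPU.

```python
CHUNK_MB = 16  # prefetch chunk size quoted above


def window_mb(chunks: int) -> int:
    """Size of the speculative read-ahead window in MB."""
    return chunks * CHUNK_MB


def in_flight_mb(throttle: int) -> int:
    """Data held in outstanding prefetch reads at a given throttle, in MB."""
    return throttle * CHUNK_MB


def load_seconds(checkpoint_gb: float, gb_s_per_gpu: float, num_gpus: int) -> float:
    """Load time when the checkpoint is split evenly across GPUs."""
    return checkpoint_gb / (gb_s_per_gpu * num_gpus)


checkpoint_gb = 175e9 * 2 / 1e9  # 175B parameters at 2 bytes each

for chunks in (2, 4, 16, 32):
    print(f"window of {chunks} chunks = {window_mb(chunks)} MB")
print(f"throttle 8 holds {in_flight_mb(8)} MB in flight")

single_gpu_s = load_seconds(checkpoint_gb, 1.5, 1)
eight_gpu_s = load_seconds(checkpoint_gb, 1.5, 8)
print(f"1 GPU: {single_gpu_s:.0f} s, 8 GPUs: {eight_gpu_s:.0f} s")
```

Under these assumptions a single GPU at 1.5 GB/s needs close to four minutes, so a two-minute load implies the checkpoint is sharded across at least two GPUs.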
