Why is standard Ethernet insufficient for AI training?

Standard Ethernet deployments typically carry TCP, which achieves reliability through retransmission and adds significant CPU overhead. AI training needs microsecond-level latency and very high throughput to keep GPUs synchronized, which is why RDMA-based technologies like InfiniBand or RoCE v2 are used instead.

What is RDMA and why does it matter?

RDMA (Remote Direct Memory Access) allows one computer to read or write another computer's memory directly. The NICs move the data, so the data path involves neither machine's OS kernel nor its CPU. This bypasses the 'TCP bottleneck' and dramatically reduces latency.

InfiniBand vs RoCE v2: Which is better for AI?

InfiniBand offers the lowest latency and is 'Lossless' natively, making it a favorite for Tier-1 AI labs. RoCE v2 (RDMA over Converged Ethernet) runs on standard Ethernet hardware, making it more cost-effective and more familiar to Ethernet operations teams, though it requires careful PFC/ECN configuration to achieve lossless behavior.

AI Networking Infrastructure: The GPU-Centric Fabric

The AI Revolution is a Network Revolution

When we talk about Artificial Intelligence, we focus on GPUs (NVIDIA H100s, B200s). But a single GPU cannot train a Large Language Model (LLM) on its own. Training requires *thousands* of GPUs to act as a single, unified computer. The "Glue" that makes this possible is the **Backend Network Fabric**.

In AI networking, standard enterprise rules don't apply. We don't care about "Reliability through Retransmission" (TCP); we care about "Zero-Packet-Loss" and "Microsecond Latency." Because every GPU in a collective operation waits on the slowest transfer, a single dropped packet can stall the entire training job for milliseconds, and across thousands of GPUs that idle time translates directly into wasted compute spend.

AI Fabric Architecture

[Interactive diagram: data flow model simulating a high-performance backend interconnect. Two H100 nodes exchange VRAM contents over an 800 Gbps RDMA fabric using kernel bypass while both CPUs stay idle. Readouts: latency 0.8 µs with low jitter; fabric status converged; RDMA engine active (v2); flow control PFC/ECN capable.]

"The transition from lossy to lossless networking is the single most expensive and critical step in AI infra design."

1. RDMA: Direct Memory Access

Standard networking (TCP/IP) is too slow for AI: the CPU spends too many cycles processing headers and copying data between buffers. **RDMA (Remote Direct Memory Access)** allows GPU A in Rack 1 to read data directly from the VRAM of GPU B in Rack 50 without involving the CPUs of either server (on NVIDIA hardware this GPU-to-NIC path is called GPUDirect RDMA).

Zero-Copy

Data doesn't need to be copied into multiple buffers, reducing latency and CPU cycles.

Kernel Bypass

The application talks directly to the Network Card (NIC), skipping the OS overhead.
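The zero-copy idea can be illustrated with a loose, single-process analogy in Python. This is not RDMA code: `zero_copy_demo` is a hypothetical helper name, and the `bytearray` merely stands in for a registered memory region. It shows the difference between handing around a copy of a buffer and handing around a reference to the same bytes.

```python
def zero_copy_demo(size: int = 1 << 20) -> tuple[bool, bool]:
    """Return (slice_is_copy, view_shares_memory) for a buffer of `size` bytes."""
    source = bytearray(size)              # stands in for a registered memory region

    copied = bytes(source[:4096])         # slicing a bytearray allocates new memory
    view = memoryview(source)[:4096]      # a memoryview slice references the same bytes

    source[0] = 0xFF                      # mutate the original buffer afterwards

    slice_is_copy = copied[0] == 0        # the copy never sees the later write
    view_shares_memory = view[0] == 0xFF  # the view does, because nothing was copied
    return slice_is_copy, view_shares_memory


if __name__ == "__main__":
    print(zero_copy_demo())
```

In a real RDMA stack the equivalent of the `memoryview` is a memory region registered with the NIC, which the NIC then reads or writes by DMA without intermediate staging buffers.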

2. The Two Contenders: InfiniBand vs. RoCE v2

InfiniBand

InfiniBand is a dedicated networking technology designed specifically for HPC. It is natively "Lossless": link-level credit-based flow control means a sender only transmits when the receiver has buffer space, so no packet is dropped due to congestion.

Engineering Profile

Lowest Tail Latency
Highest Efficiency
Effectively Single-Vendor Ecosystem

RoCE v2

RoCE v2 wraps RDMA inside standard UDP/IP/Ethernet packets. This allows it to run on standard Ethernet hardware from any major vendor.

Engineering Profile

Multi-Vendor Silicon
Complex PFC/ECN Tuning
Cost-Effective Scale
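To make the RoCE v2 encapsulation concrete, the sketch below adds up the per-packet header cost of wrapping an RDMA payload in UDP/IP/Ethernet. The header sizes are the standard ones (Ethernet 14 bytes, IPv4 20, UDP 8, InfiniBand Base Transport Header 12, ICRC 4, Ethernet FCS 4) and 4791 is the UDP destination port assigned to RoCE v2. The function names are hypothetical, and the model assumes IPv4 with no extended transport headers, preamble, or inter-frame gap.

```python
ROCE_V2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCE v2

# Per-packet overhead in bytes (IPv4, no IP options, no extended IB headers).
ETHERNET_HEADER = 14
VLAN_TAG = 4             # optional 802.1Q tag, commonly present when PFC is used
IPV4_HEADER = 20
UDP_HEADER = 8
IB_BTH = 12              # InfiniBand Base Transport Header
IB_ICRC = 4              # invariant CRC carried after the payload
ETHERNET_FCS = 4


def roce_v2_overhead_bytes(vlan: bool = False) -> int:
    """Bytes added around each RDMA payload by RoCE v2 encapsulation."""
    overhead = (ETHERNET_HEADER + IPV4_HEADER + UDP_HEADER
                + IB_BTH + IB_ICRC + ETHERNET_FCS)
    return overhead + (VLAN_TAG if vlan else 0)


def payload_efficiency(payload_bytes: int, vlan: bool = False) -> float:
    """Fraction of each frame that is RDMA payload."""
    return payload_bytes / (payload_bytes + roce_v2_overhead_bytes(vlan))


if __name__ == "__main__":
    for mtu_payload in (1024, 4096):
        print(mtu_payload, round(payload_efficiency(mtu_payload), 4))
```

Under these assumptions a 4096-byte payload carries 62 bytes of encapsulation, so the header cost of running RDMA over Ethernet is small compared with the operational cost of tuning PFC and ECN.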

3. Topology: Non-Blocking Fat-Trees

Standard networks use "Oversubscription" (assuming not everyone talks at once). AI assumes **everyone is talking at once, at full speed**. We use **Clos Topologies (Fat-Trees)** with a 1:1 oversubscription ratio, meaning every switch has as much uplink bandwidth as downlink bandwidth.

Architect's Insight

This means every GPU has an unobstructed "Clear Path" to every other GPU at 400 Gbps or 800 Gbps, a property usually called full bisection bandwidth. This requires a massive number of high-radix switches and a "Forest" of fiber optic cables.
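The scale of that switch and cable count can be estimated with the textbook k-ary fat-tree formulas. The sketch below assumes a classic three-tier fat-tree built entirely from identical k-port switches with one fabric port per GPU; real AI clusters often use rail-optimized variants, so treat the numbers as an order-of-magnitude guide. The function names are hypothetical.

```python
def fat_tree_size(k: int) -> dict[str, int]:
    """Element counts for a non-blocking 3-tier fat-tree of k-port switches."""
    if k <= 0 or k % 2:
        raise ValueError("switch radix k must be a positive even number")
    return {
        "hosts": k ** 3 // 4,          # k pods * (k/2 edge switches) * (k/2 hosts each)
        "edge_switches": k * k // 2,
        "aggregation_switches": k * k // 2,
        "core_switches": k * k // 4,
        "total_switches": 5 * k * k // 4,
        "host_links": k ** 3 // 4,
        "fabric_links": k ** 3 // 2,   # edge-to-aggregation plus aggregation-to-core
    }


def oversubscription(downlinks: int, uplinks: int) -> float:
    """Ratio of downstream to upstream capacity at a switch (1.0 = non-blocking)."""
    return downlinks / uplinks


if __name__ == "__main__":
    size = fat_tree_size(64)
    print(size["hosts"], "endpoints need", size["total_switches"], "switches")
    print("edge oversubscription:", oversubscription(32, 32))
```

With 64-port switches this model gives 65,536 endpoints served by 5,120 switches and 131,072 switch-to-switch links, which is where the "Forest" of fiber comes from.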

The Future: 800G and Beyond

As LLMs grow from 175B parameters toward 10T+, fabric bandwidth has to keep pace; a rough rule of thumb is a doubling every 18 months. We are already seeing the deployment of **800G OSFP** optics and the rise of **Optical Circuit Switching (OCS)**, where mirrors literally reflect laser beams to change network paths on demand, without converting the light to electrical signals along the way.

Conclusion: The Network is the Computer

We have entered the era where the network is no longer a utility; it is a core component of the compute engine. The engineers who can bridge the gap between "Distributed Systems" and "High-Speed Optics" are the ones who will build the infrastructure that powers the next generation of intelligence.

Infrastructure Engineering Series


Flowlet Switching Granularity Tradeoffs

Flowlet switching is a widely used mechanism for load-balancing RDMA traffic across multi-path fabrics without requiring per-packet reordering buffers. A flowlet is a burst of packets from a single TCP or RDMA flow that is separated from the next burst by an idle gap longer than a configurable threshold. When the gap exceeds this threshold, the switch may rehash the next flowlet to a different path. The gap threshold is the critical tuning parameter, and its optimal value depends on the round-trip time of the fabric.

In 800G AI fabrics with 280 ns per-hop switch latency and a 3-tier topology, the one-way switch latency between any two GPUs is approximately 1.4 μs (5 hops × 280 ns). The total round-trip time including NIC processing is roughly 5 μs. The flowlet gap must be set above this RTT so that the switch does not move a flow to a new path while packets from the previous flowlet are still in flight on the old one. If the gap is set too low (e.g., 1 μs), the switch rehashes mid-transmission, so packets from the same RDMA message travel over different paths and can arrive out of order at the destination NIC. If the destination NIC does not support out-of-order reception, and most 2025-era ConnectX-7 NICs do not, this triggers a Go-Back-N retransmission that can collapse throughput by 60-80%. A safe lower bound for the gap threshold is 3× the fabric RTT, or 15 μs. However, setting the gap too high (above 100 μs) means idle gaps rarely exceed the threshold, which prevents the system from reacting to congestion within the same RDMA message.
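The arithmetic in this paragraph can be reproduced with a few lines of Python. The 280 ns hop latency, 5 hops, 5 μs RTT, 3× safety factor, and 100 μs upper limit all come from the text; the function names and the three-way classification are illustrative assumptions.

```python
def one_way_latency_us(hops: int, per_hop_ns: float) -> float:
    """Switch-only one-way latency in microseconds."""
    return hops * per_hop_ns / 1000.0


def min_flowlet_gap_us(rtt_us: float, safety_factor: float = 3.0) -> float:
    """Lower bound for the flowlet gap: a multiple of the fabric RTT."""
    return safety_factor * rtt_us


def classify_gap(gap_us: float, rtt_us: float, max_gap_us: float = 100.0) -> str:
    """Label a candidate gap threshold against the bounds described in the text."""
    if gap_us < min_flowlet_gap_us(rtt_us):
        return "too low"    # risks reordering and Go-Back-N retransmission
    if gap_us > max_gap_us:
        return "too high"   # flow stays pinned to a congested path
    return "ok"


if __name__ == "__main__":
    print(one_way_latency_us(hops=5, per_hop_ns=280))   # switch latency, one way
    print(min_flowlet_gap_us(rtt_us=5.0))                # lower bound for the gap
    for gap in (1.0, 15.0, 50.0, 150.0):
        print(gap, classify_gap(gap, rtt_us=5.0))
```

Note that the 5 μs RTT is larger than twice the 1.4 μs switch latency; the remaining time is attributed to NIC processing at both ends.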

The 2026 approach, as deployed in NVIDIA Spectrum-4 switches, uses adaptive flowlet gap tuning. Each switch port monitors the inter-arrival time of packets per flow using hardware timestamping. When congestion builds and queue depths exceed 50% of buffer capacity, the switch dynamically lowers the flowlet gap from 100 μs to 10 μs, forcing more aggressive rebalancing. That value sits below the 15 μs static bound, so the switch is deliberately accepting some reordering risk in exchange for faster relief of the congested path. When congestion subsides, the gap relaxes again to prevent excessive out-of-order delivery. This adaptive scheme is reported to achieve 95% fabric utilization in production clusters, compared to 72% with a static 50 μs gap and 62% with pure ECMP hashing.
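The adaptive behavior described here can be sketched as a small simulation. The 100 μs and 10 μs gaps and the 50% queue threshold come from the text; the `FlowletSwitch` class, its method names, and the round-robin path choice (a stand-in for whatever load-aware selection real hardware uses) are assumptions made for illustration and do not reflect any vendor's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class FlowletSwitch:
    """Toy model of flowlet switching with a congestion-dependent gap."""
    num_paths: int
    relaxed_gap_us: float = 100.0       # gap used when the queue is shallow
    aggressive_gap_us: float = 10.0     # gap used when the queue is congested
    congestion_threshold: float = 0.5   # fraction of buffer capacity
    _last_seen_us: dict = field(default_factory=dict)
    _path: dict = field(default_factory=dict)
    _next_path: int = 0

    def gap_us(self, queue_fill: float) -> float:
        """Pick the flowlet gap based on current queue occupancy (0.0 to 1.0)."""
        if queue_fill > self.congestion_threshold:
            return self.aggressive_gap_us
        return self.relaxed_gap_us

    def route(self, flow_id: str, now_us: float, queue_fill: float) -> int:
        """Return the path index for a packet of `flow_id` arriving at `now_us`."""
        last = self._last_seen_us.get(flow_id)
        if last is None or now_us - last > self.gap_us(queue_fill):
            # Idle gap exceeded: this packet starts a new flowlet, so the
            # flow may safely move to another path.
            self._path[flow_id] = self._next_path
            self._next_path = (self._next_path + 1) % self.num_paths
        self._last_seen_us[flow_id] = now_us
        return self._path[flow_id]


if __name__ == "__main__":
    switch = FlowletSwitch(num_paths=4)
    print(switch.route("flow-a", now_us=0.0, queue_fill=0.1))
    print(switch.route("flow-a", now_us=20.0, queue_fill=0.1))  # same flowlet
    print(switch.route("flow-a", now_us=40.0, queue_fill=0.9))  # rebalanced
```

The same 20 μs idle gap keeps the flow on its path when the queue is shallow but triggers a path change once occupancy crosses the threshold, which is the mechanism the paragraph describes.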

ECN-Integrated Credit-Based Flow Control for Converged Fabrics

The convergence of InfiniBand's credit-based flow control with Ethernet's ECN marking represents the frontier of AI fabric design. Pure credit-based flow control (CBFC), as used in InfiniBand, prevents packet loss by ensuring the sender never transmits more data than the receiver's buffer can hold. ECN marking, as used in RoCE v2, allows the network itself to signal impending congestion before buffers overflow. The Ultra Ethernet Consortium (UEC) is pioneering a hybrid architecture that combines both mechanisms to achieve deterministic losslessness with feedback-driven rate adaptation.

In a UEC converged fabric, each switch port maintains a per-flow credit counter, similar in spirit to InfiniBand's credits, which are tracked per virtual lane (VL) rather than per flow. When a flow's credit balance reaches zero, the sender must pause transmission for that specific flow, but crucially it can continue transmitting other flows that still have available credits. This provides the head-of-line blocking prevention that makes InfiniBand superior to PFC-based Ethernet. However, unlike InfiniBand, where credits are returned only when the receiver consumes data, UEC adds an ECN feedback loop: when the switch's shared buffer occupancy exceeds 50%, the switch begins ECN-marking packets even for flows with available credits. The sender interprets this as a pre-emptive signal to reduce its sending rate for that flow before it runs out of credits entirely.

The integration provides two distinct advantages. First, it eliminates the **credit stall problem** in pure InfiniBand, where a slow consumer on the receiver side prevents credit return even though the network itself has abundant capacity. The ECN signal bypasses the receiver's credit return logic and directly informs the sender that the network is congested. Second, it prevents the **buffer bloat problem** in pure ECN, where the feedback loop's reaction time (typically 2-5 RTTs) allows queues to grow before the sender slows down. In the hybrid scheme, the credit mechanism provides instantaneous per-flow backpressure, while ECN provides the medium-term rate adaptation that prevents long-lived congestion.
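The interaction between the two mechanisms can be sketched as a per-flow sender model. Only the 50% marking threshold comes from the text. The `HybridFlow` class, the halving of the rate on an ECN mark, and the rate floor are assumptions chosen to keep the example small; they are not taken from the UEC specification.

```python
from dataclasses import dataclass


def switch_marks_ecn(buffer_fill: float, threshold: float = 0.5) -> bool:
    """Switch side: mark packets once shared-buffer occupancy exceeds the threshold."""
    return buffer_fill > threshold


@dataclass
class HybridFlow:
    """Sender-side state for one flow under credit plus ECN control."""
    credits: int                  # packets the flow may still send
    rate_gbps: float              # current pacing rate
    min_rate_gbps: float = 1.0    # assumed floor so the rate never reaches zero
    ecn_backoff: float = 0.5      # assumed multiplicative decrease per marked ACK

    def can_send(self) -> bool:
        """Credits give immediate, per-flow backpressure."""
        return self.credits > 0

    def on_send(self) -> None:
        if not self.can_send():
            raise RuntimeError("flow is stalled: no credits available")
        self.credits -= 1

    def on_credit_return(self, count: int = 1) -> None:
        self.credits += count

    def on_ack(self, ecn_marked: bool) -> None:
        """ECN gives medium-term rate adaptation, independent of credit state."""
        if ecn_marked:
            self.rate_gbps = max(self.min_rate_gbps,
                                 self.rate_gbps * self.ecn_backoff)


if __name__ == "__main__":
    busy = HybridFlow(credits=1, rate_gbps=400.0)
    idle = HybridFlow(credits=8, rate_gbps=400.0)
    busy.on_send()
    busy.on_ack(ecn_marked=switch_marks_ecn(buffer_fill=0.6))
    # `busy` is out of credits and has slowed down; `idle` is unaffected.
    print(busy.can_send(), busy.rate_gbps, idle.can_send(), idle.rate_gbps)
```

Keeping the two signals in separate methods mirrors the division of labor in the text: a flow with no credits stops immediately without blocking other flows, while an ECN mark slows a flow that still holds credits.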

In practice, the hybrid approach is reported to add approximately 8% to the switch ASIC logic area (for the combined credit tracking and ECN marking state machines) while reducing tail latency by 60% compared to pure ECN and eliminating the head-of-line blocking that afflicts pure PFC fabrics. Deployments at 100,000-GPU scale are reported to maintain 98.5% link utilization during All-Reduce synchronization, compared to 91% for pure ECN and 93% for pure CBFC. The UEC 1.1 specification reportedly mandates this hybrid architecture for all certified switches, which would make it the de facto standard for next-generation AI clusters.
