NVIDIA InfiniBand NDR800 vs. NDR400: The Quantum-3 Evolution
Architecting for the Exascale AI Generation
The move from NVIDIA's Quantum-2 (NDR, 400G) to Quantum-3 (NDR800, 800G) is not merely a bandwidth doubling; it is a structural realignment of the AI data center. For the first time, the industry is grappling with the physical limits of copper, the thermal reality of 25W transceivers, and the signaling demands of 224G SerDes lanes.
Quantum-2 (NDR400)
- Throughput: 400 Gbps per port (Standard 4-lane)
- Switch Capacity: 51.2 Tbps Aggregate
- SerDes: 112G PAM4 (mature physical layer)
- Radix: Up to 64 physical ports
Quantum-3 (NDR800)
- Throughput: 800 Gbps per port (Native 800G)
- Switch Capacity: 102.4 Tbps (World Leading)
- SerDes: 224G PAM4 (Quantum-3 Silicon)
- Radix: Massive 72/144 port configurations
1. The ASIC Breakthrough: 102.4 Tbps Architecture.
The heart of NDR800 is the **Quantum-3 switch silicon**. Scaling from 51.2T to 102.4T requires more than doubling the transistor count; it demands a total redesign of the packet buffer architecture and crossbar arbitration. In modern LLM training, a single 102.4T switch can replace four legacy 25.6T switches in a Fat-Tree architecture, drastically reducing the "hop latency" that plagues large clusters.
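To see why radix matters as much as raw bandwidth, the back-of-envelope sketch below works through the standard 2-tier non-blocking fat-tree construction in Python; the radix values come from the comparison above, and the leaf/spine math is textbook topology sizing rather than NVIDIA deployment guidance.

```python
# Back-of-envelope: how far a 2-tier, non-blocking fat-tree scales with switch
# radix. Radix values are taken from the comparison above; the topology math
# is the standard leaf/spine construction, not vendor sizing guidance.

def two_tier_fat_tree(radix: int):
    """Max endpoints and switch counts for a non-blocking 2-tier fat-tree."""
    down_ports = radix // 2      # leaf ports facing GPUs/NICs
    spines = radix // 2          # each leaf's uplinks land on distinct spines
    leaves = radix               # each spine can reach at most `radix` leaves
    return leaves * down_ports, leaves, spines

for label, radix in [("Quantum-2 class (radix 64)", 64),
                     ("Quantum-3 class (radix 144)", 144)]:
    endpoints, leaves, spines = two_tier_fat_tree(radix)
    print(f"{label}: {endpoints:,} endpoints, "
          f"{leaves + spines} switches ({leaves} leaf + {spines} spine)")
```

Running this reproduces the 2,048-GPU versus 10,368-GPU figures in the benchmark table at the end of this article: the higher radix flattens the fabric and removes an entire switching tier's worth of hops.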
The Quantum-3 ASIC implements a **Shared-Pool Buffer** strategy. Unlike traditional switches that partition memory per port, Quantum-3 treats the entire on-chip SRAM as a unified pool accessible by any port experiencing a congestion micro-burst. This is critical for AI "All-to-All" patterns where traffic is highly bursty and synchronized.
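A minimal toy model of the difference, with buffer and burst sizes that are purely illustrative assumptions rather than Quantum-3 parameters:

```python
# Toy model: drops under a synchronized micro-burst with partitioned vs.
# shared packet buffers. All sizes are illustrative, not Quantum-3 values.

TOTAL_BUFFER_MB = 64          # total on-chip SRAM (assumed)
PORTS = 64
PER_PORT_MB = TOTAL_BUFFER_MB / PORTS

def dropped_mb(bursts_mb, per_port_limit=None, shared_pool=None):
    """Return MB dropped, given either a fixed per-port limit or one shared pool."""
    if shared_pool is not None:
        return max(0.0, sum(bursts_mb) - shared_pool)
    return sum(max(0.0, b - per_port_limit) for b in bursts_mb)

# All-to-All style incast: 8 ports each absorb a 4 MB burst, the rest are idle.
bursts = [4.0] * 8 + [0.0] * (PORTS - 8)

print("partitioned drops:", dropped_mb(bursts, per_port_limit=PER_PORT_MB), "MB")
print("shared-pool drops:", dropped_mb(bursts, shared_pool=TOTAL_BUFFER_MB), "MB")
```

The partitioned model drops traffic as soon as any single port exceeds its slice, while the shared pool absorbs the same burst untouched because the aggregate demand still fits in SRAM.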
[Chart: Quantum-3 Scaling Analysis, predicting the generational jump in AI fabric efficiency; panels: bandwidth vs. port density, power efficiency (pJ/bit)]
- **Quantum-3 ASIC**: OSFP800 support with integrated laser drivers, reducing optical interface power by ~40% versus Quantum-2's discrete modules.
- **Zero-Error Fabric**: Advanced Forward Error Correction (FEC) algorithms tailored for NDR800 bursts, maintaining bit-error rates below 10⁻¹⁵ across 2km distances.
- **Rack Consolidation**: Moving to 800G allows a 50% reduction in switch-to-switch cabling, directly reducing cooling overhead and improving rack-unit efficiency in exascale pods.
2. The Physical Frontier: SerDes 112G vs. 224G.
At the physical layer, NDR800 is the first mass-production fabric to push **224G SerDes** lanes. The transition from 112G to 224G is arguably the hardest electrical-signaling jump the industry has attempted. At 224G PAM4 the Nyquist frequency sits near 56 GHz, where trace runs in standard PCB materials (FR4, and even low-loss laminates like Megtron 7) suffer severe attenuation and signal reflection.
To combat this, NVIDIA and its partners have moved toward **Flyover Cables**—internal micro-coaxial cables that bypass the PCB traces entirely, connecting the ASIC directly to the OSFP port. This "Cable-over-PCB" architecture reduces insertion loss by up to 10dB, which is the difference between a stable 800G link and a total signal collapse.
Furthermore, 224G signaling requires an even more aggressive **Pre-emphasis and Equalization** strategy. The Quantum-3 SerDes uses a 7-tap FFE (Feed-Forward Equalizer) combined with a DFE (Decision Feedback Equalizer) to reconstruct the signal from the noise. This computational intensity is why the switch power envelope has ballooned to over 1000W.
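As a rough numerical illustration of what feed-forward equalization does, the sketch below fits a 7-tap FIR equalizer to a toy ISI channel by least squares; the channel coefficients and PAM4 framing are invented for illustration and bear no relation to the actual Quantum-3 SerDes adaptation loop.

```python
import numpy as np

# Toy ISI channel: each transmitted symbol leaks into neighbouring unit
# intervals (coefficients are illustrative, not a measured 224G channel).
channel = np.array([0.10, 0.25, 1.00, 0.35, 0.15])

# Fit a 7-tap feed-forward equalizer (FFE) by least squares so that
# channel * ffe approximates an ideal single-cursor impulse response.
n_taps = 7
conv_len = len(channel) + n_taps - 1
target = np.zeros(conv_len)
target[conv_len // 2] = 1.0                      # ideal response: no ISI

C = np.zeros((conv_len, n_taps))                 # convolution matrix: C @ ffe == conv(channel, ffe)
for i in range(n_taps):
    C[i:i + len(channel), i] = channel
ffe, *_ = np.linalg.lstsq(C, target, rcond=None)

rng = np.random.default_rng(0)
symbols = rng.choice([-3.0, -1.0, 1.0, 3.0], size=500)   # PAM4 levels
received = np.convolve(symbols, channel, mode="same")     # ISI-corrupted
equalized = np.convolve(received, ffe, mode="same")       # after FFE

for name, sig in [("before FFE", received), ("after  FFE", equalized)]:
    print(f"mean-squared symbol error {name}: {np.mean((sig - symbols) ** 2):.3f}")
```

Even this crude zero-forcing fit collapses the inter-symbol interference dramatically; the real SerDes does the equivalent adaptation continuously, in analog and digital hardware, at 112 GBd, which is where the power goes.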
3. The FEC Burden: Math vs. Nanoseconds.
One of the most misunderstood aspects of 800G InfiniBand is the impact of **Forward Error Correction (FEC)**. In 800G (NDR800), the signal-to-noise ratio is so tight that the link cannot stay up for more than a few seconds without heavy error correction.
- **KP4-FEC (Standard)**: Adds ~18ns of latency. Provides enough 'coding gain' to handle a raw BER (Bit Error Rate) of 1E-4. This is the baseline for 2km optical reaches.
- **Strong-FEC (Extended)**: Adds ~45ns of latency. Mandatory for copper DAC cables longer than 1.5 meters or degraded optical fibers.
For AI training, 45ns might seem trivial, but across a 3-layer fabric, FEC-induced latency can add up to 270ns of round-trip overhead. In latency-sensitive collective operations like **All-Reduce**, this can contribute to a 2-3% drop in aggregate TFLOPS.
[Interactive panel: FEC Latency Comparison, encoding/decoding latency trade-offs; tabs for codeword structure, error-correction capability, and latency breakdown]
For RDMA with a ~2μs RTT, KP4's ~200ns of overhead is only about 10% of total latency; the 7-8 dB of coding gain far outweighs this small penalty for reliable 400G+ links.
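Putting the quoted figures together, here is a small sketch that accumulates only the per-link FEC latencies across a 3-layer path; switch forwarding and serialization delays are deliberately left out, so the percentages are illustrative rather than measured.

```python
# Accumulate the per-link FEC latencies quoted above over a fabric path.
# Only FEC latency is modeled; switch forwarding and serialization delays
# are intentionally ignored, so these fractions are illustrative.

KP4_NS = 18        # per-link KP4-FEC latency (ns), figure from this article
STRONG_NS = 45     # per-link Strong-FEC latency (ns), figure from this article
RTT_US = 2.0       # typical RDMA round-trip time (us)
LINKS_ONE_WAY = 3  # leaf -> spine -> core traversal in a 3-layer fabric

for name, per_link_ns in [("KP4-FEC", KP4_NS), ("Strong-FEC", STRONG_NS)]:
    round_trip_ns = per_link_ns * LINKS_ONE_WAY * 2
    share = round_trip_ns / (RTT_US * 1000) * 100
    print(f"{name}: {round_trip_ns} ns of round-trip FEC overhead "
          f"(~{share:.0f}% of a {RTT_US:.0f} us RTT)")
```

The Strong-FEC case reproduces the 270ns round-trip figure above, and shows how a per-link cost that looks negligible becomes a double-digit percentage of a tight RDMA round trip.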
4. Thermal Complexity: The End of Air Cooling.
A typical NDR800 deployment is no longer air-cooled. Each Quantum-3 switch consumes as much power as three full racks of standard enterprise servers did ten years ago. The heat density at the **OSFP800 cage** is particularly problematic; the concentrated heat from 128 transceivers, each drawing 20W, can exceed 2.5kW in a 1U chassis.
The DLC (Direct Liquid Cooling) Mandate
Modern 800G switches mount **Cold Plates** directly on the ASIC and often add secondary liquid loops for the OSFP cages. The air-cooled alternative requires fans spinning at 20,000 RPM and consuming up to 300W per switch just to move air. Liquid cooling reduces this parasitic power to ~15W per switch, drastically improving data center PUE.
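A quick sanity check on how much of the switch power envelope goes to coolant movement, assuming a ~1,000W silicon-plus-optics load alongside the fan and pump figures above:

```python
# Share of each switch's power envelope spent just moving coolant, using the
# 300 W (fans) vs. ~15 W (liquid loop) figures above and an assumed load.

IT_LOAD_W = 1000   # ASIC + transceivers, assumption based on this article

for method, parasitic_w in [("air, 20k RPM fans", 300), ("direct liquid", 15)]:
    total_w = IT_LOAD_W + parasitic_w
    share = parasitic_w / total_w * 100
    print(f"{method}: {total_w} W per switch, {share:.1f}% spent on cooling")
```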
[Interactive: Thermal Flow Simulator, data center cooling analysis]
ASHRAE Guidelines: Data centers should maintain inlet temperatures between 18-27°C (64-80°F). Every 1kW of IT load generates 3,412 BTU/hr of heat. CRAC/CRAH units must provide sufficient airflow (CFM) to maintain the temperature delta between cold and hot aisles. Always size cooling systems with 20-30% overhead for redundancy and future growth.
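Those guidelines translate directly into airflow sizing. The sketch below applies the standard sea-level sensible-heat approximation to the 2.5kW hot-spot quoted earlier; the 20°F aisle delta-T and 25% margin are assumed example values.

```python
# Sensible-heat airflow check using the ASHRAE figures above. Standard
# sea-level approximation: Q [BTU/hr] = 1.08 * CFM * delta-T [F].

BTU_PER_KW_HR = 3412   # 1 kW of IT load = 3,412 BTU/hr

def required_cfm(it_load_kw: float, delta_t_f: float) -> float:
    """Airflow needed to carry the load at a given cold/hot-aisle delta-T."""
    return it_load_kw * BTU_PER_KW_HR / (1.08 * delta_t_f)

# Example: the 2.5 kW OSFP cage zone quoted earlier, 20 F aisle delta-T,
# plus a 25% sizing overhead (midpoint of the 20-30% guidance above).
base = required_cfm(2.5, delta_t_f=20)
print(f"base airflow:            {base:.0f} CFM")
print(f"with 25% sizing margin:  {base * 1.25:.0f} CFM")
```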
5. Subnet Management at 100k GPU Scale.
The **InfiniBand Subnet Manager (SM)** is the "brain" of the fabric. In a 400G NDR cluster of 10k GPUs, the SM can typically re-calculate the routing tables in under a second during a link failure. As we move to NDR800 and 100k+ GPU clusters, the work of recomputing the routing state (one Linear Forwarding Table entry per destination LID on every switch) grows steeply with both switch and endpoint count.
NDR800 introduces **Parallel SM** and **Hardware-Assisted Topology Discovery**. This allows the fabric to "self-heal" by offloading the routing calculations to the switches' management controllers. Without these Quantum-3 specific SM enhancements, a single link flap in a 100k GPU cluster could "freeze" the entire fabric for several seconds, causing massive training time-outs.
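The scaling pressure is visible from the data structures alone: every switch carries a Linear Forwarding Table with one entry per endpoint LID. The sketch below counts that state for two cluster sizes; the endpoints-per-switch ratio is an assumption for illustration, not a description of NVIDIA's SM internals.

```python
# Why SM routing recomputation explodes with cluster size: every InfiniBand
# switch holds a Linear Forwarding Table (LFT) with one entry per endpoint
# LID, so the state to recompute grows with (switches x endpoints). The
# endpoints-per-switch ratio below is an assumption, not a vendor figure.

def sm_lft_state(endpoints: int, endpoints_per_switch: float = 20.0):
    """Rough count of switches and total LFT entries the SM must recompute."""
    switches = round(endpoints / endpoints_per_switch)
    return switches, switches * endpoints

for gpus in (10_000, 100_000):
    switches, entries = sm_lft_state(gpus)
    print(f"{gpus:>7,} GPUs -> ~{switches:,} switches, "
          f"~{entries / 1e6:,.0f}M LFT entries")
```

A 10x increase in GPUs produces roughly a 100x increase in forwarding state under this model, which is why serial, software-only recalculation stops being viable at the 100k-GPU scale.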
6. The Optics Revolution: LPO vs. CPO.
At 800G, the transceiver becomes a major heat source. To solve this, the industry is splitting into two camps: **Linear Drive Optics (LPO)** and **Co-Packaged Optics (CPO)**. LPO removes the DSP (Digital Signal Processor) from the transceiver, putting the burden of signal cleanup on the switch silicon. This reduces latency by ~100ns and power by ~4W per port.
Quantum-3 is the first InfiniBand switch designed with the signal integrity headroom to support LPO at scale. However, CPO takes this further by bringing the laser and optical engine inside the ASIC package itself. While NDR800 primarily relies on pluggable OSFP modules, the path to NDR1600 (1.6T) will almost certainly require the transition to CPO to maintain a manageable power density.
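For a sense of scale, the per-port LPO savings quoted above compound quickly across a pod; the pod and port counts in this sketch are illustrative assumptions.

```python
# Fabric-wide impact of LPO, using the per-port savings quoted above
# (~4 W and ~100 ns). Pod size and ports-per-switch are assumptions.

PORTS_PER_SWITCH = 64
SWITCHES_PER_POD = 128
WATTS_SAVED_PER_PORT = 4
NS_SAVED_PER_LINK = 100

optical_ports = PORTS_PER_SWITCH * SWITCHES_PER_POD
print(f"{optical_ports:,} ports -> ~{optical_ports * WATTS_SAVED_PER_PORT / 1000:.1f} kW saved pod-wide")
print(f"3-link path, both directions -> ~{NS_SAVED_PER_LINK * 3 * 2} ns lower RTT")
```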
7. The 224G Loss Budget: Engineering the Trace.
Designing a PCB for 224G PAM4 signaling is an exercise in extreme material science. The end-to-end "Loss Budget" for a 224G link is typically capped at **35-40dB**, and package escapes, vias, and connectors consume a large slice of it before a single trace is routed. With standard high-speed PCB traces losing roughly 1.5dB per inch at 56GHz, the budget that remains gives engineers a maximum trace length of less than 10 inches before the signal is unrecoverable.
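A worked version of that budget, where the split between package, connector, and margin losses is an assumed breakdown chosen only to illustrate how quickly the trace allowance shrinks:

```python
# Loss-budget arithmetic for a 224G PAM4 channel, using the figures quoted
# above. The split between package, connector/via, and design-margin losses
# is an illustrative assumption, not a measured breakdown.

TOTAL_BUDGET_DB = 35.0      # conservative end of the 35-40 dB budget
PACKAGE_DB = 10.0           # ASIC package escape (assumed)
CONNECTOR_VIA_DB = 6.0      # OSFP connector + via stubs (assumed)
MARGIN_DB = 5.0             # crosstalk / manufacturing margin (assumed)
TRACE_DB_PER_INCH = 1.5     # high-speed laminate loss at 56 GHz

remaining_db = TOTAL_BUDGET_DB - PACKAGE_DB - CONNECTOR_VIA_DB - MARGIN_DB
max_trace_inches = remaining_db / TRACE_DB_PER_INCH

print(f"budget left for PCB traces: {remaining_db:.1f} dB")
print(f"maximum routable trace:     {max_trace_inches:.1f} inches")
```

Under these assumptions the routable copper shrinks to roughly nine inches, which is precisely the pressure that pushes designers toward flyover cables and, eventually, co-packaged optics.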
Advanced Engineering FAQ.
1. What is the 'Radix Tax' of NDR800?
2. Why is 224G SerDes considered the 'Nyquist Wall' for copper?
3. Can NDR800 and NDR400 coexist in the same Rail-Optimized design?
4. How does SHARPv4 handle Sparse Tensors better than SHARPv3?
5. What is the impact of Linear Drive Optics (LPO) on NDR800 Bit Error Rate?
6. Why did NVIDIA choose OSFP over QSFP for 800G?
7. Does NDR800 support 'In-Network Security'?
Technical Benchmark: The NDR800 Edge
| Metric | Quantum-2 (400G) | Quantum-3 (800G) |
|---|---|---|
| Total Throughput | 51.2 Tbps | 102.4 Tbps |
| Switch Latency | 400 ns (Base) | 450 ns (Aggressive FEC) |
| 2-Layer GPU Max | 2,048 GPUs | 10,368+ GPUs |
| In-Network Compute | SHARPv3 | SHARPv4 (Sparse Support) |
| Power Efficiency | ~0.6 W / Gbps | ~0.42 W / Gbps (LPO) |