Architecting Non-Blocking AI Fabrics
Optimization Strategies for Scale-Out GPU Clusters
Beyond the Leaf: The Fat-Tree Standard
In the era of Generative AI, the network is no longer a peripheral component—it is the backplane of a massive distributed computer. A traditional oversubscribed network that works for web traffic will collapse under the weight of an LLM training job. We must move towards strictly non-blocking topologies where the bisection bandwidth matches the aggregate injection rate of all GPUs.
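The non-blocking condition above can be sketched as a simple capacity check: bisection bandwidth must meet or exceed the aggregate injection rate. The topology figures below (leaf counts, 400G links) are illustrative assumptions, not vendor data.

```python
# Sketch: check whether a two-tier leaf-spine fabric is non-blocking.
# All port counts and speeds here are illustrative assumptions.

def is_non_blocking(gpus: int, nic_gbps: float,
                    leaves: int, uplinks_per_leaf: int,
                    uplink_gbps: float) -> bool:
    """A fabric is non-blocking when the leaf-to-spine bisection
    bandwidth covers the aggregate injection rate of all GPUs."""
    injection = gpus * nic_gbps                           # total GPU injection
    bisection = leaves * uplinks_per_leaf * uplink_gbps   # leaf->spine capacity
    return bisection >= injection

# Example: 1024 GPUs with 400G NICs behind 32 leaves, each with 32x400G uplinks.
print(is_non_blocking(1024, 400, 32, 32, 400))  # -> True (409.6T == 409.6T)
```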
Rail Alignment
By aligning GPU rails (e.g., all GPU0s) to the same top-of-rack switches, we minimize the hop count for the most frequent communication patterns in 3D parallelism.
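A minimal sketch of that rail-to-leaf mapping follows; the 32-hosts-per-leaf grouping and the function name `rail_leaf` are hypothetical, since the actual wiring depends on the platform's rail count and leaf radix.

```python
# Sketch of rail-aligned leaf assignment; `hosts_per_leaf` is an
# illustrative assumption, not a specific platform's layout.

def rail_leaf(host_id: int, gpu_local_rank: int, hosts_per_leaf: int = 32):
    """Return (rail, leaf-within-rail). GPU<k> on every host in a group
    shares one leaf, so same-rail collectives stay a single switch hop."""
    return gpu_local_rank, host_id // hosts_per_leaf

# GPU0 of hosts 0..31 all land on leaf (0, 0); GPU0 of host 32 on (0, 1).
print(rail_leaf(0, 0), rail_leaf(31, 0), rail_leaf(32, 0))
```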
Radix Scaling
The switch radix (port count) determines how many tiers are required for a given cluster size. High-radix switches (64-128 ports) reduce latency by minimizing the number of switch hops between any two GPUs.
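The radix-to-tier relationship can be worked out from the standard non-blocking fat-tree capacities: a single tier supports `radix` endpoints, two tiers `radix**2 / 2`, three tiers `radix**3 / 4`. A small sketch under those assumptions:

```python
def max_hosts(radix: int, tiers: int) -> int:
    """Maximum endpoints in a non-blocking fat-tree: radix for one tier,
    then 2 * (radix/2)**tiers (i.e. radix**2/2 at two tiers, radix**3/4
    at three), since each non-leaf tier splits ports half down, half up."""
    if tiers == 1:
        return radix
    return 2 * (radix // 2) ** tiers

def tiers_needed(gpus: int, radix: int) -> int:
    """Smallest tier count whose non-blocking capacity covers the cluster."""
    tiers = 1
    while max_hosts(radix, tiers) < gpus:
        tiers += 1
    return tiers

# A 64-port radix supports 2048 endpoints in two tiers, 65536 in three.
print(tiers_needed(2048, 64), tiers_needed(24576, 64))  # -> 2 3
```

This is why doubling the radix matters more than it first appears: capacity grows with the tier-th power of `radix / 2`, so a larger radix often removes an entire tier of switches and cables.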
InfiniBand vs. RoCE v2
While InfiniBand remains the gold standard for raw performance due to its credit-based, hardware-level flow control and low header overhead, RDMA over Converged Ethernet (RoCE v2) has closed the gap. Modern Ethernet switches with deep buffers and carefully tuned ECN/PFC congestion control can now support clusters with tens of thousands of GPUs at near-IB efficiency.
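One concrete cost of lossless RoCE v2 is PFC headroom: after a switch sends a PAUSE frame, it must still absorb the bytes already in flight on the cable. A back-of-the-envelope sizing sketch, with the signal speed and MTU as illustrative assumptions:

```python
# Sketch: sizing PFC headroom buffers for a lossless RoCE v2 link.
# The propagation speed (~2/3 c) and 4 KB MTU are assumed figures.

def pfc_headroom_bytes(link_gbps: float, cable_m: float, mtu: int = 4096) -> int:
    """Bytes a switch must absorb after sending PAUSE: one cable round
    trip of in-flight data, plus an MTU already serializing at each end."""
    prop_delay_s = cable_m / 2e8                       # ~2/3 c in fiber/copper
    in_flight = (link_gbps * 1e9 / 8) * (2 * prop_delay_s)
    return int(in_flight + 2 * mtu)

# A 400G link over 100 m needs on the order of 58 KB of headroom per
# priority per port -- which is why buffer depth dominates switch selection.
print(pfc_headroom_bytes(400, 100))
```

Multiply that figure across ports and lossless priority classes and the appeal of deep-buffered Ethernet silicon for large RoCE v2 fabrics becomes clear.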
