Worst-Case Sync.

In traditional cloud computing, we design for average-case traffic. In AI infrastructure, we design for **worst-case synchronization**: when a Large Language Model (LLM) performs an "All-Reduce" operation, every GPU in the cluster must communicate simultaneously.

This necessitates the use of **Non-Blocking Fabrics**, where the bisection bandwidth equals half the aggregate bandwidth of the connected nodes, so that any half of the cluster can transmit to the other half at full line rate.
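To make the worst case concrete, here is a minimal sketch of the two numbers that drive fabric design: how much each GPU must move during a ring all-reduce, and the bisection bandwidth a 1:1 fabric must supply. The function names and the 70B/FP16 example figures are my own illustrations, not from a specific vendor spec.

```python
def ring_allreduce_bytes_per_gpu(grad_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends (and receives) in a ring all-reduce.

    Standard result: 2 * (N - 1) / N * payload, approaching 2x the
    gradient size as the cluster grows.
    """
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes

def full_bisection_bandwidth_gbps(n_gpus: int, link_gbps: float) -> float:
    """Bisection bandwidth a 1:1 non-blocking fabric must provide:
    half the nodes transmitting to the other half at line rate."""
    return (n_gpus / 2) * link_gbps

# Example: ~140 GB of FP16 gradients, 1,024 GPUs on 400G links.
print(ring_allreduce_bytes_per_gpu(140e9, 1024) / 1e9)  # ~279.7 GB per GPU
print(full_bisection_bandwidth_gbps(1024, 400))         # 204800.0 Gbps
```

The key takeaway is that all-reduce traffic is nearly 2x the payload regardless of cluster size, so the fabric, not the NIC, becomes the binding constraint.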

Fabric Topology Visualizer

[Interactive tool: models bisection bandwidth and path diversity in multi-tier Clos networks, with multi-path ECMP distribution across a 1:1 non-blocking spine layer.]

"The transition from 2-layer to 3-layer Clos is the point where cable management complexity becomes a physical limit."

1. The Fat-Tree (Clos) Topology

Built on Charles Clos's non-blocking switching networks (1953), the 3-tier Fat-Tree is the gold standard for AI clusters. Unlike a standard enterprise tree, where the "trunk" is a bottleneck, a Fat-Tree gets thicker as you move toward the core.

Level 1 - Leaf

Top-of-Rack (ToR) switches connecting GPUs. In AI, these are often 1:1 speed-matched (e.g., 8 x 400G down, 8 x 400G up).

Level 2 - Spine

The aggregation layer. Every Leaf switch connects to every Spine switch, creating a multi-path fabric.

Level 3 - Super

The Core layer for massive clusters. These interconnect multiple pods of Leaf/Spine groups into a single 10k+ node domain.
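The three tiers above compose into the classic k-ary fat-tree, whose sizing follows directly from the switch radix. A sketch of the standard formulas (hosts = k³/4, switches = 5k²/4), assuming every tier is built from the same k-port switch; the function name is my own:

```python
def fat_tree_size(k: int) -> dict:
    """Sizing for a classic 3-tier k-ary fat-tree of k-port switches.

    Each of the k pods holds k/2 leaf and k/2 spine switches; the core
    (super-spine) layer has (k/2)^2 switches. Hosts top out at k^3 / 4.
    """
    assert k % 2 == 0, "radix must be even"
    leaf = spine = k * (k // 2)   # per-layer totals across all k pods
    core = (k // 2) ** 2
    return {
        "hosts": k ** 3 // 4,
        "leaf_switches": leaf,
        "spine_switches": spine,
        "core_switches": core,
        "total_switches": leaf + spine + core,
    }

# A 64-port radix yields 65,536 host ports from 5,120 switches.
print(fat_tree_size(64))
```

Note the cubic scaling: doubling the radix multiplies the host count by eight, which is why high-radix switches dominate AI fabric design.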

2. Rail-Optimized Architecture

Modern AI servers (such as the NVIDIA DGX H100) contain 8 GPUs. To minimize latency and simplify cabling, we use **Rail-Optimization**: GPU *n* of every server connects to the same leaf switch, forming 8 parallel "rails" across the cluster.

By keeping these rails physically grouped on the same leaf switches, we reduce the number of optical "hops" a packet must take, slash tail latency, and prevent one GPU's traffic from interfering with another rail.
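The hop-count benefit can be shown with a toy model. This sketch assumes a single pod where each GPU index maps to its own leaf switch; real SuperPOD wiring maps rails per scalable unit, and the function names are hypothetical:

```python
def rail_leaf(gpu_index: int) -> int:
    """Toy rail-optimized wiring: GPU n of every server in the pod
    lands on leaf switch n, so rail n shares a single leaf."""
    return gpu_index

def switch_hops(gpu_a: int, gpu_b: int) -> int:
    """Switch hops between two GPUs on different servers in one pod:
    same rail -> shared leaf (1 hop); cross-rail -> leaf-spine-leaf (3)."""
    return 1 if rail_leaf(gpu_a) == rail_leaf(gpu_b) else 3

print(switch_hops(3, 3))  # 1: same-rail traffic stays on one leaf
print(switch_hops(3, 5))  # 3: crossing rails costs a spine traversal
```

Collectives that keep traffic rail-aligned (as NCCL does for all-reduce) therefore avoid the spine entirely for most of their volume.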

3. Oversubscription Math

In enterprise IT, an oversubscription of 10:1 or 20:1 is common. In AI, we aim for **1:1 (Non-Blocking)**.

1:1 Non-Blocking

Total upstream capacity equals total downstream capacity: zero congestion at the fabric level. Mandatory for top-tier LLM training.

2:1 Oversubscribed

Saves 50% on spine switches and optics. Acceptable for inference clusters or smaller fine-tuning jobs.
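The ratio itself is simple port arithmetic, downstream capacity divided by upstream capacity at the leaf. A minimal sketch (function name and port counts are my own illustration):

```python
def oversubscription(down_ports: int, down_gbps: float,
                     up_ports: int, up_gbps: float) -> float:
    """Leaf oversubscription ratio: downstream / upstream capacity.

    1.0 means 1:1 non-blocking; 2.0 means 2:1 oversubscribed.
    """
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# 32 x 400G to GPUs, 32 x 400G to spines -> non-blocking
print(oversubscription(32, 400, 32, 400))  # 1.0
# Halving the uplinks saves spine ports but doubles the ratio
print(oversubscription(32, 400, 16, 400))  # 2.0
```

This also makes the cost trade-off explicit: every uplink removed is a spine port and an optic saved, paid for in potential congestion.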

Engineering Tool: Topology Builder

Design a Fat-Tree topology, calculate switch requirements, and verify bisection bandwidth for your specific GPU count.


Technical Standards & References

[bell-clos-1953] Charles Clos (1953). "A Study of Non-blocking Switching Networks." Bell System Technical Journal.
[nvidia-dgx-superpod] NVIDIA (2024). NVIDIA DGX SuperPOD Architecture Guide. NVIDIA Corporation.