Beyond the Leaf: The Fat-Tree Standard

In the era of Generative AI, the network is no longer a peripheral component—it is the backplane of a massive distributed computer. A traditional oversubscribed network that works for web traffic will collapse under the weight of an LLM training job. We must move towards strictly non-blocking topologies where the bisection bandwidth matches the aggregate injection rate of all GPUs.
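The non-blocking condition above can be sketched numerically. This is an illustrative check, not a vendor sizing tool; the NIC speed and GPU counts are assumptions chosen for the example.

```python
# Sketch: check whether a fabric is non-blocking for an all-to-all-heavy job.
# All figures below are illustrative assumptions, not vendor specs.

def is_nonblocking(num_gpus: int, nic_gbps: float, bisection_gbps: float) -> bool:
    """A fabric is (roughly) non-blocking when bisection bandwidth covers
    the worst case: half the GPUs injecting at line rate across the cut."""
    worst_case_cross_traffic = (num_gpus / 2) * nic_gbps
    return bisection_gbps >= worst_case_cross_traffic

# 1,024 GPUs with 400 Gb/s NICs need >= 204.8 Tb/s of bisection bandwidth.
print(is_nonblocking(1024, 400, 204_800))   # True
print(is_nonblocking(1024, 400, 102_400))   # False: a 2:1 oversubscribed fabric
```

The same check explains why oversubscription that is harmless for web traffic fails here: LLM collectives can drive every NIC at line rate simultaneously.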

Rail Alignment

By aligning GPU rails (e.g., all GPU0s) to the same top-of-rack switches, we minimize the hop count for the most frequent communication patterns in 3D parallelism.
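As a minimal sketch of the idea (the helper below is hypothetical, not a vendor tool): in a rail-aligned design, the switch a NIC lands on is a function of the GPU's local index alone, so same-rail collectives between nodes stay one switch hop apart.

```python
# Hypothetical rail-aligned switch assignment: every GPU with the same
# local index ("rail") across all nodes connects to the same rail switch.

def rail_switch(node: int, local_gpu: int, gpus_per_node: int = 8) -> int:
    # Rail-aligned: the switch is chosen by the GPU's local index,
    # independent of which node the GPU sits in.
    assert 0 <= local_gpu < gpus_per_node
    return local_gpu

# GPU0 on node 0 and GPU0 on node 63 share rail switch 0,
# so their traffic never crosses the spine tier.
print(rail_switch(0, 0) == rail_switch(63, 0))   # True
```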

Radix Scaling

The switch radix (port count) determines how many tiers are required for a given cluster size. High-radix switches (64-128 ports) flatten the topology, reducing latency by minimizing the number of switch hops a packet must traverse.
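The radix-to-tiers relationship follows the standard fat-tree capacity formulas: a non-blocking 2-tier leaf-spine supports k²/2 hosts and a 3-tier fat-tree supports k³/4 hosts for radix k. A small sketch (the function names are illustrative):

```python
# Sketch: strictly non-blocking fat-tree capacity for a given switch radix.

def max_hosts(radix: int, tiers: int) -> int:
    if tiers == 2:
        # Leaf-spine: half of each leaf's ports face down to hosts.
        return radix ** 2 // 2
    if tiers == 3:
        # Classic 3-tier fat-tree capacity.
        return radix ** 3 // 4
    raise ValueError("only 2- and 3-tier fabrics modeled here")

def tiers_needed(num_gpus: int, radix: int) -> int:
    """Smallest tier count whose capacity covers num_gpus."""
    for tiers in (2, 3):
        if max_hosts(radix, tiers) >= num_gpus:
            return tiers
    raise ValueError("cluster too large for a 3-tier fabric at this radix")

print(max_hosts(64, 2))          # 2048
print(max_hosts(128, 3))         # 524288
print(tiers_needed(24_576, 64))  # 3
```

This is why moving from 64- to 128-port switches can keep a cluster at two tiers that would otherwise need three.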

InfiniBand vs. RoCE v2

While InfiniBand remains the gold standard for pure performance due to its hardware-level flow control and low header overhead, RDMA over Converged Ethernet (RoCE v2) has closed the gap. Modern Ethernet switches with large buffers and sophisticated ECN/PFC congestion-control tuning can now support clusters with tens of thousands of GPUs at near-IB efficiency.

