Scaling GPU Fabrics: Fat-Tree vs. Dragonfly
The Radix Constraint: Ports are Destiny.
In the physics of data centers, the most critical number is the **Switch Radix** ($R$). Radix refers to the number of high-speed ports available on a single ASIC. Because every hop in a network adds latency and every cable adds cost, the goal of any topology is to connect the maximum number of GPUs with the minimum number of hops.
For a standard 3-tier **Fat-Tree** (also known as a non-blocking Folded Clos network), the maximum number of endpoints ($N$) that can be supported is determined by the switch radix:
$$N = \frac{R^3}{4}$$
Where $R$ is the switch port count (radix).
If we use a current-generation switch with a radix of **64** (typical for 51.2T ASICs like Tomahawk 5), we can support:
$$N = \frac{64^3}{4} = 65{,}536 \text{ GPUs}$$
Engineering a 64K cluster is a "moonshot" exercise. The number of links scales linearly with the number of GPUs, but the complexity of managing those links, along with the cost of the transceivers, scales with the number of switch tiers.
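The capacity arithmetic for a non-blocking folded Clos can be checked with a few lines of Python. This is a minimal sketch, assuming every switch splits its ports evenly between downlinks and uplinks (with the top tier using all ports as downlinks) and that the radix is even; the function names are illustrative and not taken from any library.

```python
def fat_tree_hosts(radix: int, tiers: int = 3) -> int:
    """Maximum endpoints of a non-blocking folded Clos: 2 * (R/2)^tiers.

    For tiers=3 this equals R^3 / 4.
    """
    half = radix // 2  # assumption: even radix, half the ports face down
    return 2 * half ** tiers


def fat_tree_switches(radix: int, tiers: int = 3) -> int:
    """Switch count at full scale: (2 * tiers - 1) * (R/2)^(tiers - 1).

    For tiers=3 this equals 5 * R^2 / 4.
    """
    half = radix // 2
    return (2 * tiers - 1) * half ** (tiers - 1)


def fat_tree_max_hops(tiers: int = 3) -> int:
    """Worst-case switch hops: up through every tier and back down."""
    return 2 * tiers - 1


if __name__ == "__main__":
    for r in (32, 64, 128):
        print(r, fat_tree_hosts(r), fat_tree_switches(r), fat_tree_max_hops())
```

With a radix of 64 and three tiers, the sketch returns 65,536 endpoints and 5,120 switches, which matches the largest row of the cable-count table in this article.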
The Scaling Tax: Fat-Tree Cable Counts
| GPUs (N) | Total Switches | Total Optics (approx) | Max Fiber Length |
|---|---|---|---|
| 8,192 | 640 | 40,960 | ~200m |
| 16,384 | 1,280 | 81,920 | ~500m |
| 32,768 | 2,560 | 163,840 | ~1km (Inter-Row) |
| 65,536 | 5,120 | 327,680 | ~2km (Campus Scale) |
Collective Map: All-Reduce vs. All-to-All
Not all AI communication patterns are equal. The choice between Fat-Tree and Dragonfly depends heavily on which collective communication primitives your training library (NCCL, RCCL) uses:
- **All-Reduce**: Typically implemented as a "Ring" or "Recursive Doubling" algorithm. This pattern is relatively friendly to Fat-Trees because it can be localized. If your job fits within a single Rack (8-16 GPUs) or a single Leaf group, the traffic never hits the Core.
- **All-to-All**: Commonly used in **MoE (Mixture of Experts)** models. Here, every GPU sends unique data to every other GPU. This is the ultimate stress test. In an MoE model with 32K experts, the network bisection bandwidth *is* the model's bottleneck. Dragonfly topologies often struggle here unless the adaptive routing can predict the experts' data-shuffling pattern.
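The difference between the two patterns shows up in who each GPU talks to and how much it sends. The sketch below assumes the textbook ring All-Reduce (a reduce-scatter followed by an all-gather) and a plain All-to-All in which each rank splits its buffer into equal chunks; it does not model what NCCL or RCCL actually select at runtime, and all function names are hypothetical.

```python
def ring_allreduce_peers(rank: int, world: int) -> set[int]:
    """Ranks a GPU exchanges data with in a ring: only its two neighbors."""
    return {(rank - 1) % world, (rank + 1) % world} - {rank}


def all_to_all_peers(rank: int, world: int) -> set[int]:
    """Ranks a GPU exchanges data with in All-to-All: everyone else."""
    return set(range(world)) - {rank}


def ring_allreduce_bytes_sent(buffer_bytes: float, world: int) -> float:
    """Per-GPU bytes sent: 2 * (n - 1) / n of the buffer."""
    return 2 * (world - 1) * buffer_bytes / world


def all_to_all_bytes_sent(buffer_bytes: float, world: int) -> float:
    """Per-GPU bytes sent: (n - 1) / n of the buffer, spread over n - 1 peers."""
    return (world - 1) * buffer_bytes / world


if __name__ == "__main__":
    n = 8
    print(sorted(ring_allreduce_peers(0, n)))  # two neighbors
    print(len(all_to_all_peers(0, n)))         # n - 1 distinct destinations
```

The ring keeps each GPU's traffic on two fixed neighbors, which is why placement can keep it below the Core. All-to-All fans out to every other rank, so at scale most of its bytes must cross the bisection.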
Fat-Tree: The Gold Standard.
Leaf/Spine Regularity
In a Fat-Tree, every path between two GPUs in different pods has exactly the same number of switch hops (e.g., 5 hops in a 3-tier fabric), and no path is longer. This **latency uniformity** is critical for synchronous AI training where "stragglers" (delayed GPUs) slow down the entire job.
Isolation Forensics
A Fat-Tree is naturally multitenant. Since bandwidth is uniform, Job A running on one side of the tree cannot interfere with Job B on the other side through oversubscription.
Dragonfly: The Cabling Optimizer.
Performance in a Fat-Tree comes at the cost of "Long Cables." As the cluster grows, the number of expensive optical links to the core switch tier explodes. **Dragonfly** was designed to solve this by organizing switches into **Virtual Groups** and connecting those groups directly.
A Dragonfly topology is defined by four global parameters:
- $a$: Switches per Group
- $p$: Ports to Hosts (GPUs)
- $h$: Global Links per Switch
- $g$: Total Groups
This structure creates a **Hierarchical Topology**. Inside a group, every switch is connected to every other switch (Intra-group). Externally, every group is connected to every other group (Inter-group).
The maximum number of groups ($g_{\max}$) is determined by the number of global links: $g_{\max} = a \cdot h + 1$. This allows a Dragonfly to scale to massive port counts with only a **diameter of 3** (maximum 3 hops between any two nodes).
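The sizing rules can be sketched in Python. Here `a` is switches per group, `p` is host ports per switch, and `h` is global links per switch. The sketch assumes a fully connected group (so each switch spends `a - 1` ports on local links) and the commonly cited balanced ratio `a = 2p = 2h`; the class and the `balanced` helper are illustrative names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dragonfly:
    a: int  # switches per group
    p: int  # host (GPU) ports per switch
    h: int  # global links per switch

    @property
    def radix(self) -> int:
        """Ports needed per switch: hosts + local full mesh + global links."""
        return self.p + (self.a - 1) + self.h

    @property
    def max_groups(self) -> int:
        """One global link to every other group: a * h + 1."""
        return self.a * self.h + 1

    @property
    def max_hosts(self) -> int:
        """Endpoints at maximum size: a * p * (a * h + 1)."""
        return self.a * self.p * self.max_groups


def balanced(h: int) -> Dragonfly:
    """Assumed balanced sizing a = 2p = 2h."""
    return Dragonfly(a=2 * h, p=h, h=h)


if __name__ == "__main__":
    d = balanced(8)
    print(d.radix, d.max_groups, d.max_hosts)
```

For `h = 8` the balanced configuration needs 31 ports per switch and tops out at 129 groups and 16,512 hosts.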
Technical Comparison Matrix
| Metric | Fat-Tree (k=R/2) | Dragonfly (a, p, h) |
|---|---|---|
| Scaling Limit | $N = R^3/4$ (Significantly higher) | $N = a \cdot p \cdot (a \cdot h + 1)$ |
| Hop Count (Max) | 5 (3-tier) | 3 (at $a=2h$) |
| Bisection BW | 100% (Non-blocking) | Traffic-Pattern Dependent |
| Cable Complexity | High (Log-scale) | Optimized (Low optical count) |
| Routing Req. | ECMP (Basic) | UGAL/PAR (Complex) |
The Routing Paradox: UGAL vs. ECMP.
In a Fat-Tree, routing is "Easy." Because the bandwidth is uniform and the network is non-blocking, a simple **ECMP (Equal-Cost Multi-Path)** hash is sufficient to distribute traffic. If two paths exist, they have the same hop count and capacity, so they are mathematically equivalent in performance.
In a Dragonfly, this assumption breaks. If you use standard shortest-path routing (ECMP) on a Dragonfly, certain global links will become "Hotspots" while others sit idle. To achieve theoretical bisection bandwidth, Dragonfly requires **Adaptive Routing**, specifically **UGAL (Universal Globally-Adaptive Load-balanced)** routing.
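To make the static side of this contrast concrete, here is a minimal sketch of ECMP path selection. Real switch ASICs use vendor-specific hash functions and field selections; CRC32 over a joined 5-tuple is only a stand-in, and `ecmp_pick` and `path_load` are hypothetical helper names.

```python
import zlib
from collections import Counter


def ecmp_pick(src_ip, dst_ip, src_port, dst_port, proto, n_paths):
    """Map a flow's 5-tuple to one of n_paths equal-cost next hops.

    The choice depends only on header fields, never on link load.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % n_paths  # CRC32 is an assumed stand-in hash


def path_load(flows, n_paths):
    """Count how many flows land on each path."""
    load = Counter(ecmp_pick(*flow, n_paths) for flow in flows)
    return [load.get(i, 0) for i in range(n_paths)]


if __name__ == "__main__":
    # Eight flows between distinct GPU pairs (UDP, protocol number 17).
    flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791, 17)
             for i in range(8)]
    print(path_load(flows, n_paths=4))
```

Because the mapping ignores congestion, two large flows can hash onto the same path while another path stays idle. On a non-blocking Fat-Tree the equal-cost paths are interchangeable, so this is tolerable; on a Dragonfly the minimal paths share scarce global links.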
Minimal Routing (MIN)
Sends packets over the shortest path. Efficient for uniform traffic, but catastrophic when an "all-to-all" collective concentrates traffic between a few groups, because every packet then contends for the same direct inter-group links.
Valiant Routing (VLB)
Sends packets to a random intermediate group first, effectively doubling the path length but ensuring load balancing. This prevents hotspots but wastes bandwidth and increases latency.
**UGAL** dynamically chooses between MIN and VLB for each packet. The source switch monitors the occupancy of its local output queues. If the direct path is congested, it detours the traffic through a random intermediate group.
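The decision can be sketched as a comparison of estimated delay on the two candidate paths, where delay is approximated as local queue occupancy times hop count. This is a simplified illustration of the classic rule, not any vendor's implementation: the hop counts (3 for minimal, 5 for Valiant), the `queue_depth` dictionary, and the function names are all assumptions.

```python
import random

MIN_HOPS = 3  # assumed: local, global, local
VLB_HOPS = 5  # assumed: local, global, local, global, local


def ugal_choose(q_min, hops_min, q_val, hops_val):
    """Pick the path with the lower queue-depth * hop-count product."""
    return "MIN" if q_min * hops_min <= q_val * hops_val else "VLB"


def route(src_group, dst_group, n_groups, queue_depth, rng=random):
    """Return the list of groups visited after leaving src_group.

    queue_depth maps a next-hop group to the occupancy of the local
    output queue that leads toward it (hypothetical telemetry).
    """
    candidates = [g for g in range(n_groups) if g not in (src_group, dst_group)]
    if src_group == dst_group or not candidates:
        return [dst_group]
    mid = rng.choice(candidates)  # random intermediate group, as in VLB
    choice = ugal_choose(queue_depth.get(dst_group, 0), MIN_HOPS,
                         queue_depth.get(mid, 0), VLB_HOPS)
    return [dst_group] if choice == "MIN" else [mid, dst_group]


if __name__ == "__main__":
    print(route(0, 1, 4, {1: 0, 2: 0, 3: 0}))    # idle network: direct path
    print(route(0, 1, 4, {1: 100, 2: 0, 3: 0}))  # congested: detour via 2 or 3
```

Weighting the queue depth by hop count makes the switch prefer the minimal path unless it is congested enough to outweigh the longer detour.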
Cabling Hydraulics.
Despite the complexity of routing, the driving force for Dragonfly is **Optics Cost**. In a Fat-Tree, almost 50% of your networking budget is spent on transceivers and active optical cables (AOCs).
By placing more switches in a group, Dragonfly allows you to use **DACs (Direct Attach Copper)** for roughly 80% of the links. Copper is passive, uses zero power, and has a Failure-in-Time (FIT) rate near zero.
The "Golden Ratio" of Cabling
| Reach | Medium | Notes |
|---|---|---|
| 0-3m | DAC Copper (Passive) | Zero latency, zero power. |
| 3-7m | ACC/AEC Copper (Active) | Retimed, lower power than optics. |
| 7m+ | Fiber Optics (SMF) | Essential for Inter-Row links. |
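These reach bands translate directly into a selection rule. The sketch below assumes the boundaries are inclusive at 3m and 7m, which the bands above leave ambiguous; `pick_cable` and `copper_fraction` are hypothetical helpers for illustration.

```python
def pick_cable(length_m: float) -> str:
    """Choose a medium from link length, using the 0-3m / 3-7m / 7m+ bands."""
    if length_m < 0:
        raise ValueError("length must be non-negative")
    if length_m <= 3:   # assumption: 3m itself still counts as DAC reach
        return "DAC"
    if length_m <= 7:   # assumption: 7m itself still counts as active copper
        return "ACC/AEC"
    return "SMF optics"


def copper_fraction(lengths_m) -> float:
    """Share of links that can stay on copper (passive or active)."""
    lengths = list(lengths_m)
    if not lengths:
        return 0.0
    copper = sum(1 for m in lengths if pick_cable(m) != "SMF optics")
    return copper / len(lengths)


if __name__ == "__main__":
    plan = [1.0, 2.5, 2.5, 5.0, 30.0]  # illustrative link lengths in meters
    print([pick_cable(m) for m in plan], copper_fraction(plan))
```

Running a planned cable list through a rule like this gives a quick estimate of how much of a fabric can avoid optics.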
Optical Power Physics.
At 32K GPUs, networking power consumption becomes a thermal bottleneck. A single 800G OSFP transceiver consumes between **14W and 18W**. In a 3-tier Fat-Tree with 160,000 optics, the network alone consumes over **2.5 Megawatts** of power—not for compute, but simply for moving photons between racks.
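The power figure follows from simple multiplication. The sketch below uses the 14W to 18W per-module range and the 160,000-optic count quoted above; the 16W midpoint is an assumption chosen for illustration, and `optics_power_mw` is a hypothetical helper.

```python
def optics_power_mw(n_optics: int, watts_per_module: float) -> float:
    """Total transceiver power in megawatts."""
    return n_optics * watts_per_module / 1e6


if __name__ == "__main__":
    n = 160_000
    for watts in (14, 16, 18):  # 16 W is an assumed midpoint
        print(f"{watts} W per module -> {optics_power_mw(n, watts):.2f} MW")
```

The range works out to roughly 2.2 MW to 2.9 MW, so the "over 2.5 Megawatts" figure corresponds to modules drawing about 16W each.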
The Power Trajectory
This "Power Tax" is why Dragonfly is gaining traction: it physically removes the middle tier of switches and optics, effectively deleting 30% of the entire network's power footprint.
Topology Encyclopedia.
Switch Radix (R)
The total number of high-speed ports on a single switch ASIC. Higher radix allows for fewer tiers and lower hop counts at scale.
Bisection Bandwidth
The minimum bandwidth between two equal halves of the network. A "Full Bisection Bandwidth" (FBB) network has no bottlenecks for all-to-all patterns.
Network Diameter
The maximum number of hops between any two nodes in the topology. Clos networks typically have a diameter of 3 or 5.
Oversubscription
The ratio of potential bandwidth from consumers to the available bandwidth in the core. 1:1 is non-blocking; 2:1 means the core can only handle 50% of peak load.
Adaptive Routing
Traffic steering decisions made at packet-time based on real-time link occupancy rather than static hashing.
UGAL
Universal Globally-Adaptive Load-balanced routing. The definitive algorithm for Dragonfly topologies to balance minimal and non-minimal paths.
SHARP (In-Network Computing)
Scalable Hierarchical Aggregation and Reduction Protocol. Offloads All-Reduce math to the switch ASIC itself, reducing latency by avoiding multiple host-memory trips.
PFC & ECN
Priority Flow Control and Explicit Congestion Notification. Key protocols for Lossless Ethernet (RoCEv2) to prevent buffer overflows without dropping packets.
LPO (Linear-drive Pluggable Optics)
Optical modules that remove the DSP chip, relying on the host ASIC for signal compensation. Reduces power by ~50% per link.
CPO (Co-Packaged Optics)
Integrating optical engines directly onto the switch ASIC package to eliminate the electrical reach requirement of pluggable modules.
Tail Latency
The latency of the slowest 1% or 0.1% of packets. In synchronous training, tail latency is the only latency that matters.
All-Reduce
A collective communication primitive where all nodes share and sum their gradients. The primary "heavy lift" of Distributed Data Parallel (DDP) training.
ECMP (Static Hash)
Equal-Cost Multi-Path. A standard routing technique that hashes packet headers to select a path. Prone to hash collisions and hotspots.
Folded Clos
The mathematical name for a Fat-Tree, where the multi-stage network is "folded" to keep leaf switches and hosts in the same physical racks.
Theoretical Bisection Max
13.1 PB/s Peak.
Simulation conducted for a 65,536-node cluster using 800G per-link bandwidth in a non-blocking 3-tier Clos fabric. Real-world utilization is subject to adversarial traffic patterns, hash collisions, and thermal throttling of optical transceivers.
© 2026 Pingdo Labs. Technical Reference Series No. 14.
