Fat-Tree vs. Dragonfly: The Scaling Forensics of AI Networking

The Radix Constraint: Ports are Destiny.

In the physics of data centers, the most critical number is the **Switch Radix** ( $R$ ). Radix refers to the number of high-speed ports available on a single ASIC. Because every hop in a network adds latency and every cable adds cost, the goal of any topology is to connect the maximum number of GPUs with the minimum number of hops.

For a standard 3-tier **Fat-Tree** (also known as a non-blocking Folded Clos network), the maximum number of endpoints ( $N$ ) that can be supported is determined by the switch radix:

$$N_{max} = \frac{R^3}{4}$$

where $R$ is the switch port count (radix).

If we use a current-generation switch with a radix of **64** (typical for 51.2T ASICs like Tomahawk 5), we can support:

$$N = \frac{64^3}{4} = \frac{262{,}144}{4} = 65{,}536 \text{ GPUs}$$
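The arithmetic above can be checked with a short sketch. It relies on the standard folded-Clos generalization $N = 2(R/2)^t$ for $t$ tiers, which reduces to $R^3/4$ at three tiers; the function name is illustrative and not taken from any library.

```python
def fat_tree_max_endpoints(radix: int, tiers: int = 3) -> int:
    """Maximum hosts in a non-blocking folded Clos built from `radix`-port switches."""
    if radix < 2 or radix % 2 != 0:
        raise ValueError("radix must be a positive even number")
    if tiers < 1:
        raise ValueError("tiers must be at least 1")
    half = radix // 2
    # Every tier below the top splits its ports half down and half up;
    # the top tier points all of its ports down, hence the factor of 2.
    return 2 * half ** tiers


for radix in (32, 64, 128):
    print(radix, fat_tree_max_endpoints(radix))
```

Doubling the radix multiplies the three-tier ceiling by eight, which is why each ASIC generation shifts the practical limit of a Fat-Tree so sharply.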

Engineering a 64K cluster is a "moonshot" exercise. The number of links ( $L$ ) scales linearly with the number of GPUs, but the complexity of managing those links—and the cost of the transceivers—scales with the number of switch tiers.

The Scaling Tax: Fat-Tree Cable Counts

GPUs (N)	Total Switches	Total Optics (approx)	Max Fiber Length
8,192	640	40,960	~200m
16,384	1,280	81,920	~500m
32,768	2,560	163,840	~1km (Inter-Row)
65,536	5,120	327,680	~2km (Campus Scale)
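The switch and optics columns follow a simple pattern that can be sketched in a few lines. This is a rough model, not a procurement tool: it assumes radix-64 switches in a partially populated three-tier Fat-Tree, and the ratio of 5 optics per GPU is inferred from the table itself rather than derived from a specific cabling plan. The function name is hypothetical.

```python
def fat_tree_bill_of_materials(gpus: int, radix: int = 64, optics_per_gpu: int = 5) -> dict:
    """Approximate switch and optics counts for a non-blocking 3-tier Fat-Tree."""
    if gpus % radix != 0:
        raise ValueError("gpus must be a multiple of the switch radix")
    half = radix // 2
    leaf = gpus // half    # each leaf serves radix/2 GPUs
    spine = gpus // half   # non-blocking: as many spine ports as leaf uplinks
    core = gpus // radix   # core switches point all ports down
    return {
        "switches": leaf + spine + core,
        # Assumption: 5 transceivers per GPU, the ratio implied by the table.
        "optics": gpus * optics_per_gpu,
    }


for n in (8_192, 16_384, 32_768, 65_536):
    print(n, fat_tree_bill_of_materials(n))
```

Both counts grow linearly with the GPU count; what changes with scale is the share of links that must be long-reach optics.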

Collective Map: All-Reduce vs. All-to-All

Not all AI communication patterns are equal. The choice between Fat-Tree and Dragonfly depends heavily on which collective communication primitives your training library (NCCL, RCCL) uses:

**All-Reduce**: Typically implemented as a "Ring" or "Recursive Doubling" algorithm. This pattern is relatively friendly to Fat-Trees because it can be localized. If your job fits within a single Rack (8-16 GPUs) or a single Leaf group, the traffic never hits the Core.
**All-to-All**: Commonly used in **MoE (Mixture of Experts)** models. Here, every GPU sends unique data to every other GPU. This is the ultimate stress test. In an MoE model with 32K experts, the network bisection bandwidth *is* the model's bottleneck. Dragonfly topologies often struggle here unless the adaptive routing can predict the experts' data-shuffling pattern.
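The difference between the two primitives shows up in who talks to whom, not only in how many bytes move. The sketch below uses the standard per-GPU volume of a ring All-Reduce (a reduce-scatter followed by an all-gather) and of an evenly split All-to-All. The function names are illustrative and do not correspond to NCCL or RCCL APIs.

```python
def ring_all_reduce_bytes_sent(message_bytes: float, gpus: int) -> float:
    """Bytes each GPU sends in a ring All-Reduce (reduce-scatter + all-gather)."""
    return 2 * (gpus - 1) / gpus * message_bytes


def all_to_all_bytes_sent(message_bytes: float, gpus: int) -> float:
    """Bytes each GPU sends when its buffer is split evenly across all GPUs."""
    return (gpus - 1) / gpus * message_bytes


def ring_peers(rank: int, gpus: int) -> tuple:
    """A ring only ever talks to its two neighbours."""
    return ((rank - 1) % gpus, (rank + 1) % gpus)


def all_to_all_peers(rank: int, gpus: int) -> list:
    """All-to-All talks to every other rank."""
    return [r for r in range(gpus) if r != rank]


print(ring_peers(0, 8), len(all_to_all_peers(0, 8)))
```

A ring keeps each GPU's traffic on two fixed neighbours, so placement can keep most of it local. All-to-All spreads traffic over every peer, so a large share of it must cross the upper tiers or the global links.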

Fat-Tree: The Gold Standard.

Leaf/Spine Regularity

In a Fat-Tree, every path between GPUs in different pods traverses the same number of switches (5 in a 3-tier fabric: leaf, spine, core, spine, leaf), and paths that stay within a pod or under a single leaf are only shorter. This **latency uniformity** is critical for synchronous AI training where "stragglers" (delayed GPUs) slow down the entire job.

Stable Ops | Full Bisection

Isolation Forensics

A Fat-Tree is naturally multitenant. Since bandwidth is uniform, Job A running on one side of the tree cannot interfere with Job B on the other side through oversubscription.

Zero Congestion | High Reliability

Dragonfly: The Cabling Optimizer.

Performance in a Fat-Tree comes at the cost of "Long Cables." As the cluster grows, the number of expensive optical links to the core switch tier explodes. **Dragonfly** was designed to solve this by organizing switches into **Virtual Groups** and connecting those groups directly.

A Dragonfly topology is defined by four global parameters:

$a$: Switches per Group
$p$: Ports to Hosts (GPUs) per Switch
$h$: Global Links per Switch
$g$: Total Groups

This structure creates a **Hierarchical Topology**. Inside a group, every switch is connected to every other switch (Intra-group). Externally, every group is connected to every other group (Inter-group).

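The maximum number of groups ( $g_{max}$ ) is determined by the number of global links. Each group has $a$ switches with $h$ global links each, so it can reach at most $ah$ other groups, giving $g_{max} = ah + 1$ . This allows a Dragonfly to scale to massive port counts with only a **diameter of 3** (at most 3 switch-to-switch hops under minimal routing: local, global, local).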
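A small sizing sketch makes the parameters concrete. It assumes the usual balanced configuration $a = 2p = 2h$ (the same condition the comparison matrix refers to as $a = 2h$) and counts endpoints as $a \cdot p \cdot g$, since each of the $g$ groups holds $a$ switches with $p$ hosts apiece. The class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dragonfly:
    a: int  # switches per group
    p: int  # host (GPU) ports per switch
    h: int  # global links per switch

    @property
    def radix(self) -> int:
        # Host ports + links to the other a-1 switches in the group + global links.
        return self.p + (self.a - 1) + self.h

    @property
    def max_groups(self) -> int:
        return self.a * self.h + 1

    @property
    def max_endpoints(self) -> int:
        return self.a * self.p * self.max_groups


def balanced_dragonfly(radix: int) -> Dragonfly:
    """Largest balanced configuration (a = 2p = 2h) that fits in `radix` ports."""
    h = (radix + 1) // 4  # from 4h - 1 <= radix
    return Dragonfly(a=2 * h, p=h, h=h)


d = balanced_dragonfly(64)
print(d, d.radix, d.max_groups, d.max_endpoints)
```

With a radix of 64 this yields 513 groups and 262,656 endpoints, roughly four times the three-tier Fat-Tree ceiling built from the same switch.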

Technical Comparison Matrix

Metric	Fat-Tree (k=R/2)	Dragonfly (a, p, h)
Scaling Limit	$R^3/4$	$ap(ah+1)$ (Significantly higher)
Hop Count (Max)	5 (3-tier)	3 (at $a=2h$)
Bisection BW	100% (Non-blocking)	Traffic-Pattern Dependent
Cable Complexity	High (Log-scale)	Optimized (Low optical count)
Routing Req.	ECMP (Basic)	UGAL/PAR (Complex)

The Routing Paradox: UGAL vs. ECMP.

In a Fat-Tree, routing is "Easy." Because the bandwidth is uniform and the network is non-blocking, a simple **ECMP (Equal-Cost Multi-Path)** hash is usually sufficient to distribute traffic. All candidate paths between two endpoints have the same hop count and capacity, so any of them performs equally well as long as the hash spreads flows evenly.
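As a minimal sketch of static ECMP path selection: the switch hashes the flow's header fields and takes the result modulo the number of equal-cost uplinks. CRC32 is used here only as a stand-in, since real ASICs use their own hash functions and field selections. The flow tuples and function names are hypothetical.

```python
import zlib
from collections import Counter


def ecmp_pick(flow: tuple, n_paths: int) -> int:
    """Pick an uplink index by hashing the flow's header fields."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % n_paths


def path_load(flows: list, n_paths: int) -> Counter:
    """Count how many flows land on each uplink."""
    return Counter(ecmp_pick(flow, n_paths) for flow in flows)


# 16 hypothetical RoCEv2 flows (UDP destination port 4791) over 8 uplinks.
flows = [("10.0.0.1", f"10.0.1.{i}", 49152 + i, 4791, "UDP") for i in range(16)]
load = path_load(flows, 8)
print(sorted(load.items()))
```

The choice is stateless and deterministic per flow, which keeps packets in order. It also means that a handful of large flows can hash onto the same uplink regardless of how busy it is.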

In a Dragonfly, this assumption breaks. If you use standard shortest-path routing (ECMP) on a Dragonfly, certain global links will become "Hotspots" while others sit idle. To approach its theoretical bisection bandwidth, Dragonfly requires **Adaptive Routing**, specifically **UGAL (Universal Globally-Adaptive Load-balanced)** routing.

Minimal Routing (MIN)

Sends packets over the shortest path. Efficient for uniform traffic, but it can be catastrophic during the group-to-group shuffles that arise in "all-to-all" collective operations, because traffic between two groups piles onto the few direct inter-group links that connect them.

Valiant Routing (VLB)

Sends packets to a random intermediate group first, effectively doubling the path length but ensuring load balancing. This prevents hotspots but wastes bandwidth and increases latency.

**UGAL** dynamically chooses between MIN and VLB at the source switch, per packet or per flow depending on the implementation. It monitors the occupancy of its local output queues. If the direct path is congested, it diverts the traffic through a random intermediate group.
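The decision rule can be sketched as a comparison of estimated delays, taking queue depth times hop count as the estimate, in the style of the local UGAL variant from the Dragonfly literature. The function name, the units, and the `bias` parameter are illustrative assumptions rather than any vendor's implementation.

```python
def ugal_choose(q_min: int, hops_min: int, q_val: int, hops_val: int, bias: int = 0) -> str:
    """Return "MIN" or "VLB" by comparing estimated delay (queue depth x hop count).

    q_min / q_val: occupancy of the local output queue toward the minimal path
    and toward the randomly chosen Valiant path (same units for both).
    bias: optional offset that favours the minimal path (assumption).
    """
    if q_min * hops_min <= q_val * hops_val + bias:
        return "MIN"
    return "VLB"


# Equal queues: the shorter path wins.
print(ugal_choose(q_min=10, hops_min=3, q_val=10, hops_val=6))
# Direct path heavily backed up: detour through an intermediate group.
print(ugal_choose(q_min=100, hops_min=3, q_val=10, hops_val=6))
```

Because only local queues are consulted, the switch reacts to remote congestion only once backpressure has propagated to its own ports.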

Cabling Hydraulics.

Despite the complexity of routing, the driving force for Dragonfly is **Optics Cost**. In a Fat-Tree, almost 50% of your networking budget is spent on transceivers and active optical cables (AOCs).

By placing more switches in a group, Dragonfly allows you to use **DACs (Direct Attach Copper)** for roughly 80% of the links. Copper is passive, uses zero power, and has a Failure-in-Time (FIT) rate near zero.

The "Golden Ratio" of Cabling

Reach	Medium	Notes
0-3m	DAC Copper (Passive)	No added retiming latency, zero power.
3-7m	ACC/AEC Copper (Active)	Redriven (ACC) or retimed (AEC), lower power than optics.
7m+	Fiber Optics (SMF)	Essential for Inter-Row links.

Optical Power Physics.

At 32K GPUs, networking power consumption becomes a thermal bottleneck. A single 800G OSFP transceiver consumes between **14W and 18W**. In a 3-tier Fat-Tree with 160,000 optics at roughly 16W each, the optics alone consume over **2.5 Megawatts** of power (160,000 × 16 W ≈ 2.56 MW), not for compute, but simply for moving photons between racks.

The Power Trajectory

Standard Pluggable (DSP-based): 16W / port

Linear Drive (LPO): 8W / port

Co-Packaged (CPO): 5W / port
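To see what the trajectory means at fabric scale, the per-port figures can be multiplied out for the 160,000-optic example. This is a back-of-the-envelope sketch: it applies one wattage to every module and ignores switch ASIC, cooling, and NIC power.

```python
WATTS_PER_PORT = {
    "Standard Pluggable (DSP-based)": 16.0,
    "Linear Drive (LPO)": 8.0,
    "Co-Packaged (CPO)": 5.0,
}


def optics_power_megawatts(optics_count: int, watts_per_port: float) -> float:
    """Total optics power in megawatts for a fabric with `optics_count` modules."""
    return optics_count * watts_per_port / 1e6


for name, watts in WATTS_PER_PORT.items():
    print(f"{name}: {optics_power_megawatts(160_000, watts):.2f} MW")
```

At 160,000 modules the three options work out to about 2.56 MW, 1.28 MW, and 0.80 MW respectively.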

This "Power Tax" is why Dragonfly is gaining traction: it physically removes the middle tier of switches and optics, effectively deleting 30% of the entire network's power footprint.

Topology Encyclopedia.

Switch Radix (R)

The total number of high-speed ports on a single switch ASIC. Higher radix allows for fewer tiers and lower hop counts at scale.

Bisection Bandwidth

The minimum bandwidth between two equal halves of the network. A "Full Bisection Bandwidth" (FBB) network has no bottlenecks for all-to-all patterns.

Network Diameter

The maximum number of hops between any two nodes in the topology. Clos networks typically have a diameter of 3 or 5.

Oversubscription

The ratio of potential bandwidth from consumers to the available bandwidth in the core. 1:1 is non-blocking; 2:1 means the core can only handle 50% of peak load.

Adaptive Routing

Traffic steering decisions made at packet-time based on real-time link occupancy rather than static hashing.

UGAL

Universal Globally-Adaptive Load-balanced routing. The definitive algorithm for Dragonfly topologies to balance minimal and non-minimal paths.

SHARP (In-Network Computing)

Scalable Hierarchical Aggregation and Reduction Protocol. Offloads All-Reduce math to the switch ASIC itself, reducing latency by avoiding multiple host-memory trips.

PFC & ECN

Priority Flow Control and Explicit Congestion Notification. Key protocols for Lossless Ethernet (RoCEv2) to prevent buffer overflows without dropping packets.

LPO (Linear Drive Optics)

Optical modules that remove the DSP chip, relying on the host ASIC for signal compensation. Reduces power by ~50% per link.

CPO (Co-Packaged Optics)

Integrating optical engines directly onto the switch ASIC package to eliminate the electrical reach requirement of pluggable modules.

Tail Latency

The latency of the slowest 1% or 0.1% of packets. In synchronous training, tail latency is the only latency that matters.

All-Reduce

A collective communication primitive where all nodes share and sum their gradients. The primary "heavy lift" of Distributed Data Parallel (DDP) training.

ECMP (Static Hash)

Equal-Cost Multi-Path. A standard routing technique that hashes packet headers to select a path. Prone to hash collisions and hotspots.

Folded Clos

The mathematical name for a Fat-Tree, where the multi-stage network is "folded" to keep leaf switches and hosts in the same physical racks.

Theoretical Bisection Max

13.1 PB/s Peak.

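Simulation conducted for a 65,536-node cluster using 800G per-link bandwidth in a non-blocking 3-tier Clos fabric. The 13.1 PB/s figure matches the aggregate of all 65,536 ports counted in both directions (65,536 × 2 × 100 GB/s); the one-way bandwidth across the bisection cut itself is 32,768 × 100 GB/s, or about 3.3 PB/s. Real-world utilization is subject to adversarial traffic entropy and hashing, and to thermal throttling of optical transceivers.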

© 2026 Pingdo Labs. Technical Reference Series No. 14.

Adversarial Traffic Pattern Testing

Before deploying a 32K-GPU Dragonfly fabric into production, hyperscalers run a battery of adversarial traffic benchmarks designed to trigger the topology's worst-case congestion modes. The most punishing test is the "Permutation Matrix" where each GPU sends a full-bandwidth stream to a specific remote GPU chosen to maximize the number of global link conflicts — a scenario that mimics the all-to-all shuffle in Mixture of Experts (MoE) training.

Hotspot Injection Methodology

A specialized traffic generator (implemented on a cluster of 32 x86 nodes with dual 800G ConnectX-8 NICs) creates synthetic traffic matrices with controlled entropy. The generator sweeps through five classes: (1) uniform random pairs, (2) adversarial permutations, (3) incast bursts to a single GPU, (4) barrier-synchronized All-Reduce rings, and (5) phased MoE dispatcher patterns. Each class runs for 60 seconds while the fabric telemetry system records per-port queue depth, ECN mark rate, and effective goodput.
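As an illustration of class (2), one well-known adversarial pattern for Dragonfly is the group shift, in which every node in group $i$ targets a node in group $i+1$. The sketch below is a hypothetical generator for that single pattern, not the traffic tool described above; node IDs are assumed to be numbered group by group.

```python
from collections import Counter


def adversarial_group_shift(groups: int, nodes_per_group: int) -> list:
    """dest[src]: each node targets the same-index node in the next group."""
    total = groups * nodes_per_group
    return [(src + nodes_per_group) % total for src in range(total)]


def flows_per_group_pair(dest: list, nodes_per_group: int) -> Counter:
    """Count flows for each (source group, destination group) pair."""
    return Counter(
        (src // nodes_per_group, dst // nodes_per_group)
        for src, dst in enumerate(dest)
    )


dest = adversarial_group_shift(groups=4, nodes_per_group=3)
print(dest)
print(flows_per_group_pair(dest, 3))
```

Every flow leaving a group is aimed at the same neighbouring group, so under minimal routing all of them compete for the direct global links between that one pair while the links to every other group stay idle.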

Dragonfly Global Link Contention Thresholds

Empirical testing on a 16,384-node Dragonfly+ reveals that global link contention exceeds the 10% packet-loss-equivalent threshold when more than 40% of the traffic is adversarial all-to-all. At that point, the UGAL adaptive routing algorithm's non-minimal path selection rate jumps from 12% to 71%, and the Valiant Load Balancing overhead increases average hop latency by 2.1x. The mitigation is to partition the MoE expert groups into local topological clusters so that 80% of the expert-to-expert communication stays within a single Dragonfly group, avoiding the expensive global links.

Fat-Tree Sensitivity to Incast

For Fat-Trees, the incast test is the most revealing. When 256 GPUs simultaneously send data to a single GPU's HBM (a 256:1 incast; the ratio is a test parameter), the ToR switch's shared buffer saturates within 12 microseconds. The lossless RoCEv2 fabric reacts with PFC pause frames that cascade up the tree. The Fat-Tree recovers 3.2x faster than Dragonfly from this incast event because its higher path diversity allows the AR engine to drain the backlog across more available spines.

ADV_TEST_2026

Adversarial permutation matrix test results

"The Dragonfly fabric passed the uniform-random test at 95% utilization but collapsed to 41% under the adversarial permutation. This forced us to redesign our MoE expert placement policy."

— Topology Validation Lead, AI Cloud Y

Adaptive Routing Convergence in Dragonfly Topologies

Dragonfly topologies present a unique challenge for adaptive routing (AR) algorithms due to their hierarchical structure of groups, each connected to all other groups via a small number of global links. Unlike a Fat-Tree, where routing decisions are made independently at each leaf-spine boundary, Dragonfly AR must coordinate across three levels: local (within a group), global (between groups), and intermediate (via a third group on non-minimal paths). The convergence of AR state across these levels determines whether the fabric achieves its theoretical bisection bandwidth or suffers from load imbalance.

The critical parameter in Dragonfly AR is the **Valiant Load Balancing (VLB) threshold**. VLB randomly routes each packet to an intermediate group before forwarding it to the final destination group, smoothing traffic patterns that would otherwise overload a single global link. The threshold determines when VLB is used: when the queue depth of the direct global link exceeds a configurable value (default 50% of the buffer capacity), the AR algorithm switches from direct-route to VLB mode. The hysteresis between direct and VLB modes must be carefully tuned — if the switch to VLB is too aggressive, packets take a 3-hop detour through an intermediate group even when the direct link has available capacity, wasting global bandwidth. If too conservative, a congested direct link causes buffer overflow before VLB engages.
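The hysteresis described above can be sketched as a two-threshold state machine. The 50% entry threshold comes from the text; the 40% exit threshold and the class name are assumptions chosen only to illustrate the gap between the two.

```python
class VlbModeSelector:
    """Switch between direct and VLB routing with separate enter/exit thresholds."""

    def __init__(self, enter_vlb: float = 0.50, exit_vlb: float = 0.40):
        if not 0.0 <= exit_vlb < enter_vlb <= 1.0:
            raise ValueError("need 0 <= exit_vlb < enter_vlb <= 1")
        self.enter_vlb = enter_vlb
        self.exit_vlb = exit_vlb  # assumption: not specified in the text
        self.use_vlb = False

    def update(self, queue_fill: float) -> str:
        """queue_fill: direct global link queue depth as a fraction of buffer capacity."""
        if not self.use_vlb and queue_fill > self.enter_vlb:
            self.use_vlb = True
        elif self.use_vlb and queue_fill < self.exit_vlb:
            self.use_vlb = False
        return "VLB" if self.use_vlb else "DIRECT"


selector = VlbModeSelector()
print([selector.update(fill) for fill in (0.30, 0.55, 0.45, 0.35)])
```

The third sample (45%) stays in VLB mode even though it is below the entry threshold. That gap is what stops the router from flapping between modes when the queue hovers near a single cutoff.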

In NVIDIA's Quantum-3 InfiniBand implementation, the optimal VLB threshold is 60% for XDR (800 Gb/s) links. At this threshold, simulation of a 16,384-GPU Dragonfly+ fabric shows the fabric achieves 94% of theoretical bisection bandwidth under uniform random traffic, dropping to 87% under adversarial all-to-all patterns (such as the MoE expert shuffle). The 7-point drop under adversarial traffic is inherent to Dragonfly's limited global link count; a Fat-Tree of equivalent scale would maintain 97% utilization under the same pattern, but at 3x the optical cable cost. The HPC and AI community has converged on a **Hybrid Dragonfly-Fat-Tree** topology for clusters exceeding 32,000 GPUs, where the intra-group connectivity uses Dragonfly's efficient cabling but the inter-group layer adds a Fat-Tree spine for adversarial pattern resilience.

The AR convergence time across all Dragonfly levels is dominated by the global link propagation delay. For a cluster spanning 200 meters (typical datacenter floor), the inter-group link latency is 1 microsecond. The AR algorithm must propagate queue depth information across all groups within this window to make globally-informed routing decisions. NVIDIA's Quantum-3 implements a **Hierarchical Congestion Notification** scheme where each group's aggregate queue depth is broadcast to all other groups via dedicated control messages every 500 nanoseconds. The total convergence time — from a congestion event to all groups reacting — is approximately 3 microseconds, sufficient to prevent buffer overflow at 800G link rates where the buffer fills at 100 GB/s.
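The timing budget can be checked with two small calculations. The sketch assumes light travels through fiber at roughly 5 ns per meter and treats the link as fully loaded for the whole convergence window; the function names are illustrative.

```python
FIBER_DELAY_NS_PER_M = 5.0  # assumption: about 2e8 m/s in glass


def propagation_delay_us(length_m: float) -> float:
    """One-way propagation delay over a fiber of the given length."""
    return length_m * FIBER_DELAY_NS_PER_M / 1000.0


def bytes_arriving(link_gbps: float, window_ns: float) -> float:
    """Bytes a fully loaded link delivers during `window_ns`.

    Gb/s multiplied by ns gives bits, so dividing by 8 gives bytes.
    """
    return link_gbps * window_ns / 8


print(propagation_delay_us(200))    # one-way delay across a 200 m floor, in us
print(bytes_arriving(800, 3_000))   # bytes arriving during a 3 us convergence window
```

Under these assumptions, a 200 m link has a one-way delay of 1 microsecond, and about 300 KB arrives per congested 800G port during the 3 microsecond window. That is the amount of buffer headroom a port needs to absorb before remote groups react.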
