What is a non-blocking fat-tree topology?

A non-blocking fat-tree ensures that any node can communicate with any other node at full line rate, regardless of other traffic in the network. It requires equal bandwidth at each layer of the hierarchy.

Why is rail-optimized networking used for GPUs?

Rail optimization places the same-index GPU from every node (for example, GPU0 of each server) on the same leaf switch. This minimizes latency for the All-Reduce collective operations common in distributed training.

Should I use InfiniBand or RDMA over Converged Ethernet (RoCE)?

InfiniBand offers lower latency and credit-based flow control out of the box. RoCE v2 (Ethernet) is more cost-effective and leverages existing enterprise networking expertise but requires careful DCB/PFC configuration.

Name: AI Fabric Topology Builder
Author: Wael Abdel-Ghalil

Professional-grade design for non-blocking fat-tree and rail-optimized topologies. Calculate switch counts, cable runs, and bisection bandwidth for 800G clusters.

Fat-Tree | Dragonfly+ | Rail-Optimized | 800G Ethernet

Fabric Topology Builder

Design massive-scale AI networking fabrics with precision.

GPU Fabric Builder

Design and simulate 2-tier Fat-Tree topologies for massive GPU clusters. Calculate required switches, transceivers, and bisection bandwidth in real-time.

Architectural Insights

The Non-Blocking Myth

To achieve true non-blocking (1:1) performance, the total bandwidth coming into the Leaf layer from GPUs must be less than or equal to the bandwidth leaving the Leaf layer toward the Spines. If you use a 64-port switch and connect 48 GPUs, you only have 16 ports for uplinks, resulting in a 3:1 oversubscription.

Rail Optimization

In this builder, we assume a standard Fat-Tree. In production H100 pods, "Rail-Optimization" maps specific GPU IDs to specific leaf switches to optimize All-Reduce collectives. This builder calculates the aggregate capacity required to support those patterns.

Pingdo Reference Series | Network Engineering

Architecting Non-Blocking AI Fabrics

Optimization Strategies for Scale-Out GPU Clusters

Wael Abdel-Ghalil | Last Updated: March 20, 2026 | 15 min read

Beyond the Leaf: The Fat-Tree Standard

In the era of Generative AI, the network is no longer a peripheral component—it is the backplane of a massive distributed computer. A traditional oversubscribed network that works for web traffic will collapse under the weight of an LLM training job. We must move towards strictly non-blocking topologies where the bisection bandwidth matches the aggregate injection rate of all GPUs.

Rail Alignment

By aligning GPU rails (e.g., all GPU0s) to the same top-of-rack switches, we minimize the hop count for the most frequent communication patterns in 3D parallelism.

Radix Scaling

The switch radix (port count) determines how many tiers are required for a given cluster size. High-radix switches (64-128 ports) reduce latency by minimizing the number of switch hops between endpoints.

InfiniBand vs. RoCE v2

While InfiniBand remains the gold standard for pure performance due to its hardware-level, credit-based flow control and low header overhead, next-generation RDMA over Ethernet (RoCE v2) has closed the gap. Modern Ethernet switches with large buffers and carefully tuned ECN/PFC settings can now support clusters with tens of thousands of GPUs at near-IB efficiency.

Using This Tool

The Fabric Topology Builder translates high-level cluster requirements into concrete networking bills of materials. Follow these steps to model your AI fabric with precision.

Step 1: Select GPU Count. Enter the total number of accelerators in your target cluster. The tool scales its calculations based on this figure, which drives everything from leaf switch count to the number of spine tiers. For production planning, include a 5–10% growth buffer beyond your immediate procurement target so the fabric does not become the bottleneck when the cluster expands.

Step 2: Choose Topology Type. Select from three architecture families. Fat-Tree delivers strict non-blocking bisection bandwidth, the standard for training clusters where every byte of gradient data must arrive on time. Rail-Optimized aligns GPUs by index across nodes onto the same leaf switch, collapsing the hop distance for All-Reduce and All-Gather collectives. Dragonfly+ uses groups of fully connected switches with global links between groups, reducing the switch count and cost-per-port at very large scale but introducing subtle routing and load-balancing considerations that demand careful validation.

Step 3: Configure Switch Radix and Port Speed. The radix, the number of ports per switch, determines how many tiers the topology requires. A 64-port switch can serve a 32-GPU leaf group with non-blocking uplinks (32 ports down, 32 ports up); a 128-port switch doubles that to 64 GPUs per leaf. Port speed (400G or 800G per port) sets the raw injection rate per GPU. Multiply GPU count by port speed to get aggregate fabric bandwidth, then use the tool to verify that the selected topology can deliver it at each tier without creating bottlenecks at the spine or super-spine layer.

Step 4: Interpret the Output. The builder displays the total switch count at each tier (leaf, spine, super-spine), the number of optical or copper cable runs, and the bisection bandwidth, which is the throughput available when half the GPUs communicate with the other half simultaneously. A ratio of 1:1 (delivered bisection bandwidth relative to the ideal value, which is half the aggregate injection rate) means fully non-blocking. Ratios below 1:1 indicate oversubscription; ratios above 1:1, while rare, suggest surplus bandwidth that may inflate cost without practical benefit. Use the cable count to estimate your optics budget and physical cable management requirements before placing purchase orders with your networking vendor.

Design Scenarios

512-GPU LLM Training Cluster

A 512-GPU cluster built around NVIDIA DGX H100 or B200 nodes (eight GPUs per node, 64 nodes, eight rails) is the canonical entry point for enterprise-grade LLM training. In a fat-tree layout with 400G InfiniBand or RoCE links, each leaf switch hosts one rail of GPUs (64 GPUs per leaf, 8 leaves). The spine tier, built from 64-port switches, carries the traffic from all leaves. With a 2:1 oversubscription you can train a 70B-parameter model comfortably; for a 175B model with tensor parallelism across eight GPUs and pipeline parallelism across nodes, you want 1:1 non-blocking to prevent the All-Reduce latency from dominating the critical path. At 1:1, each leaf needs 64 uplinks to match its 64 GPU-facing ports, so expect roughly 8 leaf switches, 8 spine switches, and approximately 512 transceiver-terminated fiber runs between leaf and spine. Cable management at this scale is non-trivial: each DGX ships with eight network interfaces, producing 64 fabric-facing links per rack, and every link must be mapped to the correct leaf port to preserve rail alignment.

The cost profile for this cluster is dominated by optics. At 400G, a single QSFP-DD transceiver can run upwards of $800; at 512 GPUs with two ports per GPU (for redundancy or rail-optimized dual-rail), the optical budget alone exceeds $800,000 (512 × 2 × $800 ≈ $819,000, counting one transceiver per GPU port). Running active optical cables (AOCs) instead of pluggable optics reduces the per-link cost modestly but limits reconfiguration flexibility. Copper DACs, at roughly one-tenth the cost per link, are viable only for leaf-to-server connections within a rack: the 3-meter reach ceiling makes them useless for leaf-to-spine or spine-to-super-spine distances, where runs can easily exceed 50 meters.
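The optics arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name and the `transceivers_per_port` parameter are illustrative assumptions, and the $800 unit price is the figure quoted in the text, not a vendor quote.

```python
def optics_budget(gpu_count: int, ports_per_gpu: int, unit_price_usd: float,
                  transceivers_per_port: int = 1) -> float:
    """Transceiver spend for the GPU-facing ports of a cluster.

    transceivers_per_port=1 counts only one end of each link (as the text
    does); set it to 2 to count the switch-side module as well.
    """
    return gpu_count * ports_per_gpu * transceivers_per_port * unit_price_usd


# 512 GPUs, two ports each, $800 per 400G transceiver (figures from the text).
budget = optics_budget(gpu_count=512, ports_per_gpu=2, unit_price_usd=800)
print(f"${budget:,.0f}")  # $819,200
```

Counting both ends of every link doubles the figure, which is why the optics line item deserves its own review before a purchase order is placed.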

4,096-GPU Inference Farm

Inference clusters exhibit fundamentally different traffic patterns from training clusters. An LLM serving request is processed by a single GPU (or a small tensor-parallel group of 4 or 8), not by the entire cluster simultaneously. The dominant communication is not All-Reduce but request routing, KV-cache lookups, and disaggregated prefill-decode handoffs. This means you can tolerate far higher oversubscription ratios (3:1 or even 4:1) without measurable latency impact, because only a fraction of GPUs contend for fabric bandwidth at any moment.

At 4,096 GPUs in a 4:1 oversubscribed fat-tree, you might build with 64 leaf switches (64 GPUs each) and only 8 spine switches, yielding 512 leaf-to-spine links (8 uplinks per leaf). Note that 64 downlinks over 8 uplinks is 4:1 only if each uplink runs at twice the GPU port speed, for example 800G uplinks over 400G GPU ports; at equal speeds the same layout is 8:1. The cost savings relative to a 1:1 fabric are dramatic: up to 60% fewer spine switches and optics. The trade-off is headroom: if your inference workload includes periodic burst training for continuous fine-tuning, the oversubscribed fabric will throttle gradient synchronisation to the speed of the weakest link. Plan the oversubscription ratio by profiling your actual workload mix under realistic traffic conditions, not by assuming worst-case training demands that may never materialize in an inference-dominant deployment.

16K-GPU Frontier-Scale

At 16,384 GPUs, a conventional three-tier fat-tree (leaf, spine, super-spine) begins to strain both economics and cabling logistics. With 128-port switches and 64 GPUs per leaf, a strictly non-blocking build needs 256 leaves, 256 spine switches (each splitting its ports 64 down and 64 up), and 128 super-spine switches (all 128 ports facing down): roughly 640 switches, tens of thousands of optical transceivers, and a cabling density that challenges the physical design of the data hall. The cable plant alone can consume 20% of the total cluster budget, and the mean time between failures across that many optical links becomes a first-order operational concern.

Dragonfly+ offers an alternative at this scale. By organizing switches into groups with all-to-all intra-group links and selective inter-group connections, the total switch count drops by 20–30% compared to an equivalent three-tier fat-tree. The penalty is adaptive routing complexity: packets traversing multiple groups may encounter non-minimal paths, and the tail latency of All-Reduce can increase if the routing algorithm makes poor path choices under load. Modern implementations from Meta and HPE Slingshot mitigate this with per-hop credit-based flow control and telemetry-driven congestion awareness that dynamically steers traffic away from hot spots in the fabric.

Optical circuit switching (OCS) is emerging as a complementary technology for 16K+ clusters. By deploying an OCS patch panel between the spine and super-spine tiers — Google's Jupiter and TPU v5p fabrics use this approach — operators can reconfigure the topology in software to match the communication pattern of the current job. An All-Reduce-heavy training run might use a low-diameter folded Clos; an inference-dominated workload might rewire to a simpler spine-leaf. OCS eliminates the need to physically recable and allows topologies to evolve with the workload, though it introduces a reconfiguration latency of tens of milliseconds per topology switch that must be amortized across sufficiently long-running jobs.

Common Pitfalls

Oversubscription Without Headroom During All-Reduce

The most frequent mistake in AI fabric design is applying web-tier oversubscription ratios (3:1 or higher) to training clusters. During an All-Reduce operation, every GPU in the job simultaneously transmits and receives an equal volume of gradient data — there are no idle endpoints to soak up spare bandwidth. If the fabric provides less bisection bandwidth than half the aggregate injection rate, the reduction tree stalls, backpressure propagates through the switch hierarchy, and the effective GPU utilization — already challenged by pipeline bubbles and checkpointing — drops below 50%. For synchronous distributed training, design for the collective operation, not for the average-case east-west traffic profile inherited from web-tier networking.

Ignoring Rail Alignment for GPU Clusters

When GPU0 of node A and GPU0 of node B are connected to different leaf switches, every All-Reduce across those GPUs must traverse a spine hop, adding approximately 500 nanoseconds of switching latency per step. Over a training run lasting weeks, this cumulative latency can add days to the wall-clock time. Rail alignment — mapping the same-index GPU across all nodes to the same leaf switch — keeps the most frequent communication pattern within a single switch ASIC, eliminating the spine hop for the reduce-scatter and all-gather phases that dominate NCCL collective communications. The tool's Rail-Optimized mode models exactly this constraint.
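The rail-alignment rule described above can be sketched as a port-mapping function. This is an illustration under stated assumptions: eight GPUs per node, one leaf switch per rail, and hypothetical helper names (`rail_leaf`, `build_rails`) that are not part of the tool.

```python
from collections import defaultdict

GPUS_PER_NODE = 8  # assumption: DGX-style node with eight GPUs and eight NICs


def rail_leaf(node_id: int, gpu_index: int) -> int:
    """Rail-aligned placement: the leaf depends only on the GPU index,
    so GPU0 of every node lands on leaf 0, GPU1 on leaf 1, and so on."""
    return gpu_index


def build_rails(node_count: int) -> dict[int, list[tuple[int, int]]]:
    """Group every (node, gpu) endpoint by the leaf switch it attaches to."""
    rails = defaultdict(list)
    for node in range(node_count):
        for gpu in range(GPUS_PER_NODE):
            rails[rail_leaf(node, gpu)].append((node, gpu))
    return dict(rails)


rails = build_rails(node_count=64)   # 64 nodes x 8 GPUs = 512 GPUs
print(len(rails), len(rails[0]))     # 8 leaves, 64 GPUs on each
```

The one-leaf-per-rail assumption holds only while the node count does not exceed the number of downlink ports on a leaf; beyond that, each rail spans several leaves and same-rail traffic crosses a spine again.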

DAC Cable Length Limitations at 400G/800G

Direct-attach copper (DAC) cables are cost-effective — roughly $50 per 400G link versus $300–$800 for active optical cables — but they are physically limited to approximately 3 meters at 400G (QSFP-DD) and shrink to roughly 2 meters at 800G (OSFP/QSFP-DD800) due to signal integrity degradation at high frequencies. Attempting to stretch DACs beyond their rated distance or using them for leaf-to-spine connections across aisles produces symbol errors, FEC uncorrectable blocks, and intermittent link flaps that are notoriously difficult to diagnose. Reserve DACs for in-rack server-to-leaf connections; use optical transceivers or AOCs for any run that crosses a rack boundary, no matter how tempting the cost difference looks on a spreadsheet.

Under-Provisioning Switch Buffer for Incast

Incast — the many-to-one traffic pattern where N GPUs simultaneously send to a single destination — is endemic to distributed training, particularly during gradient aggregation. When a spine switch port receiving traffic from dozens of leaf switches runs out of buffer memory, it issues Priority Flow Control (PFC) pauses that ripple backward through the fabric, creating congestion trees that throttle throughput far beyond the original bottleneck. Industry guidance suggests provisioning at least 64 MB of shared buffer per 400G port, with deep virtual output queues, to absorb incast bursts without triggering PFC storms. Verify your switch silicon's buffer architecture before committing to a vendor; buffer depth is a fixed silicon property, not a software-upgradable feature, and insufficient buffer is the root cause of many a weekend outage postmortem.
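To get a feel for how quickly an incast burst exhausts a buffer, the following sketch estimates the time to overflow. It assumes every sender transmits at line rate toward one egress port that drains at the same line rate, and that no flow control or ECN reaction occurs in the meantime; the function name is illustrative.

```python
def incast_overflow_time_us(buffer_bytes: float, n_senders: int,
                            link_gbps: float) -> float:
    """Microseconds until the buffer fills when n_senders converge on one
    egress port. The port drains one line rate, so the excess arrival rate
    is (n_senders - 1) line rates."""
    link_bytes_per_s = link_gbps * 1e9 / 8
    excess_bytes_per_s = (n_senders - 1) * link_bytes_per_s
    if excess_bytes_per_s <= 0:
        return float("inf")  # a single sender never overfills the queue
    return buffer_bytes / excess_bytes_per_s * 1e6


# 64 MB of buffer (the figure quoted above), 16 senders, 400G links.
t_us = incast_overflow_time_us(64e6, n_senders=16, link_gbps=400)
print(f"{t_us:.1f} us")
```

Even a large buffer absorbs only tens of microseconds of unthrottled incast in this model, which is why buffer depth has to be paired with congestion signalling rather than relied on alone.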

Best Practices

These practices emerged from the operational experience of hyperscaler AI networking teams and represent the current consensus on building reliable, performant GPU fabrics at scale.

Design for the Collective, Not the Average

Web-scale traffic engineering teaches us to dimension for the 95th percentile of utilization. AI fabrics invert this principle: the critical path is defined by the All-Reduce ring, the All-Gather broadcast tree, and the Reduce-Scatter butterfly, each of which demands near-simultaneous full-mesh communication across the entire job. Dimension the fabric for the peak instantaneous bandwidth demanded by these collectives, using worst-case message sizes (often the full gradient tensor, which can exceed 1 GB per layer for large models), not the time-averaged throughput. A fabric that looks 80% idle in SNMP polling is likely saturated during the millisecond bursts that actually determine training throughput.
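As a worked example of dimensioning for the collective, the sketch below uses the standard ring All-Reduce traffic volume, in which each GPU sends 2 × (N − 1) / N times the message size. It gives a bandwidth-only lower bound that ignores per-hop latency, protocol overhead, and compute overlap; the function names are illustrative.

```python
def ring_allreduce_bytes_per_gpu(message_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU transmits in a ring All-Reduce
    (reduce-scatter followed by all-gather)."""
    return 2 * (n_gpus - 1) / n_gpus * message_bytes


def min_transfer_seconds(message_bytes: float, n_gpus: int,
                         link_gbps: float) -> float:
    """Bandwidth-only lower bound on the All-Reduce duration."""
    bytes_per_second = link_gbps * 1e9 / 8
    return ring_allreduce_bytes_per_gpu(message_bytes, n_gpus) / bytes_per_second


# 1 GB gradient tensor, 512 GPUs, 400G per GPU.
t = min_transfer_seconds(1e9, n_gpus=512, link_gbps=400)
print(f"{t * 1e3:.1f} ms")
```

A burst of this length is invisible to polling at multi-second intervals, which is how a fabric can appear mostly idle while still gating every training step.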

Right-Size Switch Radix for TCO

Higher radix reduces tier count and cable complexity but increases per-port cost and power draw. At 64 ports, a single switch chip typically fits within a 1U form factor and draws approximately 300W; at 128 ports, you move to 2U and roughly 600W. The total cost of ownership crossover between radix sizes depends on cluster scale and optics cost. For clusters under 2,048 GPUs, 64-port switches usually minimize TCO. Between 2,048 and 8,192 GPUs, 128-port switches avoid an extra super-spine tier and reduce the total device count enough to offset the higher unit cost. Run the numbers with your specific optics pricing and per-kilowatt power rates — do not default to the highest radix available without a TCO model that accounts for the full 5-year lifecycle.

Plan for Incast Congestion with Proper ECN

Explicit Congestion Notification (ECN) enables switches to mark packets rather than drop them when queues build, allowing end-hosts to throttle before loss occurs. In RoCE v2 fabrics, ECN is mandatory for maintaining lossless behavior without the cascading PFC storms that plagued early 25G RoCE deployments. Configure ECN marking thresholds conservatively — start marking at 40% of buffer occupancy and apply a low, stable marking probability ramp — to give hosts time to reduce their injection rate before the buffer overflows. Pair ECN with DCQCN (Data Center Quantized Congestion Notification) at the NIC level for a closed-loop rate-control system. Test your ECN configuration under synthetic incast workloads before onboarding production training jobs; a misconfigured ECN profile is indistinguishable from a hardware failure at 3 AM.
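The marking ramp described above can be sketched as a RED-style function of queue depth, the shape commonly used for switch-side ECN marking with DCQCN. The threshold names (`k_min`, `k_max`, `p_max`) and the example values other than the 40% starting point are assumptions for illustration, not recommended settings for any specific switch.

```python
def ecn_mark_probability(queue_bytes: float, k_min: float, k_max: float,
                         p_max: float) -> float:
    """Probability of ECN-marking a packet at the given queue depth.

    Below k_min nothing is marked; between k_min and k_max the probability
    rises linearly to p_max; at or above k_max every packet is marked.
    """
    if queue_bytes <= k_min:
        return 0.0
    if queue_bytes >= k_max:
        return 1.0
    return p_max * (queue_bytes - k_min) / (k_max - k_min)


BUFFER = 1_000_000            # hypothetical per-queue buffer in bytes
K_MIN = 0.40 * BUFFER         # start marking at 40% occupancy (from the text)
K_MAX = 0.80 * BUFFER         # assumption: mark everything at 80%
P_MAX = 0.10                  # assumption: gentle ramp up to 10%

for depth in (300_000, 600_000, 900_000):
    print(depth, ecn_mark_probability(depth, K_MIN, K_MAX, P_MAX))
```

A low `p_max` keeps the ramp gentle so that senders slow down gradually instead of oscillating, while `k_max` acts as the hard backstop before the buffer overflows.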

Future-Proof with 800G-Ready Uplinks

The GPU roadmap shows no sign of slowing. Blackwell B200 ships with 800G ConnectX-8 NICs, and the Rubin platform promises 1.6T per GPU. A fabric built entirely on 400G today will be a bottleneck in 18 months when the next GPU generation arrives. The pragmatic approach: deploy 800G-capable spine ports today, even if you terminate them with 400G optics using breakout cables, so that upgrading to 800G requires only swapping the leaf-facing optics and server NICs — not a full spine hardware refresh. The marginal cost of 800G-capable switch silicon over 400G-only silicon is typically less than 15%, a small premium for doubling the upgrade headroom. Where possible, select switches that share the same ASIC generation across leaf, spine, and super-spine tiers to maintain consistent feature parity and management abstraction.

The Math

Bisection Bandwidth

For a fat-tree of radix k with N endpoints connected at the leaf tier:

B_bisect = (N × port_speed) / 2 // ideal non-blocking
B_leaf-uplink = leaf_count × uplink_ports_per_leaf × port_speed
B_bisect_actual = min(B_bisect, B_leaf-uplink / 2) // only half of the leaf uplinks sit on each side of the cut
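The bisection formulas can be checked numerically with the sketch below. It assumes a two-tier fabric with uniform port speed and takes the capacity across the cut to be half of the total leaf uplink bandwidth, since only half of the leaves sit on each side; that reading and the function name are assumptions.

```python
def bisection_bandwidth_gbps(n_endpoints: int, port_speed_gbps: float,
                             leaf_count: int, uplinks_per_leaf: int):
    """Return (ideal, delivered) bisection bandwidth in Gb/s."""
    ideal = n_endpoints * port_speed_gbps / 2
    leaf_uplink_total = leaf_count * uplinks_per_leaf * port_speed_gbps
    # Half of the leaves, and therefore half of the uplinks, are on each side.
    delivered = min(ideal, leaf_uplink_total / 2)
    return ideal, delivered


# 2,048 GPUs at 400G on 64 leaves: 32 uplinks per leaf is non-blocking,
# 16 uplinks per leaf halves the delivered bisection bandwidth.
print(bisection_bandwidth_gbps(2048, 400, leaf_count=64, uplinks_per_leaf=32))
print(bisection_bandwidth_gbps(2048, 400, leaf_count=64, uplinks_per_leaf=16))
```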

Oversubscription Ratio

The oversubscription ratio quantifies how much contention exists when all endpoints demand full bandwidth simultaneously:

O = (total_leaf_downlink_bandwidth) / (total_leaf_uplink_bandwidth)
// O = 1 → non-blocking; O = 2 → 2:1 oversubscribed; O = 4 → 4:1
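A short sketch of the ratio, using exact fractions so that results such as 3:1 print cleanly. The function name and the per-direction speed parameters are illustrative assumptions; they make it possible to model leaves whose uplinks run faster than their GPU-facing ports.

```python
from fractions import Fraction


def oversubscription(down_ports: int, up_ports: int,
                     down_gbps: int = 400, up_gbps: int = 400) -> Fraction:
    """Leaf downlink bandwidth divided by leaf uplink bandwidth."""
    return Fraction(down_ports * down_gbps, up_ports * up_gbps)


print(oversubscription(48, 16))            # 3  -> 3:1, the 64-port example
print(oversubscription(32, 32))            # 1  -> non-blocking
print(oversubscription(64, 8, 400, 800))   # 4  -> 4:1 with 800G uplinks
```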

Radix Scaling — Tiers Required

The number of tiers T needed to support N endpoints with switch radix k (assuming half the ports connect downward, half upward at each tier):

endpoints_per_leaf = k / 2
leaf_count = N / (k / 2)
T = 2 // for leaf_count ≤ k (2-tier, no super-spine)
T = 3 // for leaf_count > k (3-tier with super-spine)

// Example: N = 2,048 endpoints, k = 64
leaf_count = 2,048 / 32 = 64 // 64 leaves, each serving 32 endpoints
// leaf_count (64) ≤ k (64) → 2-tier fat-tree suffices

// Example: N = 8,192 endpoints, k = 64
leaf_count = 8,192 / 32 = 256 // 256 leaves
// leaf_count (256) > k (64) → 3-tier fat-tree required
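The tier rule above translates directly into code. This sketch mirrors the two-case rule as written and does not check whether a three-tier design has enough capacity for very large N; the function name is illustrative.

```python
import math


def tiers_required(n_endpoints: int, radix: int) -> tuple[int, int]:
    """Return (leaf_count, tiers) for a fat-tree where each leaf uses half
    of its ports for endpoints and half for uplinks."""
    endpoints_per_leaf = radix // 2
    leaf_count = math.ceil(n_endpoints / endpoints_per_leaf)
    tiers = 2 if leaf_count <= radix else 3
    return leaf_count, tiers


print(tiers_required(2048, 64))   # (64, 2)
print(tiers_required(8192, 64))   # (256, 3)
```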

Total Switch Count

For a non-blocking fat-tree with radix k and t tiers:

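S_total = S_leaf + S_spine + S_super-spine // same radix k at every tier
S_leaf = N / (k / 2)
S_spine = S_leaf / 2 // 2-tier non-blocking: spines use all k ports facing down
S_spine = S_leaf // 3-tier non-blocking: spines split k/2 down, k/2 up
S_super-spine = S_spine / 2 // 3-tier: super-spines use all k ports facing down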
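Switch counts can also be derived by counting ports directly, which is a useful cross-check on any closed-form expression. The sketch assumes the same radix k at every tier, a non-blocking design, and a top tier that uses all of its ports facing down; under those assumptions the uplinks of one tier must equal the downlinks of the next. The function name is illustrative.

```python
import math


def switch_counts(n_endpoints: int, radix: int) -> dict[str, int]:
    """Non-blocking fat-tree switch counts obtained by port counting."""
    half = radix // 2
    leaves = math.ceil(n_endpoints / half)
    leaf_uplinks = leaves * half
    if leaves <= radix:
        # Two tiers: each spine terminates `radix` leaf uplinks.
        spines, supers = math.ceil(leaf_uplinks / radix), 0
    else:
        # Three tiers: spines use half their ports down, half up.
        spines = math.ceil(leaf_uplinks / half)
        supers = math.ceil(spines * half / radix)
    return {"leaf": leaves, "spine": spines, "super_spine": supers,
            "total": leaves + spines + supers}


print(switch_counts(2048, 64))     # two-tier example
print(switch_counts(8192, 64))     # three-tier example
```

Mixed-radix designs, such as 128-port leaves under 64-port spines, need the same port-counting argument applied with a separate radix per tier.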

GPU Performance Modeler

Model the compute throughput and HBM bandwidth for Blackwell and Hopper clusters.

Optical Transceiver Selection and Cable Plant Engineering

The optical transceiver and cable plant represent 30-50% of the total fabric cost in a large AI cluster, yet they are frequently treated as an afterthought in topology design. The choice between Direct Attach Copper (DAC), Active Optical Cables (AOC), and pluggable optical transceivers (QSFP, OSFP, QSFP-DD) has far-reaching implications for power consumption, signal integrity, cable management density, and operational reliability. Each technology occupies a different point on the cost-versus-reach trade-off curve, and selecting the wrong option for a given reach and data rate can result in link failures that are notoriously intermittent and difficult to diagnose.

DAC cables are the lowest-cost option (approximately $50-150 per 400G link) and consume zero additional power beyond the drive current from the host port. Their reach limitation is severe: at 400G using PAM4 signaling, a passive DAC is limited to approximately 2.5-3 meters before signal integrity degrades below the threshold for acceptable bit error rate (BER) even with Reed-Solomon forward error correction (RS-FEC). Active DACs, which include a small linear amplifier in the cable assembly, extend this reach to approximately 5 meters at a 20-30% cost premium. The practical implication for fabric designers is that DACs are viable only for in-rack server-to-leaf connections where the cable run stays within the same rack or between adjacent racks in the same row. Any cable that crosses a cold aisle or runs to a different row must use optical media.

Active Optical Cables (AOCs) provide a middle-ground option. The transceivers are permanently attached to a fiber cable assembly, eliminating the need for separate optics procurement and field cleaning of optical connectors. AOCs at 400G reach 30-100 meters and cost approximately $300-600 per link depending on length. They consume 3-5 watts per end beyond the host port power, compared to 8-12 watts for standard pluggable optics. The primary disadvantage of AOCs is their lack of field-serviceability: if a connector is damaged or a fiber breaks, the entire cable assembly must be replaced — not just a patch cord. In high-density fabrics with thousands of links, the mean time to repair a damaged AOC is 10-20x longer than replacing a standard LC duplex patch cord, making AOCs a poor choice for any cable run that passes through high-traffic walkways or cable trays that are frequently accessed.

Pluggable optical transceivers (QSFP56-DD, OSFP, QSFP-DD800) combined with standard single-mode fiber (SMF) patch cords remain the gold standard for production AI fabrics. At 400G, these transceivers use 4-lane or 8-lane PAM4 modulation over single-mode fiber, with duplex LC connectors for FR4 and LR4 and MPO connectors for parallel-fiber DR4, achieving reaches of 500 meters to 10 kilometers depending on the transceiver class (DR4, FR4, LR4; short-reach SR-class modules use multimode fiber instead and reach roughly 100 meters). The cost per link ranges from $800-2,500 depending on the reach and vendor. The key operational advantage is modularity: a damaged patch cord can be replaced in minutes without touching the transceivers, and transceivers can be moved between ports for troubleshooting or capacity rebalancing. The key operational challenge is optical connector contamination: a single speck of dust on a fiber end-face can cause back-reflection and insertion loss that degrade the signal-to-noise ratio across the entire PAM4 link, manifesting as uncorrectable FEC errors that appear randomly and disappear when the connector is re-seated. Every optical connection in a high-performance AI fabric must be inspected with a fiber microscope and cleaned with a click-cleaner before insertion. Fabrics where this cleaning protocol is not enforced consistently see 5-15x higher link failure rates.
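The reach limits discussed in this section can be condensed into a rule-of-thumb selector. This is a sketch, not a procurement policy: the thresholds are the approximate figures quoted in this article (3 m passive DAC at 400G, 2 m at 800G, 5 m active DAC at 400G, 100 m AOC), and the function name is illustrative. Real selection also weighs serviceability, power, and cost.

```python
def pick_media(run_m: float, speed_gbps: int = 400) -> str:
    """Pick the cheapest medium whose quoted reach covers the cable run."""
    passive_dac_limit_m = 3.0 if speed_gbps <= 400 else 2.0
    if run_m <= passive_dac_limit_m:
        return "passive DAC"
    if speed_gbps <= 400 and run_m <= 5.0:
        return "active DAC"          # reach figure quoted for 400G only
    if run_m <= 100.0:
        return "AOC"
    return "pluggable optics + SMF"


for run in (2, 4, 50, 500):
    print(run, "m ->", pick_media(run))
```

Runs that pass through frequently accessed trays may still justify pluggable optics at short distances, because a damaged patch cord is far cheaper to replace than a whole AOC assembly.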

The cable management density at 800G and 1.6T per port presents an additional physical engineering challenge. An OSFP connector at 800G occupies approximately the same panel space as a QSFP at 400G, but each 800G port requires two fibers (one transmit, one receive), meaning a 64-port switch at 800G requires 128 fiber strands terminated in duplex LC connectors on the switch faceplate. In a 10,000-port AI fabric, the total fiber count exceeds 20,000 individual strands — enough fiber to stretch for kilometers when laid end to end, all concentrated in cable trays above the switch rows. Effective cable management requires structured cabling with modular cassette-based fiber panels that allow individual links to be patched and documented without disturbing adjacent connections, combined with a cable labeling scheme that encodes the source switch, port number, and destination leaf in a machine-readable format (QR or barcode) to enable automated cable verification during deployment and troubleshooting.

Mathematical models derived from standard engineering protocols. Not for human safety critical systems without redundant validation.

Adaptive Routing in Dragonfly+ Topologies: Deadlock Avoidance and Load-Balanced Path Selection

Dragonfly+ topology, an evolution of the original Dragonfly (Kim et al., 2008), organizes switches into groups with all-to-all connectivity within a group and selective inter-group links. A Dragonfly+ configuration with G groups, each containing S switches, and each switch providing P endpoint ports, supports N = G × S × P endpoints. The all-to-all intra-group connectivity ensures that any two switches within the same group communicate in a single hop, while inter-group communication requires one or two additional hops. The critical routing challenge in Dragonfly+ is that the minimal path between two endpoints in different groups traverses exactly three links (source switch → local group adapter → remote group adapter → destination switch), but there are multiple minimal-path choices because each group has A = S inter-group links (one per switch in the group acting as a group adapter). Choosing the best path among these A alternatives, the adaptive routing problem, determines the fraction of traffic that achieves minimal latency versus the fraction that suffers non-minimal detours due to congestion or path conflicts.

The progressive vs. conservative adaptive routing dichotomy defines the path selection strategy. Progressive adaptive routing evaluates the local output port availability at each hop and selects the least-congested port at that instant. At the source switch, the router checks all A inter-group links to the destination group and selects the link with the shortest occupied output queue depth. If the selected link's queue depth exceeds a threshold (typically 60-70% of the per-port shared buffer), the router declares the minimal path "congested" and chooses a non-minimal path: it forwards the packet to another switch within the same group (using the intra-group all-to-all links), which then forwards to a third group, and finally to the destination group via inter-group links. This non-minimal path consumes 5 hops instead of 3, adding 40-60% to the per-packet latency, but it avoids the congested minimal path and thereby prevents head-of-line blocking in the source switch's output buffer. The threshold for switching from minimal to non-minimal routing is the single most important tunable parameter in Dragonfly+ adaptive routing: a threshold that is too low causes excessive non-minimal routing (1.5-2.0× average hop count), while a threshold that is too high causes congestion trees that propagate across the fabric and collapse overall throughput by 30-45% under adversarial traffic patterns.

The deadlock avoidance mechanism in Dragonfly+ relies on virtual channel (VC) separation of minimal and non-minimal traffic. The Cray Slingshot interconnect (used in Frontier and El Capitan) implements three virtual channels per physical port: VC0 for minimal-path request packets, VC1 for non-minimal-path request packets, and VC2 for response packets. The separation ensures that a non-minimal packet blocked on a congested inter-group link does not block a minimal packet behind it in the same physical port's buffer — because they are in different VCs with independent buffering. The number of VCs required to guarantee deadlock freedom is determined by the maximum number of allowed non-minimal turns: each time a packet is diverted to a non-minimal path, it may transition from one VC to another, and the VC assignment must be acyclic to prevent buffer-level deadlock. For the Dragonfly+ topology where non-minimal paths can include up to T additional hops (T = 2 for a 2-hop detour), the minimum number of VCs is T + 1 = 3. The Intel OmniPath and Mellanox HDR InfiniBand implementations use 4 VCs for Dragonfly+ routing to support up to 3-hop non-minimal detours while maintaining deadlock freedom under all traffic patterns. Our fabric topology builder includes a Dragonfly+ Adaptive Routing Modeler that simulates the minimal vs. non-minimal path selection probability as a function of the congestion threshold, per-VC buffer allocation, and traffic matrix (uniform random, bit complement, neighbor, or all-to-all collective patterns). The output includes the expected hop count distribution, the maximum throughput relative to the non-blocking fat-tree baseline, and the VC buffer allocation that minimizes tail latency for the selected traffic pattern.

Load-balanced Dragonfly+ routing with Valiant's algorithm, also known as Valiant Load Balancing (VLB) and adapted from Valiant's 1982 randomized routing for hypercubes, selects the intermediate group uniformly at random among all G groups (including the source and destination groups). The packet is first routed from the source group to the intermediate group (one inter-group hop), then from the intermediate group to the destination group (second inter-group hop). Because the intermediate group is selected uniformly at random, the inter-group link load is balanced in expectation regardless of the traffic pattern, so no single adversarial pattern can concentrate load on a few links. VLB's cost is two extra hops (5 hops total for Dragonfly+: source switch → local group adapter → intermediate group → destination group adapter → destination switch), increasing the average latency by 50-67% compared to the minimal 3-hop path. In practice, Dragonfly+ deployments use a hybrid threshold-adaptive routing with VLB: 90% of packets take the minimal adaptive path, and 10% are sent via VLB-based random intermediate routing to drain residual imbalance in the inter-group link utilization. This 90/10 split produces near-optimal load balance (per-link utilization within 5% of the mean) while keeping the average hop count at approximately 3.2 (0.9 × 3 + 0.1 × 5, versus 3.0 for pure minimal routing and 5.0 for pure VLB). Our Dragonfly+ modeler implements this hybrid approach and recommends the optimal adaptive-routing/VLB split ratio based on the G, S, P configuration and the estimated percentage of all-to-all collective traffic in the workload.
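The hop-count figure for the hybrid scheme is a weighted average, and the VLB step itself is a uniform random draw. The sketch below illustrates both; the function names are illustrative, and the 3-hop and 5-hop path lengths are the ones stated in the text.

```python
import random


def expected_hops(minimal_fraction: float, minimal_hops: int = 3,
                  vlb_hops: int = 5) -> float:
    """Average hop count when a fraction of packets takes the minimal path
    and the remainder is routed through a random intermediate group."""
    return minimal_fraction * minimal_hops + (1 - minimal_fraction) * vlb_hops


def pick_intermediate_group(num_groups: int, rng: random.Random) -> int:
    """VLB step: choose the intermediate group uniformly at random
    (source and destination groups included, as described above)."""
    return rng.randrange(num_groups)


print(round(expected_hops(0.9), 2))          # 3.2 for the 90/10 split
rng = random.Random(0)                        # seeded for repeatability
print(pick_intermediate_group(16, rng))
```

Sweeping `minimal_fraction` from 1.0 down to 0.0 traces the trade-off between the 3.0-hop minimal extreme and the 5.0-hop pure-VLB extreme.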
