The ECMP Collision Problem.

Standard IP networks use **ECMP (Equal-Cost Multi-Path)** routing. The switch hashes each packet's header fields (source/destination IP, ports), so every packet of a flow is pinned to the same link for the flow's lifetime. If two heavy AI flows hash to the same link, that link saturates while other equal-cost links sit underused or completely idle.

This is the **Elephant Flow Problem**. In a GPU cluster, practically every flow is an elephant: long-lived and high-bandwidth. Static hashing in a Fat-Tree topology leads to "Hot Paths," causing queuing delays and packet drops that ultimately throttle the collective operations in each training pass.
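To make the collision mechanism concrete, here is a minimal Python sketch of static flow hashing. It is an illustration under stated assumptions, not how any particular switch works: CRC32 stands in for the vendor-specific hash, and the addresses, ports, and the `ecmp_link` and `link_loads` helpers are hypothetical.

```python
import zlib
from collections import Counter

def ecmp_link(src_ip, dst_ip, src_port, dst_port, num_links):
    """Pick an uplink by hashing flow header fields.

    CRC32 is an assumed stand-in for the switch's real hash function.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % num_links

def link_loads(flows, num_links):
    """Return the number of flows pinned to each link."""
    counts = Counter(ecmp_link(*flow, num_links) for flow in flows)
    return [counts.get(i, 0) for i in range(num_links)]

# Eight hypothetical GPU-to-GPU flows over eight equal-cost uplinks.
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 50000 + i, 4791) for i in range(8)]
loads = link_loads(flows, 8)
print("flows per link:", loads)
print("busiest link carries", max(loads), "flows;", loads.count(0), "links idle")
```

Any link that carries two or more flows is a hot path, and any link with a count of zero is capacity the hash left unused. Because the mapping is deterministic, rerunning with the same flows yields the same imbalance.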

Ethernet ECMP

  • Static Flow Hashing (Deterministic)
  • Risk of Hash Polarization/Hot Spots
  • Cannot Rebalance During a Flow

Adaptive Routing

  • Dynamic Packet-Level Spraying
  • Near-Perfect Load Equilibrium
  • Routes Around Failed Switch Links
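The contrast between the two approaches can be sketched with a toy model. The Python below is a simplified illustration, not a model of any vendor's adaptive routing: it assumes a uniform random hash for ECMP and an idealized "send each packet to the least-loaded link" policy for adaptive routing, and the function names and parameters are hypothetical.

```python
import random

def static_hash_loads(num_flows, packets_per_flow, num_links, rng):
    """Per-flow ECMP: every packet of a flow goes to one link chosen up front."""
    loads = [0] * num_links
    for _ in range(num_flows):
        link = rng.randrange(num_links)  # models a uniform hash; fixed for the flow
        loads[link] += packets_per_flow
    return loads

def adaptive_loads(num_flows, packets_per_flow, num_links):
    """Idealized packet-level spraying: each packet takes the least-loaded link."""
    loads = [0] * num_links
    for _ in range(num_flows * packets_per_flow):
        link = min(range(num_links), key=loads.__getitem__)
        loads[link] += 1
    return loads

rng = random.Random(0)  # fixed seed so the run is repeatable
static = static_hash_loads(16, 1000, 16, rng)
adaptive = adaptive_loads(16, 1000, 16)

mean = sum(static) / len(static)
print("static   max/mean load:", max(static) / mean)
print("adaptive max/mean load:", max(adaptive) / mean)
```

A max/mean ratio of 1.0 means perfectly even load; anything above it marks a hot link. Real adaptive routing decides from congestion signals such as queue depth rather than a global packet count, and the receiver has to tolerate out-of-order arrival, neither of which this sketch models.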

SHARPv3 In-Network Computing.

In-Network Computing (SHARP) takes routing a step further. Instead of just moving data, the **Switch Fabric** itself performs the All-Reduce operation. It collects the gradients from eight GPUs, sums them in the switch's ASIC, and broadcasts the result back. Each GPU sends its gradients once and receives the result once, which cuts the data each GPU puts on the network to roughly **half** of what a ring All-Reduce requires.
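The reduction and the traffic saving can be sketched as follows. This is a host-side Python illustration of the idea, not SHARP's actual protocol or API. The comparison baseline is an assumption: a standard ring All-Reduce, in which each of N GPUs sends 2(N-1)/N times the buffer size. All function names are hypothetical.

```python
def switch_allreduce(gradients):
    """Sum equal-length gradient vectors element-wise (the switch's job),
    then hand one copy of the result back to every GPU."""
    total = [sum(values) for values in zip(*gradients)]
    return [list(total) for _ in gradients]

def ring_bytes_sent_per_gpu(size_bytes, num_gpus):
    """Ring All-Reduce: reduce-scatter plus all-gather, each (N-1)/N of the buffer."""
    return 2 * (num_gpus - 1) / num_gpus * size_bytes

def in_network_bytes_sent_per_gpu(size_bytes):
    """In-network reduction: each GPU sends its buffer to the switch once."""
    return size_bytes

# Eight GPUs, each holding a 4-element gradient filled with its own rank.
grads = [[float(rank)] * 4 for rank in range(8)]
result = switch_allreduce(grads)
print("reduced gradient on GPU 0:", result[0])

ratio = in_network_bytes_sent_per_gpu(1e9) / ring_bytes_sent_per_gpu(1e9, 8)
print("bytes sent per GPU relative to ring:", round(ratio, 3))
```

For eight GPUs the ratio works out to 8/14, about 0.57; it approaches one half as the GPU count grows, which is where the "half" figure comes from under this assumed baseline.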


