InfiniBand SM & Routing Modeler
A precision simulator for high-performance fabric management. Calculate LFT/MFT requirements and model SM sweep intervals for hyperscale clusters.
Subnet Manager Scaling (example: 128 switches × 8 ports per switch)
LID Space Usage: 2.0%
Routing Entries: 16,384
Failover Time: 26s
"Large-scale fabrics benefit from hierarchical SM configurations for faster convergence."
1. The Central Brain: Understanding SM Authority
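In InfiniBand, a switch has no usable forwarding state until the Subnet Manager (SM) programs it: the SM discovers the topology, assigns addresses, and writes each switch's forwarding tables. In that sense the fabric is essentially Software Defined Networking (SDN) implemented directly in hardware.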
Address Space Physics
When a new node is plugged in, the SM assigns its port a **Local Identifier (LID)**. The assignment itself happens once per port, but its cost is fabric-wide: the SM must then push an updated **Linear Forwarding Table (LFT)** to every other switch in the fabric so that each one knows which output port reaches the new LID.
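The address-space arithmetic above can be sketched in a few lines. This is a rough estimate under stated assumptions, not the calculator's own formula: it assumes one LID per switch, `2**lmc` LIDs per endpoint port, dense LID assignment starting at 1, and that every switch receives its whole LFT in 64-entry blocks. The function name `lid_budget` and its result keys are hypothetical.

```python
import math

UNICAST_LID_MIN = 0x0001
UNICAST_LID_MAX = 0xBFFF   # 0xC000-0xFFFE are multicast, 0xFFFF is the permissive LID
LFT_BLOCK_ENTRIES = 64     # LFT entries carried per forwarding-table block


def lid_budget(switches, endpoints, lmc=0):
    """Estimate LID consumption and LFT size for a single subnet.

    Assumption: one LID per switch and 2**lmc LIDs per endpoint port,
    assigned densely from LID 1 upward.
    """
    lids_needed = switches + endpoints * (2 ** lmc)
    unicast_space = UNICAST_LID_MAX - UNICAST_LID_MIN + 1
    if lids_needed > unicast_space:
        raise ValueError("fabric exceeds the unicast LID space of one subnet")

    # The LFT is indexed by LID starting at 0, so the table must cover
    # entries 0..top_lid, where top_lid == lids_needed under dense assignment.
    blocks_per_switch = math.ceil((lids_needed + 1) / LFT_BLOCK_ENTRIES)

    return {
        "lids_needed": lids_needed,
        "unicast_space": unicast_space,
        "lid_space_usage": lids_needed / unicast_space,
        "lft_blocks_per_switch": blocks_per_switch,
        # Upper bound on block writes if every switch is fully reprogrammed.
        "total_lft_block_writes": blocks_per_switch * switches,
    }


if __name__ == "__main__":
    budget = lid_budget(switches=128, endpoints=128 * 8)
    for key, value in budget.items():
        print(f"{key}: {value}")
```

Raising `lmc` multiplies the endpoint LIDs by a power of two, which is why multipath configurations consume the unicast space much faster than the endpoint count alone suggests.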
2. The 100ms Sweep: Scaling for Micro-Failover
In traditional HPC, a sweep interval of 30 seconds was acceptable. In AI training, where a 30-second stall can cost thousands of dollars, we need **Heavy Sweep Optimization**: the SM has to react to a failure when it happens rather than at the next periodic sweep.
Trap-Based Discovery
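Instead of polling the fabric on a timer, the SM waits for an IB trap, an asynchronous notice that a switch sends when one of its links changes state. It then performs a heavy sweep (rediscovery and rerouting) of only the affected branch.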
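One way to picture "only the affected branch" is a bounded breadth-first search from the switch that raised the trap. The sketch below is illustrative only: the function `affected_branch`, the `max_hops` radius, and the adjacency-dictionary representation are all assumptions made for this example, not the behavior of any particular SM implementation.

```python
from collections import deque


def affected_branch(adjacency, trap_switch, max_hops=1):
    """Return the set of switches within max_hops of the trap source.

    adjacency: dict mapping a switch name to an iterable of neighbor names.
    Assumption: the "affected branch" is approximated by hop distance.
    """
    distance = {trap_switch: 0}
    queue = deque([trap_switch])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the sweep radius
        for peer in adjacency.get(node, ()):
            if peer not in distance:
                distance[peer] = distance[node] + 1
                queue.append(peer)
    return set(distance)


if __name__ == "__main__":
    # Hypothetical two-level fabric: 2 spines, 4 leaves.
    leaves = ["leaf1", "leaf2", "leaf3", "leaf4"]
    fabric = {"spine1": leaves, "spine2": leaves}
    for leaf in leaves:
        fabric[leaf] = ["spine1", "spine2"]

    # A link-state trap from leaf1 limits rediscovery to leaf1 and its spines.
    print(sorted(affected_branch(fabric, "leaf1", max_hops=1)))
```

The point of the bounded search is that sweep cost scales with the size of the neighborhood around the failure rather than with the size of the fabric.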
Adaptive Routing Updates
By coordinating SM updates with ASIC **Adaptive Routing**, the fabric can reroute traffic in hardware (ns timescale) while the SM works on the long-term (ms/s) topology update.
3. Topology Constraints: Fat-Tree vs. DragonFly
The SM's routing engine must be configured for the specific physical layout of the cluster.
Routing Logic
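1. **Fat-Tree**: Predictable pathing that is non-blocking when the tree is not oversubscribed. Requires 'Up/Down' routing, which forbids a path from turning upward again once it has gone down, to prevent credit loops (deadlock). The SM can calculate this very quickly even at 32K nodes.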
2. **DragonFly**: Low-diameter, but higher path-finding complexity. The SM must balance LFT entries to avoid hot-spots in the inter-group links.
3. **3D Torus**: Highly efficient for physical neighbor communication (e.g. climate modeling), but suffers from slow reconvergence if a central link is severed.
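The Up/Down rule mentioned for fat-trees can be illustrated with a small sketch: rank every switch by its hop distance from the root (spine) switches, then reject any path that climbs toward the roots after it has already descended. This is a simplified teaching example, not a routing engine; the helper names `rank_switches` and `is_updown_legal` are hypothetical, and ties between equal-rank switches are broken by name as a stand-in for an ID comparison.

```python
from collections import deque


def rank_switches(adjacency, roots):
    """Breadth-first rank: roots get 0, their neighbors 1, and so on."""
    rank = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for peer in adjacency[node]:
            if peer not in rank:
                rank[peer] = rank[node] + 1
                queue.append(peer)
    return rank


def is_updown_legal(path, rank):
    """True if the path never takes an 'up' hop after a 'down' hop."""
    gone_down = False
    for src, dst in zip(path, path[1:]):
        # A hop is 'up' when it moves to a lower (rank, name) pair.
        going_up = (rank[dst], dst) < (rank[src], src)
        if going_up and gone_down:
            return False  # down-then-up turn: could close a credit loop
        if not going_up:
            gone_down = True
    return True


if __name__ == "__main__":
    # Hypothetical two-level fat-tree: 2 spines, 3 leaves.
    leaves = ["l1", "l2", "l3"]
    tree = {"s1": leaves, "s2": leaves}
    for leaf in leaves:
        tree[leaf] = ["s1", "s2"]

    ranks = rank_switches(tree, roots=["s1", "s2"])
    print(is_updown_legal(["l1", "s1", "l2"], ranks))  # up, then down
    print(is_updown_legal(["s1", "l1", "s2"], ranks))  # down, then up
```

Because the check needs only a rank per switch and a single pass over each path, it hints at why the fat-tree case stays cheap for the SM as node counts grow.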
4. SM Forensics: Identifying Routing Stalls
Monitoring the health of the Subnet Manager is the first step in cluster troubleshooting.
