Data vs. Model Parallelism: Networking Impact on Scale

The Scaling Hierarchy.

Modern LLMs (like GPT-4 or Llama 3) have trillions of weights. These literal "monsters" cannot fit on the 80GB memory of a single H100 GPU. Even the 141GB of an H200 is insufficient for the training gradients and optimizer states.

Distributed training solves this by splitting the workload. However, each splitting strategy imposes a different burden on your physical network fabric—from "Burst-Heavy" data movement to "Latency-Critical" weight sharing.

Data Parallelism (DP)

Every GPU has a FULL copy of the model. Each GPU handles a DIFFERENT chunk of data.

BWS-Heavy (All-Reduce)

Tensor Parallelism (TP)

A SINGLE LAYER is split across multiple GPUs. Operations (like MatMul) are split.

Latency-Critical (NVLink)

Pipeline Parallelism (PP)

Layers are sequential. GPU 1 does layer 1-5, GPU 2 does 6-10.

Point-to-Point (P2P)

The All-Reduce Stress Test.

In Data Parallelism, at the end of every forward and backward pass, GPUs must aggregate their gradients. This is usually done via a **Cuda All-Reduce** operation. If your network isn't "Non-Blocking" (i.e. Fat-Tree), the congestion during this collective phase will cause a "Communication Wall," where GPUs sit idle waiting for the fabric.

Throughput Modeler.

Predict the communication overhead of your 3D Parallelism strategy (TP x PP x DP) before you spend millions on H100 cluster time.

Bisection Bandwidth Calculus for 3D Parallelism Topologies

The effective throughput of a distributed training run is determined by the interaction between the parallelism strategy and the physical network topology. Tensor Parallelism (TP) requires the highest bandwidth and lowest latency, so it must stay within a single NVLink domain. Pipeline Parallelism (PP) generates point-to-point activation traffic with bounded buffer sizes determined by the micro-batch count. Data Parallelism (DP) produces bursty All-Reduce traffic that scales with the global batch size and model dimension.

The bisection bandwidth constraint is most acutely felt during the All-Reduce phase. For a model with M parameters trained with DP degree D, each gradient tensor has size 4M/D bytes (assuming FP32 gradients). The total data moved per All-Reduce step under a ring algorithm is 2(D-1) x 4M/D bytes. For a 175B parameter model with D=1024, this equals approximately 1.4 TB of traffic that must traverse the fabric bisection bandwidth in under one training step communication window.

In a Dragonfly+ topology with 4,096 GPUs organized into 64 groups (64 GPUs per group) with 8 x 400 Gbps global links per group, the global bisection bandwidth is 64 x 8 x 50 GB/s = 25.6 TB/s. The minimum All-Reduce time is therefore 1.4 TB / 25.6 TB/s = 55 ms. If the training step compute time is 200 ms, the communication overhead is 27.5%, matching empirically observed scaling efficiencies for GPT-class models at this node count.

Practical tuning involves adjusting the micro-batch size to overlap communication with computation. Megatron-LM uses overlapped All-Reduce, where gradient computation for one micro-batch overlaps with All-Reduce of previous gradients. The optimal overlap ratio requires the compute time per micro-batch to be greater than or equal to the All-Reduce time of the gradient bucket. The Tensor Parallelism degree directly constrains the minimum possible All-Reduce chunk size because each TP rank holds only a partition of the weight matrix, reducing the per-rank gradient volume proportionally.

Expert Parallelism Network Topology Sensitivity

Expert Parallelism (EP) — the sharding of MoE expert parameters across GPUs — creates a fundamentally different network traffic pattern than Data Parallelism or Tensor Parallelism. In EP, each GPU holds a subset of the expert weights, and tokens must be routed to the GPU hosting their assigned expert via an **All-to-All** collective. Unlike the ring-based All-Reduce of Data Parallelism where traffic patterns are predictable and uniform, All-to-All is a permutation-based pattern where every GPU sends a different amount of data to every other GPU. This asymmetry makes EP extremely sensitive to the network topology and load balancing scheme.

The key metric for EP network sensitivity is the **Permutation Matrix Sparsity** — the fraction of GPU pairs that must exchange data in a given training step. In a 64-GPU expert parallel group with 128 experts, each GPU hosts 2 experts and processes 64 tokens per step. Each token is routed to one of 128 experts, creating a 64 x 64 permutation matrix with an average density of 3.1% (each GPU sends to approximately 2 of 63 peers). The All-to-All collective must route these 64 tokens per GPU across the fabric, with each token representing a 2 KB activation vector. The total data volume per step is 64 GPUs x 64 tokens x 2 KB = 8 MB — trivial bandwidth-wise but latency-critical because the All-to-All is a barrier operation.

The latency sensitivity arises from the **Straggler Effect** in All-to-All. The collective completes only when the last GPU has received all its tokens. In a Dragonfly topology with 4 global links per group, a GPU that must send 10 tokens to a GPU in a different group competes for the limited global bandwidth. If the global link is congested with other All-to-All traffic, that GPU's token delivery is delayed, stalling the entire EP group. The straggler penalty in a 64-GPU EP group on a Dragonfly network can reach 50 microseconds over the ideal 5-microsecond All-to-All completion time — a 10x slowdown that directly increases the training step time.

The mitigation is **Topology-Aware Expert Placement** — placing experts on GPUs in the same network group to minimize cross-group traffic. DeepSpeed's EP scheduler assigns expert replicas such that 80% of token routing stays within the same leaf switch domain, reducing global link utilization by 75%. The scheduler uses a **Routing Affinity Matrix** built from the router's historical expert selection patterns to predict future routing and place experts accordingly. In production deployment on a 4,096-GPU cluster, topology-aware EP placement reduces the All-to-All completion time from 55 microseconds to 12 microseconds, improving the overall training throughput by 8% for MoE models with 128+ experts.

Sharding
Models.

Parallelism: Choosing Your Bottleneck