Parallelism: Choosing Your Bottleneck
The Scaling Hierarchy.
Modern LLMs (like GPT-4 or Llama 3) have hundreds of billions to trillions of parameters. These models cannot fit in the 80GB of memory on a single H100 GPU; even the 141GB of an H200 falls far short once gradients and optimizer states are added on top of the weights.
Distributed training solves this by splitting the workload. However, each splitting strategy imposes a different burden on your physical network fabric—from "Burst-Heavy" data movement to "Latency-Critical" weight sharing.
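To see why a single GPU is not enough, here is a back-of-envelope memory calculation. It assumes a common (but not universal) mixed-precision Adam layout: fp16 weights and gradients plus fp32 master weights and two fp32 Adam moments, i.e. 16 bytes per parameter; the function names are illustrative.

```python
# Rough GPU memory needed for training states, per parameter.
# Assumed layout: fp16 weights (2) + fp16 grads (2) + fp32 master
# weights (4) + fp32 Adam moments m and v (4 + 4) = 16 bytes/param.
def training_bytes_per_param() -> int:
    return 2 + 2 + 4 + 4 + 4

def training_memory_gb(num_params: float) -> float:
    """Memory for weights, gradients, and optimizer states, in GB."""
    return num_params * training_bytes_per_param() / 1e9

# A 70B-parameter model needs ~1,120 GB for training states alone,
# before activations -- far beyond one 80 GB H100 or 141 GB H200.
print(f"{training_memory_gb(70e9):,.0f} GB")  # -> 1,120 GB
```

Activation memory comes on top of this and scales with batch size and sequence length, which only widens the gap.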
Data Parallelism (DP)
Every GPU has a FULL copy of the model. Each GPU handles a DIFFERENT chunk of data.
Tensor Parallelism (TP)
A SINGLE LAYER is split across multiple GPUs. Operations (like MatMul) are split.
Pipeline Parallelism (PP)
Layers are assigned sequentially. GPU 1 runs layers 1-5, GPU 2 runs layers 6-10, and activations flow between them.
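Tensor parallelism is the least intuitive of the three, so here is a minimal sketch of the idea in pure Python: one layer's weight matrix is split column-wise across two "devices", each computes a partial output, and concatenating the partials reproduces the unsplit layer. Shapes and values are illustrative.

```python
# Tensor parallelism in miniature: split a layer's weight matrix
# column-wise across two "GPUs" and reassemble the partial outputs.

def matmul(A, B):
    """Naive matrix multiply: A is m x k, B is k x n."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

x = [[1.0, 2.0], [3.0, 4.0]]           # activations (batch of 2)
W = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 2.0]]             # full 2 x 4 weight matrix

W0 = [row[:2] for row in W]            # "GPU 0" holds the left columns
W1 = [row[2:] for row in W]            # "GPU 1" holds the right columns

y0 = matmul(x, W0)                     # partial output on device 0
y1 = matmul(x, W1)                     # partial output on device 1
y = [r0 + r1 for r0, r1 in zip(y0, y1)]  # all-gather: rejoin the columns

assert y == matmul(x, W)               # matches the unsplit layer
```

In a real system the concatenation is a collective (an all-gather) over a fast interconnect, which is why TP is the most latency-sensitive of the three strategies.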
The All-Reduce Stress Test.
In Data Parallelism, after every backward pass the GPUs must aggregate their gradients. This is usually done via an **NCCL All-Reduce** operation. If your network isn't "Non-Blocking" (i.e. a full-bisection Fat-Tree), the congestion during this collective phase will cause a "Communication Wall," where GPUs sit idle waiting for the fabric.
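The collective itself is simple to state: every replica ends up with the element-wise average of all replicas' gradients. A minimal pure-Python simulation (real systems use NCCL over NVLink or InfiniBand; the function name here is illustrative):

```python
# Simulated gradient all-reduce for data parallelism: each "GPU"
# holds gradients from its own data shard; after the collective,
# every replica holds the same element-wise average.

def all_reduce_mean(per_gpu_grads):
    n = len(per_gpu_grads)
    summed = [sum(vals) for vals in zip(*per_gpu_grads)]
    avg = [s / n for s in summed]
    return [list(avg) for _ in range(n)]  # identical result on every GPU

grads = [
    [1.0, 4.0, -2.0],   # gradients computed on GPU 0's shard
    [3.0, 0.0,  2.0],   # gradients computed on GPU 1's shard
]
reduced = all_reduce_mean(grads)
assert reduced[0] == reduced[1] == [2.0, 2.0, 0.0]
```

Note what this implies for the fabric: the traffic volume scales with the full model size, every GPU participates at once, and training cannot proceed until the slowest link finishes, which is exactly why a blocking fabric turns this phase into a wall.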
