GPU Performance Benchmarking: H100 vs. B200 vs. Blackwell Engineering

The Compute Gap: Beyond TFLOPS

When NVIDIA announced the B200 (Blackwell) with 20 PFLOPS of FP4 performance, the industry focus shifted entirely to raw compute. However, for a network architect, the compute throughput is only half the story. The real "Benchmark" of an AI system is its **Compute-to-Network Ratio**.

If a chip processes 2.2x more data per second but the scale-out optical bandwidth increases by only 20%, the GPU spends a larger share of each step idle, waiting for gradient synchronization. This efficiency drop defines the economics of modern LLM training.
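To make the ratio concrete, the sketch below models one training step as compute time plus non-overlapped communication time. The 2.2x and 1.2x factors come from the paragraph above; the baseline split of 80 ms compute and 20 ms communication, and the helper name `comm_fraction`, are illustrative assumptions rather than measurements.

```python
def comm_fraction(compute_ms: float, comm_ms: float) -> float:
    """Share of a step spent waiting on the network, assuming no overlap."""
    return comm_ms / (compute_ms + comm_ms)

# Hypothetical baseline step: 80 ms of compute, 20 ms of gradient sync.
base_compute_ms, base_comm_ms = 80.0, 20.0

# Next generation: compute 2.2x faster, network only 1.2x faster.
new_compute_ms = base_compute_ms / 2.2
new_comm_ms = base_comm_ms / 1.2

before = comm_fraction(base_compute_ms, base_comm_ms)
after = comm_fraction(new_compute_ms, new_comm_ms)
print(f"network share of step: {before:.1%} -> {after:.1%}")
```

Under these assumptions the step gets shorter overall, but the network's share of it grows from about 20% to about 31%.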

GPU ROOFLINE PERFORMANCE MODELER

[Interactive figure: Arithmetic Intensity vs. Hardware Limits. X-axis: Arithmetic Intensity (Ops/Byte), from simple vector ops (low intensity) to matrix multiplication (high intensity). Y-axis: Performance (TFLOPS). Regions: Memory Bound and Peak Compute (1000 TFLOPS). Example reading: a kernel arithmetic intensity of 50 Ops/Byte gives an effective performance of 168 TFLOPS, a hardware efficiency of 17%.]

Memory Wall

The GPU is waiting on HBM3e bandwidth. Arithmetic logic is idle.

Compute Saturated

Hardware is operating at peak TFLOPS. Limited by total CUDA/Tensor Cores.

Design Tip: Modern LLM attention kernels are often **Memory Bound**. Increasing the tile size raises data reuse and therefore arithmetic intensity, which shifts the kernel's operating point rightward on the roofline.

Arithmetic Intensity

The ratio of floating-point operations performed to bytes of data moved to or from memory.

Memory Bandwidth

The speed at which data travels from HBM3e to the GPU compute cores.

Fabric Saturation

When the 800G backend fabric becomes the bottleneck for collective ops.
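The first two definitions combine into the roofline bound: attainable performance is the smaller of peak compute and arithmetic intensity times memory bandwidth. The sketch below is a minimal version of that formula. The 1000 TFLOPS peak and the 50 Ops/Byte operating point are the values shown in the modeler; pairing them with 3.35 TB/s (the H100 HBM figure quoted in this article) is an assumption, and the function name is hypothetical.

```python
def attainable_tflops(intensity_ops_per_byte: float,
                      peak_tflops: float,
                      mem_bw_tb_per_s: float) -> float:
    """Roofline bound: min(peak compute, intensity * memory bandwidth)."""
    # (ops/byte) * (TB/s) = tera-ops/s, so the units already match TFLOPS.
    return min(peak_tflops, intensity_ops_per_byte * mem_bw_tb_per_s)

PEAK_TFLOPS = 1000.0   # peak compute shown in the modeler
HBM_TB_PER_S = 3.35    # assumed: H100 HBM bandwidth quoted in this article

perf = attainable_tflops(50.0, PEAK_TFLOPS, HBM_TB_PER_S)
efficiency = perf / PEAK_TFLOPS
ridge_point = PEAK_TFLOPS / HBM_TB_PER_S  # intensity where the two limits meet
print(f"{perf:.1f} TFLOPS, {efficiency:.0%} of peak, ridge at {ridge_point:.0f} ops/byte")
```

With these inputs the bound is 167.5 TFLOPS, which is consistent with the 168 TFLOPS and 17% efficiency shown in the modeler.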

Anatomy of a Generation Shift

Analyzing the transition from H100 (Hopper) to B200 (Blackwell) requires looking at the **Memory-to-Compute scaling**. While FP8 TFLOPS roughly doubled, the HBM (High Bandwidth Memory) capacity and bandwidth saw an even more aggressive jump to accommodate trillion-parameter models.

Scaling Comparison (Per GPU Unit)

| Metric | B200 | H100 |
|---|---|---|
| FP8 Peak Compute, Dense (TFLOPS) | 4,500 | 1,979 |
| HBM Bandwidth (TB/s) | 8.0 | 3.35 |
| NVLink Bandwidth (GB/s) | 1,800 | 900 |
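The generation-over-generation factors implied by these figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers in the comparison above; the dictionary layout and variable names are illustrative.

```python
# Per-GPU figures from the scaling comparison above: (B200, H100).
SPECS = {
    "fp8_dense_tflops": (4500.0, 1979.0),
    "hbm_tb_per_s": (8.0, 3.35),
    "nvlink_gb_per_s": (1800.0, 900.0),
}

# Generation-over-generation scaling factor for each metric.
scaling = {name: b200 / h100 for name, (b200, h100) in SPECS.items()}

# FLOPs available per byte of HBM traffic (the roofline ridge point).
b200_flops_per_byte = SPECS["fp8_dense_tflops"][0] / SPECS["hbm_tb_per_s"][0]
h100_flops_per_byte = SPECS["fp8_dense_tflops"][1] / SPECS["hbm_tb_per_s"][1]

for name, factor in scaling.items():
    print(f"{name}: {factor:.2f}x")
print(f"ridge point: H100 {h100_flops_per_byte:.1f}, B200 {b200_flops_per_byte:.1f} FLOPs/byte")
```

On these figures HBM bandwidth grows slightly faster than dense FP8 compute (about 2.39x against 2.27x), so the ridge point moves down a little, from roughly 591 to 562.5 FLOPs/byte.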

The Scaling Tax

In a multi-node cluster, GPUs don't work in isolation. They are part of a synchronized machine. The **Amdahl's Law of AI** states that the maximum speedup is limited by the serial part of the task—in our case, the time it takes to synchronize gradients over the network.
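Amdahl's Law itself is short enough to state in code. The sketch below treats non-overlapped gradient synchronization as the serial fraction of each step, which is a simplification; the 5% figure and the GPU counts are hypothetical inputs.

```python
def amdahl_speedup(serial_fraction: float, n_workers: int) -> float:
    """Amdahl's Law: speedup = 1 / (s + (1 - s) / N)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

# Hypothetical: 5% of each step is non-overlapped gradient synchronization.
SYNC_FRACTION = 0.05

for gpus in (8, 64, 1024):
    print(f"{gpus:>5} GPUs -> {amdahl_speedup(SYNC_FRACTION, gpus):.1f}x")

# Upper bound as the GPU count grows without limit: 1 / s.
max_speedup = 1.0 / SYNC_FRACTION
```

With a 5% serial fraction the speedup can never exceed 20x, no matter how many GPUs are added.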

Compute-Bound (Ideal)

The GPU processing time is much larger than the network communication time. This happens when the model parameters fit within HBM and the interconnect bandwidth is high (e.g., local NVLink).

I/O-Bound (Scaling Tax)

The GPUs stall because they are waiting for data from the fabric. As we move to 1.6T Ethernet, the goal is to shift more workloads back toward being compute-bound.
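One common way to estimate which regime a step falls into is to compare compute time with the time for a bandwidth-optimal ring all-reduce, which moves 2(N-1)/N times the gradient size through each GPU's link. The sketch below uses that textbook cost model and ignores latency and compute/communication overlap. The model size, compute time, and per-GPU link speeds are hypothetical inputs, not figures from this article.

```python
def ring_allreduce_seconds(grad_bytes: float, n_gpus: int,
                           link_gb_per_s: float) -> float:
    """Bandwidth term of a ring all-reduce: 2 * (N - 1) / N * S / B."""
    return 2.0 * (n_gpus - 1) / n_gpus * grad_bytes / (link_gb_per_s * 1e9)

def regime(compute_s: float, comm_s: float) -> str:
    """Label a step by whichever phase dominates (no overlap assumed)."""
    return "compute-bound" if compute_s >= comm_s else "I/O-bound"

# Hypothetical inputs: 70e9 parameters with 2-byte gradients, 1024 GPUs,
# 2.0 s of compute per step.
grad_bytes = 70e9 * 2
compute_s = 2.0

# 800 Gb/s and 1.6 Tb/s links expressed in GB/s.
for label, gb_per_s in (("800G", 100.0), ("1.6T", 200.0)):
    comm_s = ring_allreduce_seconds(grad_bytes, 1024, gb_per_s)
    print(f"{label}: sync {comm_s:.2f} s -> {regime(compute_s, comm_s)}")
```

With these made-up inputs, doubling the link speed is enough to move the step from I/O-bound to compute-bound, which is the shift described above.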

Architect's Insight: FP4 Precision

Blackwell introduces the FP4 micro-format. By reducing numerical precision from 8 bits to 4 bits, engineers can roughly double the effective compute throughput and halve the memory bandwidth requirements for inference, provided the model weights can be quantized without unacceptable accuracy loss.
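The bandwidth half of that claim is simple arithmetic: halving the bits per weight halves the bytes that must be streamed from HBM. The sketch below illustrates this for a hypothetical 70-billion-parameter model and the 8 TB/s B200 bandwidth quoted in this article. It ignores the per-block scale factors that real FP4 formats carry, so the true saving is slightly smaller.

```python
def weight_bytes(n_params: float, bits_per_weight: int) -> float:
    """Bytes needed to store n_params weights at the given precision."""
    return n_params * bits_per_weight / 8

def stream_ms(n_bytes: float, hbm_tb_per_s: float) -> float:
    """Time to read n_bytes once from HBM, in milliseconds."""
    return n_bytes / (hbm_tb_per_s * 1e12) * 1e3

N_PARAMS = 70e9       # hypothetical model size
HBM_TB_PER_S = 8.0    # B200 HBM bandwidth quoted in this article

for bits in (8, 4):
    size = weight_bytes(N_PARAMS, bits)
    print(f"FP{bits}: {size / 1e9:.0f} GB, "
          f"{stream_ms(size, HBM_TB_PER_S):.3f} ms per full read")
```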

The Road to Zettascale

Benchmarking AI infrastructure is no longer about single-node peak performance. It is about the holistic efficiency of the **Liquid-Cooled GPU Rack**, the **Non-Blocking Network Layer**, and the **NCCL Optimization libraries**. As we build towards Zettascale clusters, the benchmark metric of truth will be "Model Throughput per Dollar-Energy."

Power Efficiency

Analyzing the TFLOPS/Watt trajectory of H200 vs B200.

Scale-Out Rail

Planning for 8x 800G per node on the Blackwell fabric.

Roofline Model Sensitivity Analysis for Sparse MoE Architectures

The traditional roofline model plots arithmetic intensity (FLOPs per byte of memory traffic) against achievable performance in FLOP/s. For dense transformer models, attention and feed-forward layers both exhibit arithmetic intensity in the range of 50-200 FLOPs/byte. The H100 roofline knee (1,979 FP8 TFLOPS / 3.35 TB/s HBM bandwidth) sits at about 590 FLOPs/byte, so at those intensities dense layers need large batches or long sequences to reach the compute-bound region. Mixture-of-Experts (MoE) architectures make this harder still by replacing dense FFN layers with sparse expert modules activated by a gating network.

In an MoE layer with E experts and top-k routing (typically k=2), each token activates only k/E of the total expert parameters. The effective FLOPs per token decrease proportionally, but the memory traffic required to load expert weights remains constant because the entire set of expert parameters must be resident in HBM. For a model with 64 experts and top-2 routing, the effective arithmetic intensity drops by E/k = 32x for the FFN layers, moving them from compute-bound to firmly memory-bound even with HBM3e at 8 TB/s.
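The scaling argument above reduces to two ratios. The sketch below assumes, as the paragraph does, that expert weight traffic is unchanged while FLOPs per token scale with k/E; the function names are illustrative.

```python
def moe_intensity_drop(num_experts: int, top_k: int) -> float:
    """Factor by which FFN arithmetic intensity falls: E / k."""
    if not 0 < top_k <= num_experts:
        raise ValueError("require 0 < top_k <= num_experts")
    return num_experts / top_k

def active_fraction(num_experts: int, top_k: int) -> float:
    """Share of expert parameters a single token actually uses: k / E."""
    return top_k / num_experts

print(moe_intensity_drop(64, 2))   # 32.0
print(active_fraction(64, 2))      # 0.03125
```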

The roofline knee for Blackwell B200 with 8 TB/s HBM bandwidth occurs at approximately 8,000 TFLOPS / 8 TB/s = 1,000 FLOPs/byte. (Using the 4,500 TFLOPS dense FP8 figure from the scaling comparison instead gives about 560 FLOPs/byte; the conclusions here hold for either value.) Dense attention layers (sequence length 4096, d_model 8192) achieve approximately 4,000 FLOPs/byte for the QKV projection, placing them in the compute-bound region. But the MoE FFN layers with E=64 achieve only about 30 FLOPs/byte, well below the knee and firmly in the memory-bound region. This means the MoE FFN layers are limited by HBM bandwidth, not compute: doubling the TFLOPS provides no benefit without a corresponding increase in memory bandwidth.
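The figures in this paragraph can be reproduced with a simple matrix-multiply traffic model. The sketch below assumes a single d_model x d_model projection over one 4096-token sequence, 1-byte (FP8) operands and outputs, and each matrix moved to or from HBM exactly once; those modeling choices and the function names are assumptions, while the knee inputs and the 30 FLOPs/byte figure are taken from the text.

```python
def gemm_intensity(m: int, k: int, n: int, bytes_per_elem: float) -> float:
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], each matrix moved once."""
    flops = 2 * m * k * n
    traffic = (m * k + k * n + m * n) * bytes_per_elem
    return flops / traffic

def classify(intensity: float, knee: float) -> str:
    """Place a kernel on the roofline relative to the knee."""
    return "compute-bound" if intensity >= knee else "memory-bound"

HBM_TB_PER_S = 8.0
KNEE = 8000.0 / HBM_TB_PER_S  # TFLOPS / (TB/s) = FLOPs/byte, figures from the text

proj = gemm_intensity(4096, 8192, 8192, 1.0)
moe_ffn = 30.0  # FLOPs/byte quoted for the E=64 MoE FFN layers

print(f"projection: {proj:.0f} FLOPs/byte -> {classify(proj, KNEE)}")
print(f"MoE FFN:    {moe_ffn:.0f} FLOPs/byte -> {classify(moe_ffn, KNEE)}")
print(f"MoE FFN ceiling: {moe_ffn * HBM_TB_PER_S:.0f} TFLOPS")
```

Under this model the projection comes out at 4,096 FLOPs/byte, in line with the roughly 4,000 quoted above, while a 30 FLOPs/byte kernel is capped at 240 TFLOPS by 8 TB/s of HBM bandwidth regardless of peak compute.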

The practical implication is stark: for MoE models, the effective throughput is determined by the HBM bandwidth utilization of expert weight loading, not the peak TFLOPS advertised on the datasheet. Expert caching and pre-fetching strategies can mitigate this by overlapping expert weight DMA with the attention computation phase. DeepSpeed-MoE implements a prefetch window of 2-4 experts ahead of the current computation, hiding up to 60% of the expert weight loading latency under the attention compute time.

Power-Capped Benchmarking: Performance per Watt Analysis

Raw TFLOPS figures are a poor basis for AI infrastructure procurement because they ignore the total cost of ownership driven by power consumption. A GPU that delivers 2x the TFLOPS at 3x the power delivers fewer TFLOPS per watt, which lowers total throughput in a power-constrained data center. **Power-Capped Benchmarking** measures performance at a fixed power budget (typically 700W per GPU for H100-class systems), providing a metric that directly translates to per-rack and per-cluster throughput. The key metric is TFLOPS-per-Watt: the ratio of sustained training throughput to average power draw over a full training step.

The benchmark methodology is standardized by MLPerf Power (v3.1). The test runs the full training pipeline (Llama-2-70B, 1024 GPUs, 8 nodes) while measuring GPU power via the NVIDIA Management Library (NVML) at 10ms intervals. The reported metric is **Sustained Training TFLOPS** divided by **Average System Power** (including GPU, HBM, NVLink, and PCIe power). Results for the B200 at its default 1000W TDP show 989 FP8 TFLOPS at 995W average power, yielding 0.994 TFLOPS/W. Power-capping the B200 to 700W reduces throughput to 672 FP8 TFLOPS (68% of the uncapped throughput) at 698W average power, yielding 0.963 TFLOPS/W: a 3% efficiency loss for a 30% power reduction.
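The reported metric is a plain ratio, so the quoted B200 figures can be cross-checked directly. The sketch below uses only the numbers from the paragraph above; it does not read power from hardware, and the function and variable names are illustrative.

```python
def tflops_per_watt(sustained_tflops: float, avg_power_w: float) -> float:
    """Efficiency metric: sustained training TFLOPS / average power (W)."""
    return sustained_tflops / avg_power_w

# B200 operating points quoted above: (sustained FP8 TFLOPS, average watts).
uncapped = (989.0, 995.0)
capped_700w = (672.0, 698.0)

eff_uncapped = tflops_per_watt(*uncapped)
eff_capped = tflops_per_watt(*capped_700w)

efficiency_loss = 1.0 - eff_capped / eff_uncapped
power_saving = 1.0 - capped_700w[1] / uncapped[1]
throughput_kept = capped_700w[0] / uncapped[0]

print(f"{eff_uncapped:.3f} vs {eff_capped:.3f} TFLOPS/W")
print(f"efficiency loss {efficiency_loss:.1%}, power saving {power_saving:.1%}, "
      f"throughput kept {throughput_kept:.1%}")
```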

The efficiency inflection point varies by architecture. H100 reaches its peak TFLOPS/W at 600W (89% of peak TFLOPS at 86% of max power, yielding 0.87 TFLOPS/W). H200 peaks at 650W (91% throughput, 85% power, 0.91 TFLOPS/W). B200's larger die makes it more efficient at lower power states: its efficiency peak is at 700W rather than 800W, because both leakage current and dynamic power in TSMC 4NP scale non-linearly with voltage. A 15% reduction in operating voltage reduces dynamic power by about 28% (dynamic power scales roughly with the square of voltage, and 0.85^2 ≈ 0.72) but reduces clock speed by only 10%.

The optimal power cap for a given cluster depends on the data center's power distribution architecture. If the facility has spare power capacity, running GPUs at peak TDP maximizes total cluster throughput. If power is the bottleneck (e.g., 100 MW facility limit), each GPU should be power-capped at its efficiency peak to maximize TFLOPS per facility watt.
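Two small calculations sit behind these figures: the efficiency gain at a power cap is the throughput fraction divided by the power fraction, and dynamic power scales approximately with voltage squared times clock frequency. The sketch below applies both to the numbers quoted above. The P ~ V^2 * f relation is the standard first-order CMOS approximation, not a measured B200 curve, and the function names are illustrative.

```python
def relative_efficiency(throughput_fraction: float, power_fraction: float) -> float:
    """TFLOPS/W at a power cap, relative to TFLOPS/W at maximum power."""
    return throughput_fraction / power_fraction

def dynamic_power_scale(voltage_scale: float, clock_scale: float = 1.0) -> float:
    """Approximate dynamic power scaling, P ~ V^2 * f (capacitance fixed)."""
    return voltage_scale ** 2 * clock_scale

# Operating points quoted above: (fraction of peak TFLOPS, fraction of max power).
h100 = relative_efficiency(0.89, 0.86)
h200 = relative_efficiency(0.91, 0.85)
print(f"H100 at 600W: {h100:.3f}x the TFLOPS/W at full power")
print(f"H200 at 650W: {h200:.3f}x the TFLOPS/W at full power")

# Voltage term alone for a 15% voltage reduction, then with a 10% lower clock.
print(f"voltage only: {dynamic_power_scale(0.85):.4f}")
print(f"with clock:   {dynamic_power_scale(0.85, 0.90):.4f}")
```

The voltage term alone accounts for the 28% reduction quoted above; including the 10% clock reduction in the same approximation would bring dynamic power down to about 65% of its original value.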

The long-term trend is concerning: from A100 (400W) to H100 (700W) to B200 (1000W), the TFLOPS/W has increased by about 2.4x (A100: 0.41, H100: 0.87, B200: 0.99), but the absolute power per GPU has increased by 2.5x. For infrastructure planners, this means the per-rack GPU count is decreasing despite increasing TFLOPS per rack. An H100 rack with 8 GPUs draws 5.6 kW of GPU power; a B200 rack with 8 GPUs draws 8 kW. The power density challenge is the true scaling bottleneck, and power-capped benchmarking is the tool that allows operators to make informed tradeoffs between peak throughput and total system cost.
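The rack-level arithmetic can be written out as a short planning helper. The per-GPU wattages and TFLOPS/W values below are the ones quoted in this section; the 40 kW per-rack GPU power budget and the function names are hypothetical, chosen only to show how GPU count falls as per-GPU power rises.

```python
# Per-GPU power and TFLOPS/W figures quoted in this section.
GPU_WATTS = {"A100": 400, "H100": 700, "B200": 1000}
TFLOPS_PER_WATT = {"A100": 0.41, "H100": 0.87, "B200": 0.99}

def rack_gpu_kw(gpu: str, gpus_per_rack: int = 8) -> float:
    """GPU-only power draw of a rack, in kW."""
    return GPU_WATTS[gpu] * gpus_per_rack / 1000

def gpus_within_budget(gpu: str, rack_budget_kw: float) -> int:
    """How many GPUs fit in a fixed per-rack GPU power budget."""
    return int(rack_budget_kw * 1000 // GPU_WATTS[gpu])

efficiency_gain = TFLOPS_PER_WATT["B200"] / TFLOPS_PER_WATT["A100"]
power_growth = GPU_WATTS["B200"] / GPU_WATTS["A100"]

print(rack_gpu_kw("H100"), rack_gpu_kw("B200"))   # 5.6 8.0
print(f"efficiency {efficiency_gain:.2f}x, power {power_growth:.1f}x")

# Hypothetical 40 kW GPU budget per rack.
print(gpus_within_budget("H100", 40.0), gpus_within_budget("B200", 40.0))
```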
