The End of the Monolithic GPU.

The NVIDIA Blackwell architecture represents the most significant architectural leap since the introduction of CUDA in 2006. For the first time in its history, NVIDIA has moved beyond the reticle limit of a single silicon die. The **B200** is a multi-die GPU, consisting of two reticle-limit-sized dies connected by a **10 TB/s NV-HBI (NVIDIA High-Bandwidth Interface)** die-to-die interconnect. This isn't just a "larger chip"; it's a fundamental reimagining of what a processor can be.

To the software developer and the operating system, this dual-die package appears as a single, massive GPU. The transparency comes from hardware-level cache coherency across the die-to-die link: every Streaming Multiprocessor (SM) on Die A can access memory attached to Die B without any software involvement. In effect, the Non-Uniform Memory Access (NUMA) boundary that multi-die processors have traditionally exposed to programmers is hidden inside the package.
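That transparency is visible from userland: the dual-die B200 enumerates as one CUDA device with one memory pool. A minimal check, assuming a CUDA build of PyTorch (the query API is generic; nothing here is Blackwell-specific):

```python
# On a B200 node the dual-die GPU shows up as a single CUDA device;
# total_memory spans both dies' HBM3e stacks, with no per-die device to query.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"{props.name}: {props.total_memory / 2**30:.0f} GiB, "
          f"{props.multi_processor_count} SMs")
```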

With **208 billion transistors** and a peak of **20 PFLOPS of FP4 compute**, Blackwell isn't just a component: it condenses what was previously an entire rack into a single computational node. It is designed specifically for multi-trillion-parameter models, which stream terabytes of weights and activations across the die boundary every second.

Technical Pillar: 2nd Gen Transformer Engine

The secret to Blackwell's 5x throughput gain over Hopper lies in the **Second Generation Transformer Engine**.

This engine introduces **FP4 (4-bit floating point)** precision as a first-class citizen. By using ultra-low-precision math for the majority of transformer weights, Blackwell doubles peak arithmetic throughput relative to FP8. Crucially, hardware management of dynamic scaling factors (microscaling) keeps accuracy degradation minimal relative to FP16 on many workloads: the hardware continuously monitors the statistical distribution of weights and activations and adjusts the 4-bit representation in real time.
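NVIDIA has not published the on-chip algorithm, but a simplified software analogue of this range tracking, loosely modeled on the delayed-scaling recipe in NVIDIA's open-source Transformer Engine library, looks like the sketch below (the class and window length are illustrative):

```python
# Rolling amax tracking: record each step's absolute maximum and derive the
# next step's scale factor from the window, so the 4-bit range tracks the
# tensor's live distribution instead of a single stale calibration.
import numpy as np

FP4_MAX = 6.0  # largest magnitude representable in E2M1 FP4

class AmaxScaler:
    def __init__(self, history_len: int = 16):
        self.history = np.zeros(history_len)
        self.step = 0

    def next_scale(self, tensor: np.ndarray) -> float:
        """Record this step's amax; return the scale for the next step."""
        self.history[self.step % len(self.history)] = np.abs(tensor).max()
        self.step += 1
        amax = self.history.max()  # max over the rolling window
        return FP4_MAX / amax if amax > 0 else 1.0
```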

When combined with the **1.8 TB/s NVLink 5**, NVIDIA quotes up to 30x higher LLM inference throughput, and up to 25x lower cost and energy consumption, versus the previous generation for certain inference workloads. Every part of the system, from memory to math to networking, scales in unison to reach a new level of AI capability.

- **20 PFLOPS**: FP4 peak performance
- **1.8 TB/s**: NVLink 5 bidirectional bandwidth
- **192 GB**: HBM3e memory capacity

I. The NVLink 5 Interconnect: A Data Center Fabric

NVLink has transitioned from a point-to-point GPU bridge into a full-scale network fabric. In the Blackwell generation, NVLink 5 delivers twice the bandwidth per lane of Hopper's NVLink, and, more importantly, introduces new routing logic for multi-node clusters.

SHARP v4: In-Network Reductions

One of the most critical breakthroughs in Blackwell is the **Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) v4**. In traditional distributed training, GPUs must spend significant cycles performing "All-Reduce" operations—averaging gradients across thousands of nodes. With SHARP v4, this computation is offloaded directly to the **NVLink Switch chip**.

The switch fabric itself performs the mathematical reduction, delivering 14.4 TFLOPS of in-network compute across an NVL72 rack. The GPUs never have to "stop and talk" to one another; they simply fire their gradients into the fabric, and the fabric returns the averaged results. NVIDIA credits this with a **2x performance gain** for synchronization-heavy workloads such as large-scale LLM training.
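From application code the offload is invisible: the identical collective call is issued either way, and the NCCL library decides whether the switches or the GPUs do the math. A minimal torch.distributed sketch, assuming a multi-GPU node launched with torchrun; NCCL_COLLNET_ENABLE is NCCL's documented opt-in for in-network collectives, though whether SHARP actually engages depends on the fabric:

```python
# Gradient averaging with NCCL. With in-network reductions available,
# the switch fabric performs the sum and the GPUs only send and receive.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_COLLNET_ENABLE", "1")  # request SHARP offload
dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

grads = torch.randn(1024, 1024, device="cuda")  # stand-in for local gradients
dist.all_reduce(grads, op=dist.ReduceOp.AVG)    # averaged across all ranks
dist.destroy_process_group()
```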

NVL72 Spine-Leaf Topology

Inside the GB200 NVL72 rack, the topology is a non-blocking **Fat-Tree**. Every GPU has a direct, dedicated path to all 13.5 TB of memory in the rack via the NVLink Switch backplane. This provides **57.6 TB/s of aggregate bisection bandwidth**—a throughput level that makes traditional InfiniBand networks look like dial-up.

Multi-Rack Dragonfly

When scaling beyond a single rack (up to 32,768 GPUs), Blackwell utilizes a **Dragonfly topology**. This minimizes the number of "long-haul" optical cables required between rows while keeping the network diameter (maximum hop count) extremely low. NVIDIA's **Unified Fabric Manager (UFM)** dynamically reroutes traffic to avoid congestion hot-spots in these massive meshes.

II. FP4 Precision: The Math of the Future

Why FP4? Because the information content of trained neural network weights is far lower than their 16-bit storage suggests, and Blackwell is the first chip to exploit that redundancy at the silicon level.

Traditional quantization (like INT8 or FP8) applies a single scaling factor to an entire tensor. However, LLM weights are non-uniform—they have "outliers" that carry disproportionate information. If you quantize the whole tensor based on the outliers, the majority of the weights lose resolution.

MXFP4: Microscaling Logic

Blackwell utilizes the **MXFP4** format, which introduces **fine-grained microscaling**. Instead of one scale factor per tensor, it applies a shared 8-bit scale factor to every **32-element block** of weights, per the OCP Microscaling specification. This micro-windowing ensures that even at 4-bit precision, an outlier distorts only its own small block, keeping the effective signal-to-noise ratio (SNR) competitive with coarser-grained FP8 schemes. The **2nd Gen Transformer Engine** manages this dynamic range in hardware, in real time.
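The effect is easy to see in a numpy sketch. The E2M1 value grid and the 32-element block size follow the OCP MX specification; the scale-selection and rounding rules below are simplified stand-ins, not NVIDIA's silicon behavior:

```python
# MXFP4-style fake quantization: one shared power-of-two scale per 32-element
# block; elements snap to the E2M1 grid, and overflows saturate at 6.0.
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
BLOCK = 32

def quantize_mxfp4(x: np.ndarray) -> np.ndarray:
    """Fake-quantize a 1-D array whose length is a multiple of BLOCK."""
    out = np.empty_like(x, dtype=np.float64)
    for i in range(0, len(x), BLOCK):
        block = x[i:i + BLOCK]
        amax = np.abs(block).max()
        # Power-of-two shared scale (E8M0-style) chosen so the block's
        # largest magnitude lands near the top of the FP4 range.
        scale = 2.0 ** np.floor(np.log2(amax / FP4_GRID[-1])) if amax > 0 else 1.0
        mag = np.abs(block) / scale
        idx = np.abs(mag[:, None] - FP4_GRID[None, :]).argmin(axis=1)
        out[i:i + BLOCK] = np.sign(block) * FP4_GRID[idx] * scale
    return out

x = np.random.randn(64)
print(np.abs(x - quantize_mxfp4(x)).max())  # worst-case element error
```

Because each scale is local, an outlier inflates the quantization step only for its own 32 neighbors rather than for the entire tensor.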

Throughput & Energy

Switching from FP8 to FP4 effectively doubles the arithmetic throughput while halving the memory pressure on the HBM3e subsystem. In the Blackwell architecture, these energy savings can be reinvested in higher clocks and larger batches. NVIDIA quotes a **25x energy-efficiency improvement** for 1.8-trillion-parameter model inference compared to Hopper's FP8 paths.

| Format | Precision | Throughput (B200) | Ideal Use Case |
| --- | --- | --- | --- |
| MXFP4 | 4-bit floating point | 20,000 TFLOPS | Large-scale LLM inference |
| FP8 | 8-bit floating point | 10,000 TFLOPS | Mainstream training / inference |
| FP16 | 16-bit floating point | 5,000 TFLOPS | High-precision fine-tuning |

III. GB200 NVL72: The Rack is the Unit of Compute

With Blackwell, NVIDIA is moving from selling "Server Boards" to selling full "Liquid-Cooled Racks." The primary vehicle for Blackwell is the GB200 NVL72.

**72x GPUs / 13.5 TB HBM3e.** Connected via NVLink into a massive, coherent shared memory pool. To the operating system and the CUDA compiler, this complex rack looks like a single large workstation with a 13.5 TB VRAM buffer. It is the first time scale-up within a single rack has reached this level of bandwidth (57.6 TB/s total bidirectional).

**36x Grace CPUs / 2,500+ cores.** The Grace CPUs work in tandem with the Blackwell GPUs, handling sequential logic and operating-system tasks while feeding the AI engines over the 900 GB/s NVLink-C2C link. The Grace-Blackwell (GB200) superchip, which pairs one Grace CPU with two B200 GPUs, is among the most densely integrated compute modules ever shipped.

"In the Blackwell era, the data center is the new unit of compute. The line between a single chip and a massive network is now completely blurred, creating a singular AI organism."

IV. The Physics of NV-HBI: Unified Silicon

How do you merge two dies into one processor without the software ever knowing? The B200 uses a specialized silicon interposer carrying a die-to-die protocol called **NV-HBI (NVIDIA High-Bandwidth Interface)**.

CoWoS-L: Beyond the Reticle Limit

A single silicon die cannot exceed the "reticle limit" (~858 mm²), the maximum area a lithography scanner can expose in a single step. To overcome this, NVIDIA uses **TSMC CoWoS-L** (Chip-on-Wafer-on-Substrate with Local silicon interconnect).

The two Blackwell dies are "stitched" together on a massive interposer using dedicated local silicon bridge layers. These bridges carry the **NV-HBI** protocol, which provides 10 TB/s across the die-to-die boundary. This is not just a high-speed bus; it is a coherent memory fabric. If Die A is running an operation that needs weights stored in Die B's HBM3e pool, it can access them with **zero software intervention** and nanosecond-scale latency.

- **10 TB/s** NV-HBI throughput: 25x higher bandwidth than PCIe Gen 5, at a ~0.5 pJ/bit energy cost.
- **208B** transistor count: combined figure for the dual-die system.

V. Geopolitical Forensics: The Blackwell Supply Chain

The complexity of producing Blackwell is so immense that it is reshaping the global technology landscape and national security priorities.

Blackwell leverages **TSMC's 4NP node**, a highly optimized variant of the 4nm process. However, the true bottleneck isn't the logic die; it's the HBM3e and the **CoWoS-L** packaging (the Local-silicon-interconnect variant described above). Blackwell requires an interposer nearly 3x the size of Hopper's, pushing the limits of interposer and reticle-stitching technology. This scale makes Blackwell packages some of the largest functioning electronic assemblies ever mass-produced.

This has led to a massive consolidation in the global AI supply chain. Companies that can provide precision liquid cooling subsystems (CDUs), high-power busbars capable of carrying thousands of amps, and rapid HBM3e testing are seeing unprecedented growth. Every Blackwell rack consumes as much power as a small neighborhood (~120kW), necessitating an entire ecosystem of specialized power engineering that simply didn't exist at this scale two years ago.

VI. RAS & Confidential Computing: The Security Engine

As AI models become the most valuable intellectual property on earth, securing the weights and data during computation is no longer optional. Blackwell introduces the **Blackwell Security Engine**, a dedicated hardware subsystem designed for multi-tenant AI clouds.

TEE (Trusted Execution Environments)

Blackwell supports hardware-level TEEs in which model weights and data remain encrypted in HBM3e and are decrypted only *inside* the GPU logic. This prevents even the cloud provider or a compromised OS kernel from snooping on the computation. NVIDIA states the throughput penalty for this line-rate encryption is near zero, thanks to dedicated AES-256 engines in the memory path.

SDE (Silent Data Error) Detection

At 20 PFLOPS, even a single undetected bit flip can silently corrupt an LLM training run. Blackwell's RAS subsystem continuously monitors the logic gates and memory arrays for silent errors, using high-order parity checks and autonomous recovery logic. It can identify a failing Streaming Multiprocessor (SM) and transparently reroute the workload to a healthy one before a training crash occurs.

VII. Power Hydraulics: The 120kW Vertical PDN

The GB200 NVL72 rack consumes **120kW** of power—the highest power density in the history of the data center. To deliver this much energy without melting the copper busbars, NVIDIA transitioned to a **48V Power Distribution Network (PDN)**.

In a traditional 12V system, bus currents would reach 10,000 amps, incurring massive resistive (I²R) losses. By running 48V all the way to the compute board, NVIDIA cuts conduction losses by **16x** for a given conductor cross-section, allowing far thinner cabling. Board-level **Vertical Power Delivery** modules then perform a final, high-efficiency conversion to the ~1V required by the silicon dies, placing the voltage regulators directly beneath the GPU to minimize transient noise and inductive drop.
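The 16x figure falls directly out of Ohm's law: at fixed power, current scales inversely with voltage, and conduction loss with the square of the current.

```latex
P_{\text{loss}} = I^{2}R, \qquad I = \frac{P}{V}
\;\Rightarrow\;
\frac{P_{\text{loss}}(12\,\mathrm{V})}{P_{\text{loss}}(48\,\mathrm{V})}
  = \left(\frac{48}{12}\right)^{2} = 16
```

At 120 kW, that is the difference between roughly 10,000 A on a 12 V bus and 2,500 A at 48 V.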

VIII. The Cooling Crisis: 120kW per Rack

Heat is the primary limit of computation. At 120kW per rack, the Blackwell NVL72 generation has officially broken the air-cooling model of the modern data center.

To manage Blackwell's heat flux, NVIDIA and its partners have moved to **manifold-based liquid cooling**: coolant circulates in a closed loop through cold plates mounted directly on the GPUs and CPUs. This lets the rack hold a normal 25°C server-room ambient while the chips run at 100% load. Without liquid cooling, Blackwell could not sustain its 20 PFLOPS peak within the package's thermal limits.
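A rough sizing check shows why liquid is unavoidable. Assuming water coolant and a 10 °C loop temperature rise (both illustrative assumptions, not published NVL72 specifications), removing 120 kW requires a mass flow of roughly:

```latex
\dot{m} = \frac{P}{c_{p}\,\Delta T}
        = \frac{120\,000\ \mathrm{W}}{4186\ \mathrm{J\,kg^{-1}\,K^{-1}} \times 10\ \mathrm{K}}
        \approx 2.9\ \mathrm{kg/s} \approx 170\ \mathrm{L/min}
```

Water's volumetric heat capacity is several thousand times that of air, which is why no realistic airflow can carry 120 kW out of a single rack.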

IX. TensorRT-LLM: The Software Model for Blackwell

The Blackwell B200 is more than just silicon; it is a software-defined architecture. **TensorRT-LLM** has been rebuilt from the ground up to exploit Blackwell's specific hardware hooks, including:

Speculative Decoding

Blackwell hardware can execute two models side by side: a small "draft" model and a larger "verification" model. NVIDIA credits this hardware-assisted speculative decoding with up to **3x** higher inference throughput than pure software implementations.
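As a sketch of the control flow (the greedy acceptance variant; `draft` and `target` are assumed stand-in callables, not a TensorRT-LLM API):

```python
# One speculative-decoding step: the draft model proposes K tokens serially,
# then a single parallel pass of the target model verifies them. target(seq)
# is assumed to return next-token scores for every prefix position at once.
import numpy as np

def speculative_step(draft, target, prompt: list, K: int = 4) -> list:
    proposal = list(prompt)
    for _ in range(K):  # cheap serial drafting on the small model
        proposal.append(int(np.argmax(draft(proposal))))
    preds = target(proposal[:-1])  # one batched verification pass
    out = list(prompt)
    for pos in range(len(prompt), len(proposal)):
        best = int(np.argmax(preds[pos - 1]))
        out.append(best)           # matches are accepted for free;
        if best != proposal[pos]:  # the first mismatch substitutes the
            break                  # target's own token and ends the step
    return out
```

When the draft model guesses well, each large-model pass yields several accepted tokens instead of one, which is where the throughput multiplier comes from.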

Chunked Attention

By breaking KV caches into small physical chunks allocated on demand from Blackwell's 192 GB HBM3e pool, TensorRT-LLM avoids fragmentation and pushes back the "memory wall" for long-context windows (128k tokens and beyond).
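A bookkeeping sketch of the idea, in the spirit of paged KV-cache designs (the chunk size, pool shape, and class are illustrative, not TensorRT-LLM internals):

```python
# Chunked KV cache: keys/values live in fixed-size chunks drawn from a shared
# pool, so a long context never needs one huge contiguous allocation.
import numpy as np

CHUNK_TOKENS, HEADS, HEAD_DIM = 64, 8, 128

class ChunkedKVCache:
    def __init__(self, n_chunks: int):
        # Pool layout: [chunk, K-or-V, token-in-chunk, head, head_dim]
        self.pool = np.zeros((n_chunks, 2, CHUNK_TOKENS, HEADS, HEAD_DIM),
                             dtype=np.float16)
        self.free = list(range(n_chunks))
        self.table = {}  # sequence id -> list of chunk ids, in order

    def append(self, seq: int, pos: int, k: np.ndarray, v: np.ndarray):
        """Write one token's K/V, grabbing a fresh chunk at each boundary."""
        chunks = self.table.setdefault(seq, [])
        if pos % CHUNK_TOKENS == 0:
            chunks.append(self.free.pop())  # allocate on demand
        chunk = chunks[pos // CHUNK_TOKENS]
        self.pool[chunk, 0, pos % CHUNK_TOKENS] = k
        self.pool[chunk, 1, pos % CHUNK_TOKENS] = v
```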

X. Real-World Impact: Beyond Just Chatbots

While the world is focused on LLMs, Blackwell will revolutionize the sciences at a fundamental level.

Drug Discovery

Blackwell's FP4 engines can run protein-structure prediction (AlphaFold-class models) roughly 10x faster than previous clusters. This enables near-real-time virtual screening of billions of chemical compounds, potentially compressing the screening phases of drug development from years to months.

Weather Prediction

Global weather models (FourCastNet) require massive memory bandwidth to ingest petabytes of planetary sensor data. Blackwell's 57.6 TB/s rack-level bandwidth enables sub-kilometer-resolution weather forecasting, providing earlier warning for natural disasters with unprecedented precision.

XI. Competition: Blackwell vs MI300X vs Gaudi 3

Is Blackwell a monopoly? While AMD and Intel have released incredible hardware, the Blackwell ecosystem remains the gold standard for "Platform Cohesion."

The **AMD MI300X** matches Blackwell's 192 GB of HBM capacity (and far exceeds the H100's) but lacks an equivalent of the **2nd Gen Transformer Engine** and Blackwell's 10 TB/s NV-HBI inter-die bridge. The **Intel Gaudi 3** offers strong price-to-performance for training but struggles to match NVIDIA's **FP4 inference throughput**. NVIDIA's secret weapon isn't just the chip; it's the 15-year head start of the **CUDA software stack**, which translates Blackwell's raw TFLOPS into actual production value for enterprises.


XII. The Path to AGI: Why Blackwell is the "Singularity Silicon"

Many industry experts believe that **Artificial General Intelligence (AGI)** requires a compute leap that air-cooled, single-die GPUs could never provide. Blackwell represents the first stage of what NVIDIA calls the "Compute Singularity"—the moment when the silicon fabric itself becomes the bottleneck rather than the algorithm.

By unifying 72 GPUs into a single memory domain (NVL72), Blackwell moves us from "multi-node compute" to "system-wide intelligence." In this model, the entire data center rack functions as a single brain. This is critical for **continuous learning** systems, where a model must update its weights in real time as it interacts with the world. The 10 TB/s NV-HBI bridge and SHARP v4 reductions provide the "reflex speed" such a system needs to maintain coherence across trillions of parameters.

Memory Coherence

Atomic operations across all 72 GPUs allow for decentralized weight updates without consistency lag.

Sparse Training

Blackwell's native support for 2:4 structured sparsity lets the tensor cores skip half of the multiply-accumulates, complementing "Mixture of Experts" routing; a minimal sketch of the pattern follows this list.

Auto-Recovery

The Blackwell RAS system ensures that AGI-scale training sessions (lasting months) can survive individual chip defects.
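As referenced above, here is a minimal numpy sketch of the 2:4 structured-sparsity pattern Blackwell's tensor cores accelerate; the keep-the-two-largest-magnitudes rule is the standard pruning recipe, and nothing below is NVIDIA-specific code:

```python
# 2:4 structured sparsity: in every group of four weights, only the two
# largest-magnitude values survive, so hardware can skip half the multiplies.
import numpy as np

def prune_2_4(w: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude weights in each group of four."""
    groups = w.reshape(-1, 4)
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]  # indices to zero out
    mask = np.ones_like(groups)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return (groups * mask).reshape(w.shape)

w = np.random.randn(2, 8).astype(np.float32)
print(prune_2_4(w))  # exactly two non-zeros per group of four
```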

XIII. The Blackwell Encyclopedia (The Ultimate Reference Guide)

B200 The flagship Blackwell GPU; 208B transistors, dual-die configuration with a 10 TB/s NV-HBI die-to-die bridge.

BlueField-3 The DPU of the Blackwell generation, handling high-speed cluster networking and storage offload at line-rate.

CDU (Cooling Distribution Unit) The centralized pump system that manages chilled water flow in the GB200 NVL72 rack.

CoWoS-L Advanced packaging variant for "Large" multi-die interposers, enabling the dual-die B200 architecture.

DCQCN Data Center Quantized Congestion Notification; a congestion-control scheme for RDMA-over-Ethernet (RoCE) traffic in large AI clusters.

De-processing Hardware logic for stripping unneeded precision during AI inference, managed by the Transformer Engine.

Double-Reticle A chip design spanning two reticle-limited dies in one package (e.g., Blackwell).

Dragonfly Topology Inter-rack network design that reduces hop count and expensive optical cabling in ultra-scale clusters.

E-Mode Energy conservation mode for Blackwell GPUs during idle states, reducing leakage current by 40%.

FP4 New 4-bit floating point precision standard delivering 20 PFLOPS capacity with dynamic microscaling.

GB200 The Grace-Blackwell superchip pairing a Grace CPU with two B200 GPUs over a coherent link.

HBM3e Extended high-bandwidth memory stacks providing 192GB of capacity and 8 TB/s of bandwidth per B200 module, the highest in the SXM segment.

In-Network Computing The ability of switch hardware (SHARP v4) to perform data reductions without involving the GPU compute cores.

LCC (Liquid-to-Liquid CDU) Specialized heat exchanger that transfers heat from the rack-level secondary loop to the facility master loop.

LDPC (Low-Density Parity Check) Advanced error correction for the ultra-fast NVLink 5 data streams.

Manifold The distribution piping in the GB200 rack that delivers coolant to each GPU cold plate.

MCM Multi-Chip Module; the dual-die packaging strategy that defines the Blackwell generation.

Microscaling The mathematical technique of applying shared scale factors to small blocks of low-precision weights (32 elements per block in MXFP4).

NVLink 5 5th Gen high-speed link delivering 1.8 TB/s per GPU, the primary fabric of the AI data center.

NVL72 The full 72-GPU rack configuration that functions as a single 13.5TB supercomputer.

PAM4 Pulse Amplitude Modulation 4-level; the signaling method used to drive 224Gbps over Blackwell copper links.

Post-Training Quantization Compressing a trained model's weights (e.g., to FP4) without retraining; hardware-accelerated by the Blackwell FP4 engines.

PUE (Power Usage Effectiveness) A measure of data center efficiency; Blackwell racks aim for sub-1.1 PUE via liquid cooling.

Reticle Limit The physical maximum size of a single silicon die; each of the B200's two dies is reticle-sized.

SDE (Silent Data Error) Errors in computation that go undetected by simple parity; Blackwell hardware has specific logic for SDE forensics.

SerDes Serializer/Deserializer; the circuits that move data at 224Gbps over Blackwell copper trays.

SHARP v4 Network protocol for offloading math operations directly to the fabric switch hardware.

Speculative Decoding Inference technique that uses a small model to "guess" tokens, verified by a larger model on Blackwell hardware.

TSV (Through-Silicon Via) Vertical electrical connections that pass through the silicon wafer, critical for HBM3e stack integration.

Transformer Engine 2.0 Software/Hardware stack that autonomously manages precision and sparsity for LLMs.

Yield Rate The percentage of functioning silicon chips per manufactured wafer at TSMC, critical for B200 supply.

Z-Height The physical height of the HBM3e stack; Blackwell's 8-high and 12-high stacks are pushing the limits of the B200 package envelope.

