PingDo.net
by Pingdo
AI Infrastructure Hub

AI Fabrics & GPU Clusters

The network is the computer. Deconstructing the forensics of RoCE v2, InfiniBand NDR, RDMA hydraulics, and the non-blocking topologies required for hyperscale bisection bandwidth.

Total Resources: 45
Deep Guides: 45
Lab Tools: 3
Engineering: 45

GPU Fabric & RoCE

42 articles

InfiniBand vs RoCE v2 & RDMA Mechanics

Adaptive Routing vs. Ethernet ECMP: Forensic AI Fabric Balancing

Scaling GPU clusters without the bottleneck of static hashing. Comparing InfiniBand dynamic routing vs standard Ethernet ECMP at 800G/1.6T scales.

AI Networking Infrastructure: The GPU-Centric Fabric

Master the architecture of AI networking clusters. Deconstructing RoCE v2, InfiniBand vs. Ethernet, and the engineering of non-blocking fabrics for LLM training.

AI-Driven Predictive Maintenance

Discover how AI and Machine Learning are transforming network maintenance from reactive logic to proactive, self-healing architectures.

Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network

Scaling GPU fabrics without the power tax of electronic switching. How Optical Circuit Switching (OCS) is defining the next generation of AI clusters.

NVIDIA Blackwell B200: Architecture of the 20 PFLOPS GPU

Inside NVIDIA Blackwell: Engineering analysis of the B200 dual-die architecture, 5th Gen NVLink, FP4 precision, and the transition to the 72-GPU liquid-cooled rack.

CPO Technology Roadmap for AI Supercomputing

Solving the chip-to-fiber bottleneck: CPO vs LPO vs pluggable transceivers for next-gen AI fabrics.

Bringing HPC Techniques to Deep Learning

An engineering guide to collective communication in AI clusters. Understanding NCCL, All-Reduce algorithms, and the communication wall in distributed LLM training.

BlueField DPU Architecture and Programming Guide

Reclaiming CPU cycles for AI: How DPUs manage storage, security, and networking in modern data centers.

Edge AI: Benchmarking ARM vs. RISC-V for Inference (2026)

Deep dive into decentralized AI architecture. Comparing ARM Cortex-X5, RISC-V vectors, and INT4 quantization for private local inference.

A Study of Non-Blocking Switching Networks

Scaling GPU fabrics: Why Fat-Tree dominates AI while Dragonfly promises cost reduction.

FlashAttention Deep Dive: Breaking the Quadratic Memory Ceiling

A technical exploration of IO-aware attention mechanisms and SRAM tiling.

A Study of Non-Blocking Switching Networks

Advanced data center topologies for AI workloads.

GPU Performance Benchmarks: The AI Infrastructure Audit

Cross-generational engineering analysis of H100, H200, and Blackwell TFLOPS vs. HBM bandwidth. Understanding the metrics that define AI cluster performance.

NVIDIA GB200 Liquid Cooling Design Guide v1.0

Scaling high-density AI infrastructure: How to dissipate 120kW per rack using Direct Liquid Cooling (DLC).

NVIDIA H100 vs H200: The HBM3e Memory Revolution

Comparative engineering analysis of H100 (Hopper) vs H200: Impact of HBM3e bandwidth on LLM inference, training throughput, and memory-bound scaling.

HBM3e GPU Memory Deep Dive: The 8TB/s Bandwidth Wall

A technical forensics guide to bisection bandwidth, load balancing, and cabling economics in 100,000-GPU AI clusters.

Ethernet 1.6T: The AI Networking Masterclass (2026)

Deep dive into 800G and 1.6T Ethernet, 224G SerDes, and the Ultra Ethernet Consortium (UEC). Learn why Ethernet is finally matching InfiniBand for AI.

FP8 vs BF16 vs INT8: The Forensic Guide to AI Precision Scaling

Forensic analysis of AI precision formats. Comparative study of FP8 (E4M3/E5M2), BF16, and INT8 stability and throughput.

GPUDirect RDMA: The Blueprint for Zero-Copy Networking

A technical forensics guide to bisection bandwidth, load balancing, and cabling economics in 100,000-GPU AI clusters.

GPUDirect Storage (GDS): Accelerating Data-Path Performance

Analyzing NVIDIA GPUDirect Storage (GDS) and its impact on large-model training and checkpointing. Removing the CPU/DRAM bounce-buffer to unleash 100GB/s+ storage paths.

The Addition of Explicit Congestion Notification (ECN) to IP

Explicit Congestion Notification: The first line of defense in high-performance GPU networking.

NVIDIA Collective Communications Library (NCCL) Programming Guide

Scaling GPU communication: How NCCL uses Ring, Tree, and NVLink to maximize bandwidth for AI training.

NVIDIA NDR800 vs. NDR400: Scaling 800G AI Fabrics

Deep dive into 800G InfiniBand architecture. Comparing Quantum-3 vs. Quantum-2, 224G SerDes, and the mechanical limits of AI cluster scaling.

InfiniBand XDR: The Physics of Zero-Jitter AI Fabrics (2026)

Exploring InfiniBand XDR (800G) and the GDR roadmap. SHARP v4, Dragonfly+ topologies, and why IB remains the gold standard for LLM training.

InfiniBand Architecture Specification Volume 1

Scaling InfiniBand: Why the Subnet Manager is the brain of the AI infrastructure.

Ethernet Frame Size and its Impact on Network Throughput

Scaling network throughput by increasing packet size. Why AI environments cannot survive on the standard 1500B MTU.

Mobile NPU Optimization: Squeezing LLMs into 8GB (2026)

Advanced guide to Apple A19 Neural Engine, Qualcomm Hexagon, and Google Tensor G6. Learn about KV-cache sharding and ExecuTorch optimization.

NVLink vs. NVSwitch: Scaling the Intra-Node Fabric

Deep dive into NVIDIA's memory fabric. Analyzing the bandwidth, topology, and scale-up limits of NVLink 4.0 and NVSwitch systems for LLM training.

NVM Express Base Specification Revision 2.0

Scaling high-speed storage for AI: How NVMe-oF uses RDMA to deliver millions of IOPS over the fabric.

AI Infrastructure Hub: GPU Networking & Cluster Architecture

Deep engineering resources for architecting AI training fabrics: RoCE v2 vs. InfiniBand, non-blocking topologies, and 800G GPU interconnects.

Parallel File Systems: Lustre, BeeGFS & GPFS

Deep dive into Parallel File Systems for AI cluster scaling. Analyzing the architecture of Lustre, BeeGFS, and IBM Storage Scale (GPFS) for multi-petabyte datasets.

Efficient Large-Scale Language Model Training on GPU Clusters

Scaling LLM training: How sharding models and data across thousands of GPUs changes the demands on the network fabric.

PCIe Gen6 & 7: The Host Interface for AI

Analyzing the transition to PCIe Gen6 (PAM4) and PCIe Gen7 for AI accelerators. Comparing bandwidth, power profiles, and IOPS requirements for dense GPU servers.

IEEE Std 802.1Qbb: Priority-based Flow Control

Understanding the balance between flow control and fair scheduling in RDMA fabrics.

Thermal and Electrical Load Balancing in Multi-GPU Racks

Moving past PUE: How voltage stability and VRM response times impact high-frequency AI inference.

DGX H100 System Architecture and Network Optimization

Scaling distributed training across thousands of GPUs: Strategy for rail alignment in AI network fabrics.

RDMA Performance Optimization: Tuning for AI

Deep dive into RDMA (Remote Direct Memory Access) tuning for RoCE (RDMA over Converged Ethernet) and InfiniBand. Optimizing queue depth, adaptive routing, and buffer management.

Supplement to InfiniBand Architecture Specification: RoCE v2

A deep dive into the protocol efficiency of RDMA over Converged Ethernet vs. InfiniBand.

InfiniBand Architecture Specification Volume 1

Engineering guide to InfiniBand Architecture Specification Volume 1.

Storage Infrastructure for AI | GPUDirect & NVMe-oF Deep Dive

Solve the AI IO wall. Engineering guide to GPUDirect Storage (GDS), NVMe-over-Fabrics, and high-performance checkpointing for GPU clusters.

Universal Chiplet Interconnect Express (UCIe) 1.1 Specification

Standardizing die-to-die communication: How UCIe is breaking the reticle limit for the next generation of AI compute.

Ultra Ethernet Consortium: A Scalable Transport for AI and HPC

Solving the scale problem: How UEC is building a high-performance transport layer for the world's largest GPU clusters.

Knowledge Ecosystem

Explore Specialized Engineering Hubs

Deep-dive into dedicated listing pages for every major networking discipline, optimized for professional reference and architectural planning.

GPU Fabric & RoCE

InfiniBand vs RoCE v2 & RDMA Mechanics

Cluster Topology

Rail-Optimized Fat-Tree & Non-Blocking Fabrics

800G & Optics

OSFP/QSFP112, PAM4 & Bit Error Rate (BER) Logic

Collective Comms

All-Reduce, NCCL/RCCL & Gradient Synchronization

Training Dynamics

Scaling Laws, MoE & Synthetic Datasets
