Why does RoCE v2 use UDP port 4791?

UDP port 4791 is the IANA-assigned destination port for RoCE v2. Because RoCE v2 uses ordinary UDP/IP encapsulation, its packets can traverse standard L3 IP routers. The destination port is the same for every flow, so senders vary the UDP source port per flow, and switches include that source port in their ECMP (Equal-Cost Multi-Path) hash. This helps spread RDMA flows across the available physical links.
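The role of the source port can be illustrated with a small sketch. It is not how any particular switch ASIC hashes: `ecmp_link` is a hypothetical helper, `zlib.crc32` stands in for the vendor-specific hash function, and the IP addresses and port range are made up for the example.

```python
import zlib
from collections import Counter

ROCE_V2_DST_PORT = 4791  # IANA-assigned UDP destination port for RoCE v2


def ecmp_link(src_ip: str, dst_ip: str, src_port: int, num_links: int) -> int:
    """Pick an uplink index from a hash of the flow's 5-tuple (illustrative only)."""
    # Protocol number 17 is UDP; the destination port is constant for RoCE v2.
    key = f"{src_ip}|{dst_ip}|17|{src_port}|{ROCE_V2_DST_PORT}".encode()
    return zlib.crc32(key) % num_links


# Same two endpoints, many flows: only the UDP source port differs.
links = Counter(
    ecmp_link("10.0.0.1", "10.0.1.1", src_port, num_links=4)
    for src_port in range(49152, 49152 + 1024)
)
print(dict(links))  # number of flows hashed onto each of the 4 links
```

Without the varying source port, every flow between these two hosts would hash to the same link.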

What is Invariant CRC (ICRC)?

ICRC is a 32-bit (4-byte) trailer added to RoCE v2 packets to protect data integrity end to end across the fabric. Unlike the standard Ethernet FCS, which is regenerated whenever a hop rewrites the frame, the ICRC covers only the parts of the packet that do not change in transit (the invariant fields) and is checked by the receiving NIC. This catches 'silent bit flips': errors that occur inside a switch's memory and get 're-blessed' with a valid FCS upon egress.

How does the MTU impact RoCE efficiency?

Since the RoCE v2 header stack is roughly fixed (approx 54-70 bytes, depending on VLAN tagging and extended transport headers), a larger MTU (like 4096 or 9000 bytes) spreads that overhead over a larger payload. Taking the 54-byte lower bound, the standard 1500-byte Ethernet MTU loses ~3.6% of bandwidth to framing; a 4KB MTU reduces this loss to ~1.3%.
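A minimal sketch of this arithmetic, assuming the loss is modeled simply as header bytes divided by MTU and using the 54-byte lower bound of the range (untagged Ethernet, IPv4, UDP, BTH). The function name `framing_loss` is illustrative.

```python
def framing_loss(header_bytes: int, mtu_bytes: int) -> float:
    """Fraction of each MTU-sized packet consumed by headers (overhead / MTU)."""
    return header_bytes / mtu_bytes


UNTAGGED_STACK = 14 + 20 + 8 + 12  # Ethernet + IPv4 + UDP + BTH = 54 bytes

for mtu in (1500, 4096, 9000):
    print(f"MTU {mtu}: {framing_loss(UNTAGGED_STACK, mtu):.1%}")
# MTU 1500: 3.6%
# MTU 4096: 1.3%
# MTU 9000: 0.6%
```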

What is the OpCode Tax in RDMA?

Different RDMA operations require different headers. A READ or WRITE over a 'Reliable Connection' (RC) adds an RDMA Extended Transport Header (RETH) carrying the virtual address, memory key, and DMA length, which adds 16 bytes. If you use 'Immediate Data,' another 4 bytes are added. The overhead is therefore dynamic: it depends on the operation.
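The per-operation cost can be tabulated with a short sketch. The header sizes come from the text above; the function name and its boolean parameters are hypothetical, and the model covers only the InfiniBand transport headers, not Ethernet/IP/UDP.

```python
BTH = 12      # Base Transport Header, present in every packet
RETH = 16     # RDMA Extended Transport Header: virtual address, R_Key, DMA length
IMM_DATA = 4  # Immediate Data


def transport_header_bytes(uses_reth: bool, uses_immediate: bool) -> int:
    """InfiniBand transport header bytes carried by a single packet."""
    return BTH + (RETH if uses_reth else 0) + (IMM_DATA if uses_immediate else 0)


print(transport_header_bytes(False, False))  # 12: BTH only
print(transport_header_bytes(True, False))   # 28: BTH + RETH
print(transport_header_bytes(True, True))    # 32: BTH + RETH + Immediate Data
```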

Why do we need 32-bit alignment in RDMA?

The InfiniBand specification (which RoCE v2 is based on) requires that every packet payload be padded to a multiple of 4 bytes (32 bits). If you send a message of 1001 bytes, the NIC will add 3 'Padding' bytes to reach 1004. The pad count is recorded in the BTH so the receiver can strip the padding; these bytes are invisible to the application but consume physical bandwidth.
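The padding rule reduces to one line of modular arithmetic. A minimal sketch, with `pad_count` as an illustrative name:

```python
def pad_count(payload_bytes: int) -> int:
    """Bytes of padding needed to reach the next multiple of 4 (0 to 3)."""
    return (-payload_bytes) % 4


for size in (1000, 1001, 1002, 1003):
    print(size, "->", size + pad_count(size))
# 1000 -> 1000
# 1001 -> 1004
# 1002 -> 1004
# 1003 -> 1004
```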


RoCE v2 Overhead & Goodput Modeler

A precision simulator for RDMA framing economics. Deconstruct the byte-level cost of RDMA READ/WRITE operations and calculate effective cluster bandwidth.

Configuration

Inputs: IP Version, MTU Size (1500 bytes), Payload Size (4 KB), VLAN Tag

Total Overhead: 66B (headers + trailers per packet)

Wire Efficiency: 70.69% (payload / total wire size)

vs Native IB: -0.6% (efficiency gap vs InfiniBand)

RoCE v2 Header Stack

Byte-by-byte breakdown for IPv4 encapsulation.

Ethernet: 14B
IP Header: 20B
UDP
IB BTH: 12B
DDP RDE
Payload: 1434B max
iCRC
FCS

RoCE v2 Wire

Packets Needed: 3
Total Wire Size: 5.66 KB
Overhead Per Packet: 66B

Native InfiniBand

Headers Only: ~18B
Efficiency: 71.28%
Advantage: +0.6%

"Jumbo Frames (MTU 9000) significantly reduce the overhead ratio for RoCE v2 workloads."

1. The 54-Byte Tax: Encapsulation Physics

In a standard Ethernet environment, overhead is predictable. RoCE v2, however, stacks multiple protocol layers to achieve routability across Leaf-Spine topologies. This stacking adds 54 bytes of headers to every untagged packet, or 58 bytes once an 802.1Q VLAN tag is included.

Header Stack Breakdown

$$\text{RoCE}_{\text{overhead}} = \text{Eth}(18\,\text{B}) + \text{IP}(20\,\text{B}) + \text{UDP}(8\,\text{B}) + \text{BTH}(12\,\text{B}) = 58\,\text{B}$$

Ethernet (VLAN) | IP V4/V6 | UDP (4791) | IB BTH

While 58 bytes seems trivial, on a 4KB MTU it represents a **1.4% Bandwidth Leak**. If your cluster uses a standard 1.5KB MTU, this tax jumps to **3.8%**. In a cluster with $200 Million in H100 GPU capital, applying that 3.8% leak to the full investment corresponds to roughly **$7.6 Million of "Stranded" Network capacity**.
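As a quick check of these figures, assuming the leak is modeled as header bytes divided by MTU (a simplification that ignores the ICRC, FCS, preamble, and inter-frame gap):

$$\frac{58\,\text{B}}{4096\,\text{B}} \approx 1.42\%, \qquad \frac{58\,\text{B}}{1500\,\text{B}} \approx 3.87\%, \qquad 0.038 \times \$200\,\text{M} = \$7.6\,\text{M}$$

The 3.8% quoted in the text is the second ratio truncated to one decimal place.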

2. Header Forensics: The Dynamic OpCode Tax

The 58-byte fixed overhead is only the starting point. Depending on the RDMA operation and transport service, the NIC inserts Extended Transport Headers (ETH) after the BTH.

RETH Extension (+16B)

Required for RDMA READ/WRITE operations. Carries the Virtual Address, R_Key (Remote Key), and DMA length. It appears in the READ request and in the first packet of a WRITE, where total header overhead reaches 74 bytes.

Alignment Padding

All RDMA payloads must be 32-bit (4-byte) aligned. Messages whose size is not a multiple of 4 pay a 'Padding Tax' of 1 to 3 bytes (e.g., 3 bytes for a 2049-byte message), which consumes wire bandwidth but is not payload.
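Putting the fixed stack, the RETH, and the padding together, a simplified model of one RDMA WRITE might look like the sketch below. It assumes the 58-byte VLAN-tagged IPv4 stack from Section 1, a 4-byte ICRC and 4-byte FCS on every packet, a RETH on the first packet only, and `path_mtu` meaning the RDMA payload bytes per packet. Preamble and inter-frame gap are ignored, and `write_wire_bytes` is a hypothetical name.

```python
import math

FIXED_HEADERS = 18 + 20 + 8 + 12  # VLAN-tagged Ethernet + IPv4 + UDP + BTH = 58
RETH = 16                         # carried by the first packet of a WRITE
TRAILERS = 4 + 4                  # ICRC + Ethernet FCS


def write_wire_bytes(message_bytes: int, path_mtu: int) -> tuple[int, int]:
    """Return (packets, total wire bytes) for one RDMA WRITE (simplified model)."""
    packets = max(1, math.ceil(message_bytes / path_mtu))
    # Full packets carry a 4-byte-aligned path MTU, so only the last one needs padding.
    padding = (-message_bytes) % 4
    overhead = packets * (FIXED_HEADERS + TRAILERS) + RETH
    return packets, message_bytes + padding + overhead


print(write_wire_bytes(2049, 1024))  # (3, 2266)
print(write_wire_bytes(4096, 4096))  # (1, 4178)
```

Dividing the message size by the returned wire bytes gives a per-message efficiency figure under these assumptions.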

3. ICRC: The Guard Against Switch Corruption

Standard Ethernet relies on the FCS, which can be recalculated at every hop. RoCE v2 adds an Invariant CRC (ICRC) that is computed once by the sending NIC and verified only by the receiving NIC. This end-to-end check is essential for bit-level integrity in AI clusters.

Silent Bit-Flips

If a bit flips inside a switch's memory, the switch will calculate a 'valid' Ethernet FCS for the 'corrupted' data upon egress. The ICRC is end-to-end; it fails the packet at the receiver, preventing the model from training on garbage data.
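The failure mode can be reproduced in a few lines. This is a simplified sketch: `zlib.crc32` stands in for both checksums, and the checksum is taken over the payload alone, whereas the real ICRC also covers the headers with the fields that routers legitimately rewrite (such as the IP TTL) masked out. The payload bytes are invented.

```python
import zlib

payload = bytearray(b"gradient shard 0042")

# The sending NIC computes the end-to-end checksum once.
icrc = zlib.crc32(bytes(payload))

# A switch buffer flips one bit, then the egress port computes a fresh FCS
# over whatever bytes it now holds.
payload[3] ^= 0x01
egress_fcs = zlib.crc32(bytes(payload))

# At the receiver: the hop-by-hop FCS matches the corrupted bytes,
# while the end-to-end checksum does not.
fcs_ok = zlib.crc32(bytes(payload)) == egress_fcs
icrc_ok = zlib.crc32(bytes(payload)) == icrc
print(fcs_ok, icrc_ok)  # True False
```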

$$\text{Risk}_{\text{silent}} \approx \frac{N_{\text{hops}}}{2^{32}}$$

Serial Delay Tax

Header size determines serialization latency ($T_{ser}$). On 800Gbps links, the overhead bytes add a sub-nanosecond delay per hop, but in a multi-hop Clos fabric this delay accumulates and can impact sync operations.

$$T_{ser} = \frac{\text{Overhead Bits}}{800 \times 10^{9}\ \text{bit/s}}$$
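A minimal sketch of this formula, generalized to any link speed. The function name is illustrative, and the byte counts are the 58-byte fixed stack and the 74-byte stack with RETH from the sections above.

```python
def serialization_delay_ns(num_bytes: int, link_gbps: float) -> float:
    """Time to clock num_bytes onto the wire, in nanoseconds."""
    # bits divided by Gbit/s yields nanoseconds directly
    return num_bytes * 8 / link_gbps


print(serialization_delay_ns(58, 800))  # 0.58
print(serialization_delay_ns(74, 800))  # 0.74
```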

4. Industrial Forensics: Framing Strategies

Optimizing RoCE v2 requires deep knowledge of your switch ASIC's capabilities. Not all fabrics are created equal when handling RDMA.

MTU 4096 (4K)

The largest path MTU defined by the InfiniBand standard. Matches the common 4 KB memory page size. Enables 98.7% framing efficiency with the standard 54-byte RoCE header stack.

MTU 1500 (Legacy)

The Ethernet lowest common denominator. Avoid for AI training. The framing tax is too high (>3.5%) for expensive GPU interconnects.

Jumbo 9000 (TCO+)

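The most efficient for storage traffic (Weka/Lustre). Dilutes the header tax to <0.8% for traffic that can fill the frame, but can increase 'Head-of-Line' blocking on compute links. Note that the InfiniBand transport caps the RDMA path MTU at 4096 bytes, so for RoCE a 9000-byte Ethernet MTU mainly provides headroom for 4096-byte payloads plus headers.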

Frequently Asked Questions

Technical Standards & References

IBTA. InfiniBand Architecture Specification: Volume 1, RoCE v2 Annex.

NVIDIA Performance Engineering. NVIDIA: RoCE v2 Header Deconstruction & Implementation.

UEC Committee. Ultra Ethernet Consortium (UEC): Next-Gen Transport Requirements.

Microsoft Research. Silent Bit Flips in Data Center Fabrics: The ICRC Solution.

Mathematical models derived from standard engineering protocols. Not for human safety critical systems without redundant validation.

