Fabric Efficiency & MTU Modeler
Analyze the mathematical overhead of VXLAN encapsulation. Simulate fragmentation risks and visualize the MTU ladder for your underlay/overlay.
Packet Overhead Analysis
For TCP traffic traversing EVPN-VXLAN, the **MSS (Maximum Segment Size)** must be reduced to account for the encapsulation. If the absolute path MTU is 1500 bytes, the VXLAN overhead (typically 50 bytes) dictates a maximum IP payload of 1450 bytes. Subtracting the internal IPv4 and TCP headers (40 bytes), the ideal MSS should be set to **1410 bytes** to prevent performance-killing ICMP "Fragmentation Needed" events.
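The MSS arithmetic above can be sketched in a few lines. This is a minimal illustration, assuming an IPv4 underlay with no outer 802.1Q tag, IPv4 inside the tunnel, and no TCP options; the function name `clamped_mss` is hypothetical and not part of any vendor tool.

```python
# Header sizes in bytes. The 50-byte figure assumes an IPv4 underlay with no
# outer 802.1Q tag: inner Ethernet 14 + VXLAN 8 + UDP 8 + outer IPv4 20.
VXLAN_OVERHEAD_IPV4 = 50
INNER_IPV4_HEADER = 20
INNER_TCP_HEADER = 20  # assumes no TCP options


def clamped_mss(path_mtu: int, overhead: int = VXLAN_OVERHEAD_IPV4) -> int:
    """Return the largest TCP MSS that crosses the path without fragmentation."""
    inner_ip_mtu = path_mtu - overhead
    return inner_ip_mtu - INNER_IPV4_HEADER - INNER_TCP_HEADER


print(clamped_mss(1500))  # 1410
print(clamped_mss(9216))  # 9126
```

If TCP options such as timestamps are in use, the usable payload per segment is smaller than the MSS by the size of those options.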
In modern Leaf-Spine AI fabrics, implementing **Jumbo Frames (9000-9216 bytes)** on the underlay is mandatory. This provides sufficient "headroom" for nested encapsulation, multi-level VLAN tagging, and security headers while still allowing the standard 1500-byte client Ethernet frame to pass without fragmentation, significantly reducing CPU interrupts at the VTEP (Virtual Tunnel Endpoint).
1. The Encapsulation Equation: The VXLAN Byte Tax
VXLAN (Virtual eXtensible Local Area Network) encapsulates Layer 2 frames into Layer 3 UDP packets. This allows Ethernet segments to span across a routed L3 underlay.
Packet Overhead Calculus
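The result is a 50-byte tax over an IPv4 underlay and 70 bytes over IPv6, whose outer header is 20 bytes larger (add 4 bytes to either figure if an 802.1Q tag is counted). If your underlay is restricted to a standard 1500-byte MTU, a full 1500-byte guest packet no longer fits once encapsulated: it is either dropped or fragmented into two packets, which doubles the packet-per-second (PPS) count for that flow and can overload the destination CPU during reassembly.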
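The overhead figure can be rebuilt from the individual headers. The sketch below counts the bytes placed in front of the guest's IP packet, measured against the underlay IP MTU; it assumes an untagged inner frame, and the helper names are illustrative only.

```python
# Bytes added in front of the guest IP packet, counted against the underlay IP MTU.
OUTER_IP = {"ipv4": 20, "ipv6": 40}
OUTER_UDP = 8
VXLAN_HEADER = 8
INNER_ETHERNET = 14  # untagged inner frame; an 802.1Q tag would add 4 more


def vxlan_overhead(underlay: str = "ipv4") -> int:
    """Total encapsulation overhead in bytes for the given underlay family."""
    return OUTER_IP[underlay] + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET


def fits_underlay(guest_ip_packet: int, underlay_mtu: int, underlay: str = "ipv4") -> bool:
    """True if the encapsulated packet fits the underlay MTU without fragmentation."""
    return guest_ip_packet + vxlan_overhead(underlay) <= underlay_mtu


print(vxlan_overhead("ipv4"))     # 50
print(vxlan_overhead("ipv6"))     # 70
print(fits_underlay(1500, 1500))  # False
print(fits_underlay(1450, 1500))  # True
print(fits_underlay(1500, 9216))  # True
```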
2. IRB Architecture: Symmetric vs. Asymmetric
Integrated Routing and Bridging (IRB) defines how traffic moves between VNIs. Choosing the wrong model is a common cause of control-plane state bloat.
Symmetric IRB
Both the ingress and egress VTEPs route, and inter-subnet traffic crosses the fabric in a dedicated Transit VNI (L3VNI). Scales well: each Leaf only needs configuration for its locally attached VLANs.
Asymmetric IRB
Ingress Leaf routes; egress Leaf only bridges. Requires every Leaf to carry state for EVERY VNI. Not recommended for fabrics larger than 10-15 nodes.
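The difference in per-Leaf state can be made concrete with a rough count. This is a simplified sketch: it assumes one L3VNI per tenant VRF under symmetric IRB and that an asymmetric Leaf must hold every L2VNI in the fabric, as described above. The function name and the sample numbers are hypothetical.

```python
def vnis_per_leaf(local_l2_vnis: int, fabric_l2_vnis: int, tenant_vrfs_on_leaf: int) -> dict:
    """Rough count of VNIs a single Leaf must be configured with under each IRB model."""
    return {
        # Locally attached L2VNIs plus one transit L3VNI per VRF present on the Leaf.
        "symmetric": local_l2_vnis + tenant_vrfs_on_leaf,
        # Every L2VNI in the fabric, because the ingress Leaf bridges into the destination VNI.
        "asymmetric": fabric_l2_vnis,
    }


# Hypothetical fabric: 2000 L2VNIs in total, 20 of them local to this Leaf, 4 VRFs.
print(vnis_per_leaf(local_l2_vnis=20, fabric_l2_vnis=2000, tenant_vrfs_on_leaf=4))
# {'symmetric': 24, 'asymmetric': 2000}
```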
3. Route Type Forensics: The MP-BGP Core
EVPN differs from legacy (flood-and-learn) VXLAN by using MP-BGP to advertise reachability. Understanding the five primary Route Types (Types 1 through 4 from RFC 7432, plus Type 5 from RFC 9136) is critical for troubleshooting convergence.
Type-2: MAC/IP
The primary route for host reachability. Advertises both MAC and IP to enable ARP suppression at remote Leaf switches.
Type-1/4: ESI Logic
Ethernet Segment Identifiers enable multi-homing. Type-1 handles aliasing (ECMP), and Type-4 handles Designated Forwarder (DF) election.
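For quick reference while reading BGP tables, the five route types can be kept in a small lookup. The names and RFC numbers follow the specifications; the one-line role descriptions are informal summaries, and the `describe` helper is illustrative only.

```python
# EVPN route type -> (name, defining RFC, informal role summary)
ROUTE_TYPES = {
    1: ("Ethernet Auto-Discovery", "RFC 7432", "aliasing and fast withdrawal for multi-homed segments"),
    2: ("MAC/IP Advertisement", "RFC 7432", "host MAC and IP reachability"),
    3: ("Inclusive Multicast Ethernet Tag", "RFC 7432", "BUM flood-list discovery"),
    4: ("Ethernet Segment", "RFC 7432", "Designated Forwarder election"),
    5: ("IP Prefix", "RFC 9136", "inter-subnet prefix advertisement"),
}


def describe(route_type: int) -> str:
    """Return a one-line description of an EVPN route type."""
    name, rfc, role = ROUTE_TYPES[route_type]
    return f"Type-{route_type} {name} ({rfc}): {role}"


for rt in sorted(ROUTE_TYPES):
    print(describe(rt))
```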
4. Industrial Blueprint: Zero-Fragmentation Fabrics
Building a hyperscale fabric requires rigid adherence to MTU and QoS standards. This is the Gold Standard for AI and Public Cloud infrastructure.
Universal 9216B MTU
Enabled across all physical Spine and Leaf interfaces. Eliminates the '50-byte trap' and allows for stacked NSH/Geneve headers.
Symmetric IRB Gateway
Uses Transit VNIs (L3VNI) for all inter-subnet traffic. Minimizes the required MAC-table size in hardware ASICs.
DSCP-to-Outer QoS
Copy internal RoCEv2 markings to the outer IP header. Ensures Spines respect lossless priority queues during congestion.
EVPN Type-2/Type-5 Scaling: MAC/IP Advertisement and the Route Reflector Fan-Out Problem
EVPN defines multiple route types for different advertisement purposes: Type-2 (MAC/IP Advertisement) routes carry host MAC addresses with their associated IP addresses and VNI; Type-5 (IP Prefix) routes advertise L3 VPN prefixes across the fabric. The scaling challenge of EVPN lies in the number of Type-2 routes. Take a worst-case data center in which each VNI carries 50,000 VMs, each with one MAC and one IP: that is 50,000 Type-2 routes per VNI. With 4000 VNIs, the total route count reaches 50,000 × 4000 = 200 million, far more than a practical BGP RIB can hold. The EVPN "Massively Scalable Data Center" (MSDC) architecture addresses this with "MAC-less" designs in which the physical switches learn only gateway router MACs, while the individual VM MACs stay in the hypervisor's virtual switch. This cuts the EVPN Type-2 route count from 200 million to approximately 200,000 (one per physical ToR switch), reducing the BGP RIB requirement by three orders of magnitude.
The route reflector (RR) fan-out ratio determines the control plane convergence time. With a single RR handling 200 leaf switches, each leaf establishes an iBGP session to the RR, which reflects all EVPN routes to all other leaves. Each leaf must install all received routes into its BGP RIB for best-path selection, consuming approximately 400 bytes per route in memory plus TCAM entries for the installed FIB. A fabric with 200,000 routes therefore consumes 200,000 × 400 bytes = 80 MB of RIB memory per leaf and 16 GB of memory on the RR (which stores 200 copies, one per leaf adjacency, in a typical implementation). Route refresh and EOR (End-of-RIB) marking during RR restart or session reset can generate 200 × 200,000 = 40 million update messages, causing CPU exhaustion on the RR for up to 30 minutes. Route server aggregation, combined with keeping a single best path per prefix on the RR instead of one copy per adjacency, reduces the RR's memory requirement to 200,000 routes × 400 bytes = 80 MB. Note that BGP ADD-PATH (RFC 7911) does not help here: it lets a speaker advertise several paths per prefix, which increases stored state in exchange for path diversity.
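The route-count and memory figures in the last two paragraphs are plain multiplication, and reproducing them makes it easy to substitute your own fabric size. The sketch below uses only the numbers quoted above; the 400-byte per-route cost is the article's approximation, and the function names are hypothetical.

```python
BYTES_PER_ROUTE = 400  # approximate per-route RIB cost quoted above


def type2_routes(hosts_per_vni: int, vnis: int) -> int:
    """Worst-case Type-2 route count if every VNI carries the same host count."""
    return hosts_per_vni * vnis


def rib_bytes(routes: int, copies: int = 1) -> int:
    """RIB memory in bytes for a given route count and number of stored copies."""
    return routes * BYTES_PER_ROUTE * copies


worst_case = type2_routes(50_000, 4_000)
routes, leaves = 200_000, 200

print(f"worst-case Type-2 routes: {worst_case:,}")                               # 200,000,000
print(f"leaf RIB: {rib_bytes(routes) / 1e6:.0f} MB")                             # 80 MB
print(f"RR, one copy per adjacency: {rib_bytes(routes, leaves) / 1e9:.0f} GB")   # 16 GB
print(f"RR, single copy: {rib_bytes(routes) / 1e6:.0f} MB")                      # 80 MB
print(f"updates after RR restart: {routes * leaves:,}")                          # 40,000,000
```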
The Type-5 route processing overhead is dominated by the IP-VRF route target (RT) filtering complexity. Each Type-5 route carries an RT extended community that determines which VRFs should import the route. A typical data center with 200 tenants and 20 VRFs per tenant has 4000 RTs. Each BGP update must be filtered against the 4000 RT import policies, requiring O(N_RT × N_routes) comparisons in total. On an Arista 7280R3 with 200,000 IPv4 routes and 4000 RTs, each individual RT match is a hash lookup in O(1) time, but building the set of matching VRFs still means iterating through all 4000 RTs, costing 4000 lookups per route. At a BGP update rate of 10,000 routes per second, the route processing pipeline consumes 10,000 × 4000 = 40 million RT lookups per second, saturating a single CPU core and limiting the control plane convergence speed. Our analyzer models this using a closed-form cost model: T_convergence = N_routes × (T_lookup × N_RT + T_FIB_install), and reports the bottleneck in the BGP update pipeline for the specified fabric size.
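The convergence formula can be evaluated directly. In the sketch below, the per-lookup time (25 ns) and per-route FIB install time (10 µs) are placeholder assumptions chosen only to show the shape of the calculation; they are not measurements of any platform.

```python
def convergence_seconds(n_routes: int, n_rt: int, t_lookup_s: float, t_fib_install_s: float) -> float:
    """T_convergence = N_routes * (T_lookup * N_RT + T_FIB_install)."""
    return n_routes * (t_lookup_s * n_rt + t_fib_install_s)


def rt_lookups_per_second(update_rate_routes_per_s: int, n_rt: int) -> int:
    """RT lookups per second when every route is checked against every RT."""
    return update_rate_routes_per_s * n_rt


# Assumed timings (hypothetical): 25 ns per RT lookup, 10 microseconds per FIB install.
t = convergence_seconds(n_routes=200_000, n_rt=4_000, t_lookup_s=25e-9, t_fib_install_s=10e-6)
print(f"estimated convergence: {t:.1f} s")   # 22.0 s
print(rt_lookups_per_second(10_000, 4_000))  # 40000000
```

With these inputs the RT iteration term (100 µs per route) is ten times the FIB install term, which is why the article identifies RT filtering as the bottleneck.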
VXLAN Encapsulation Offload: SmartNIC and Switch ASIC Pipeline Acceleration
VXLAN encapsulation overhead, the 50-byte outer header (Ethernet + IP + UDP + VXLAN) that must be prepended to every frame, is traditionally handled by the software data plane running on the host CPU or by the ToR switch hardware ASIC. However, at line rates exceeding 400 Gbps, software-based VXLAN encapsulation becomes a bottleneck that consumes multiple CPU cores per 100 Gbps of throughput. The industry response has been to offload VXLAN encapsulation to specialized hardware: SmartNICs (NVIDIA BlueField, Intel IPU) that perform the full encapsulation/decapsulation pipeline in the NIC ASIC, and programmable switch ASICs (Intel Tofino, Broadcom Jericho) that perform VXLAN tunnel endpoint (VTEP) functions in the underlay switching fabric. Each offload approach has distinct performance, programmability, and cost trade-offs that directly impact the scalability and latency of EVPN-VXLAN fabrics.
The SmartNIC VXLAN offload pipeline operates as follows: when the host transmits a guest frame on a VXLAN-enabled virtual interface, the NIC hardware intercepts the frame at the PCIe DMA stage, looks up the destination VNI and MAC in an on-NIC forwarding table (populated by the host's virtual switch via the VXLAN offload interface), and executes the full encapsulation: the outer Ethernet header with the next-hop MAC toward the remote VTEP, the outer IP header with the remote VTEP IP, the UDP header with the VXLAN destination port (4789), and the VXLAN header with the 24-bit VNI. The NIC then fills in the outer checksums (the IPv4 header checksum and, where the deployment requires it, the outer UDP checksum, which RFC 7348 otherwise recommends sending as zero), computes the Ethernet FCS, and transmits the encapsulated frame on the wire. The entire encapsulation pipeline executes in hardware, consuming approximately 20-40 nanoseconds per 1500-byte frame, compared to 2-5 microseconds for software-based encapsulation using DPDK or OVS-DPDK. This reduction of roughly two orders of magnitude (50× to 250× from the figures above) in per-packet processing latency eliminates the CPU overhead of VXLAN tunnel termination, freeing host CPU cycles for application workloads.
However, the SmartNIC's offload table size is limited: BlueField-3 supports up to 4,096 VXLAN offload entries (VNI + destination VTEP combinations), which is sufficient for a typical leaf switch serving 32-128 rack endpoints but falls short for a spine switch that may need 16,384+ offload entries for a large multi-tenant fabric. When the offload table overflows, the NIC falls back to software encapsulation on BlueField's internal ARM cores, introducing a 5-10× latency penalty for the overflow entries.
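The innermost piece of that pipeline, the 8-byte VXLAN header, is simple enough to build by hand. The following sketch packs and parses it according to the RFC 7348 layout (8 flag bits, 24 reserved bits, 24-bit VNI, 8 reserved bits); it illustrates the wire format only and says nothing about how any particular NIC implements the step. The function names are hypothetical.

```python
import struct

VXLAN_UDP_PORT = 4789        # IANA-assigned destination port
VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag: VNI field is valid


def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a 24-bit VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + 24 reserved bits. Word 2: VNI (24 bits) + 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def parse_vni(header: bytes) -> int:
    """Extract the VNI from a VXLAN header, rejecting headers without the I flag."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag not set")
    return vni_word >> 8


hdr = vxlan_header(10010)
print(hdr.hex())       # 0800000000271a00
print(parse_vni(hdr))  # 10010
```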
The switch ASIC VXLAN offload pipeline pushes the VTEP function into the network fabric. A VXLAN-capable switch ASIC (Broadcom Tomahawk 5 or Jericho 3) provides hardware VXLAN tunnel termination: when a VXLAN-encapsulated frame arrives at a switch port, the ASIC strips the outer headers, looks up the inner destination MAC in the VRF-specific MAC table associated with the VNI, and performs the forwarding decision (bridge or route) on the inner frame. In the opposite direction, acting as the ingress VTEP, the switch ASIC performs VXLAN encapsulation, adding the outer headers before transmitting the frame toward the remote VTEP. This is the hardware VTEP architecture used in white-box and merchant-silicon EVPN-VXLAN solutions. The key performance constraint is the ASIC's VXLAN tunnel lookup rate: each VXLAN-encapsulated frame requires two TCAM lookups (one for the outer IP destination and VNI mapping, one for the inner MAC forwarding decision), consuming twice the TCAM bandwidth of a non-tunneled frame. At 800 Gbps line rate (approximately 66 million packets per second for 1500-byte frames), the TCAM lookup bandwidth requirement is 132 million lookups per second, close to the limit of current generation TCAM (Tomahawk 5 supports 160 million lookups/sec). This means that VXLAN encapsulation overhead reduces the effective packet processing capacity of the switch ASIC by up to 40% compared to native Ethernet forwarding, limiting the real-world throughput of VXLAN fabrics to approximately 60% of the switch's raw port bandwidth under worst-case packet sizes.
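The packet-rate and lookup-rate figures follow from the line rate and frame size. The sketch below repeats the calculation, ignoring preamble and inter-frame gap as the text does; the 160 million lookups/sec budget is the figure quoted above and is treated here as an input, not a verified specification. Unrounded, the results come out slightly above the rounded 66 and 132 million in the text.

```python
def packets_per_second(line_rate_bps: float, frame_bytes: int) -> float:
    """Packet rate at a given line rate, ignoring preamble and inter-frame gap."""
    return line_rate_bps / (frame_bytes * 8)


LOOKUPS_PER_VXLAN_FRAME = 2  # outer tunnel lookup + inner MAC lookup
TCAM_BUDGET = 160e6          # lookups per second, as quoted in the text

pps = packets_per_second(800e9, 1500)
lookups = LOOKUPS_PER_VXLAN_FRAME * pps

print(f"{pps / 1e6:.1f} Mpps")                           # 66.7 Mpps
print(f"{lookups / 1e6:.1f} M lookups/s")                # 133.3 M lookups/s
print(f"{lookups / TCAM_BUDGET:.1%} of the TCAM budget") # 83.3% of the TCAM budget
```

Smaller frames raise the packet rate for the same bit rate, so the lookup budget is exhausted sooner as the average packet size drops.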
The VXLAN-GPE (Generic Protocol Extension) and Geneve encapsulation standards extend the basic VXLAN header to support protocol chaining and variable-length metadata options, enabling network virtualization overlays that transport non-Ethernet payloads (e.g., NVMe-over-Fabrics frames, service function chaining headers). These extended encapsulation headers impose additional TCAM cost: each 8-byte option requires an additional TCAM entry for the lookup, and Geneve's variable-length option headers can fragment the switch ASIC's TCAM pipeline if the maximum option length exceeds the ASIC's fixed-width TCAM slot. The VXLAN encapsulation overhead analyzer includes an Offload Architecture Comparison mode where the user selects between SmartNIC-based, switch-ASIC-based, or hybrid (SmartNIC for leaf VTEP, switch ASIC for spine VTEP) offload architectures. The model reports the per-packet latency, TCAM utilization percentage, CPU core savings, and the maximum number of VXLAN tunnels supported at line rate for the selected architecture, enabling a data-driven choice of VXLAN offload strategy for the specific fabric scale and workload mix.
MP-BGP EVPN Route Target Constraint Optimization and Control Plane Convergence Scaling
In EVPN-VXLAN fabrics, the BGP route target (RT) extended community is the primary mechanism for controlling which VRFs import which EVPN routes. Each VRF is configured with an import RT list (the RT values that cause the VRF to accept a received EVPN route) and an export RT list (the RT values that the VRF attaches to its locally originated EVPN routes). The RT constraint (RFC 4684) optimization, also known as "RT filtering", has each BGP speaker advertise its import RTs to its peers, including the route reflectors (RRs), so that only routes with RT values that match at least one VRF's import RT are transmitted across the BGP session. Without RT constraint, the RR transmits all EVPN routes from all VRFs to every leaf switch, regardless of whether the leaf has a VRF that imports any of those RTs. In a multi-tenant fabric with 200 tenants, 20 VRFs per tenant, and 200 leaf switches, the RR transmits each EVPN route to all 200 leaves, even though each leaf hosts only 20-200 of the 4000 VRFs and imports only a small fraction of the routes. With RT constraint, the per-session route transmission scales with the ratio of the average VRF count per leaf to the total VRF count in the fabric, which in this example is 20 VRFs per leaf / 4000 VRFs = 0.005, a 99.5% reduction in BGP update load.
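The filtering rule itself is a set intersection: a route is sent to a peer only if it carries at least one RT the peer has advertised interest in. The sketch below is a toy model of that rule, not an implementation of RFC 4684; the route names and RT values are made up for illustration.

```python
def filter_for_peer(routes, peer_import_rts):
    """Keep only routes carrying at least one RT the peer imports."""
    return [name for name, rts in routes if rts & peer_import_rts]


def fraction_sent(vrfs_per_leaf: int, total_vrfs: int) -> float:
    """Share of routes still sent per session, assuming routes spread evenly across VRFs."""
    return vrfs_per_leaf / total_vrfs


# Hypothetical routes: (description, set of attached RTs)
routes = [
    ("tenant-a host", {"65000:100"}),
    ("tenant-b host", {"65000:200"}),
    ("shared default", {"65000:0"}),
]
leaf_imports = {"65000:100", "65000:0"}

print(filter_for_peer(routes, leaf_imports))  # ['tenant-a host', 'shared default']
print(fraction_sent(20, 4000))                # 0.005
```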
The RT constraint convergence time during a VRF add/remove event is the critical control plane scaling metric. When a new VRF is created on a leaf switch with import RT = 65000:100, the leaf must propagate an RT membership NLRI (Network Layer Reachability Information) to its RR, advertising that it now imports RT 65000:100. The RR must then re-evaluate all received EVPN routes against the updated RT membership set: any previously filtered route that carries RT 65000:100 must now be transmitted to the leaf. The number of routes that match the new RT depends on how many tenants share that RT. In a shared-services model where all tenants import a common "default-route" RT (e.g., RT 65000:0 representing the fabric gateway default route), a single VRF creation triggers the re-transmission of the default route to the new leaf, and possibly to existing leaves that also import RT 65000:0 if the new VRF exports a new route. The BGP update processing load on the RR during this re-evaluation is O(N_routes_per_RT x N_peers), where N_routes_per_RT is the number of routes carrying the newly imported RT and N_peers is the number of leaf switches. For a shared-services RT with 1,000 routes and 200 leaves, the RR must process 200,000 BGP updates during the convergence window — a load that can take 2-10 seconds on a commodity control plane CPU.
The RT constraint scalability limit is reached when the total number of distinct RT values in the fabric exceeds the RR's RT membership NLRI processing capacity. Each leaf switch must maintain an RT membership NLRI for each RT it imports, and the RR must store and process all RT membership NLRIs from all leaves. For a fabric with 4000 VRFs and an average of 3 import RTs per VRF, the total number of RT membership NLRIs is 4000 × 3 = 12,000. Each RT membership NLRI consumes approximately 100 bytes in BGP RIB memory, totaling 1.2 MB, which is negligible. The churn rate is the real problem: each VRF removal triggers a membership withdrawal and each addition triggers a membership advertisement, so the update rate can reach 100 RT changes per second during large-scale tenant provisioning (e.g., spinning up 500 new tenants across a multi-tenant cloud fabric). At 100 RT changes per second, each triggering a re-evaluation of the matching routes, the RR's CPU utilization for route processing can exceed 80%. Keepalives are then delayed, BGP hold timers expire, and sessions reset: a control plane avalanche scenario.
Our EVPN-VXLAN analyzer includes an RT Constraint Scalability Modeler that accepts the number of tenants, VRFs per tenant, shared-services RTs, routes per RT, and expected tenant provisioning rate, and computes the RR CPU utilization, the control plane convergence time, and the maximum safe VRF churn rate before the BGP sessions become unstable. The modeler recommends mitigation strategies including RT aggregation (collapsing multiple RTs into a smaller set of tenant-group RTs), RT constraint delegation to dedicated route server instances, and BGP update pacing with configurable inter-update delay (typically 10-50 ms between consecutive RT changes).
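The RT membership figures from this section can be reproduced with three small helpers. This is a back-of-the-envelope sketch using the article's own inputs (100 bytes per NLRI, 1,000 routes on a shared RT, 200 leaves); the batch of 1,000 RT changes used for the pacing example is a hypothetical value, and the function names are illustrative.

```python
def rt_membership_bytes(vrfs: int, import_rts_per_vrf: int, bytes_per_nlri: int = 100) -> int:
    """Memory held by RT membership NLRIs across the fabric."""
    return vrfs * import_rts_per_vrf * bytes_per_nlri


def reevaluation_updates(routes_per_rt: int, peers: int) -> int:
    """Updates the RR processes when an RT's routes are re-evaluated for every peer."""
    return routes_per_rt * peers


def paced_window_s(rt_changes: int, inter_update_delay_ms: float) -> float:
    """Time to push a batch of RT changes when pacing with a fixed delay."""
    return rt_changes * inter_update_delay_ms / 1000


print(rt_membership_bytes(4000, 3) / 1e6)  # 1.2 (MB)
print(reevaluation_updates(1_000, 200))    # 200000
# Hypothetical batch of 1,000 RT changes at the 10 ms and 50 ms pacing bounds:
print(paced_window_s(1_000, 10), paced_window_s(1_000, 50))  # 10.0 50.0
```

Pacing trades provisioning time for stability: the same batch that would arrive in about 10 seconds at 100 changes per second takes up to 50 seconds at the slower pacing bound.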
