VXLAN & Data Center Overlays

The Problem: VLAN Exhaustion and STP

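In a classic data center, Layer 2 domains are stretched horizontally across racks. If you want a Virtual Machine (VM) to move from Rack A to Rack B without changing its IP address, its VLAN must exist in both racks. Two problems follow. First, the 12-bit 802.1Q VLAN ID allows only 4,094 usable VLANs, which a large multi-tenant environment can exhaust. Second, stretching VLANs leads to "STP sprawl": the Layer 2 topology contains large loops, and the Spanning Tree Protocol (STP) blocks redundant links to prevent broadcast storms, which can leave as much as half of the installed bandwidth idle.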

The Solution: Layer 2 over Layer 3

VXLAN uses MAC-in-UDP encapsulation. The encapsulating device takes the original Ethernet frame and prepends an 8-byte VXLAN header, a UDP header, an IP header, and a new outer Ethernet header. Because the result is an ordinary IP/UDP packet, the underlay (the physical switches) can route it using OSPF, IS-IS, or BGP and spread flows across all physical links via ECMP (Equal-Cost Multi-Path).
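To make the encapsulation concrete, here is a minimal Python sketch of the 8-byte VXLAN header and of a source-port choice that gives the underlay something to hash on for ECMP. The function names are illustrative, and hashing only the inner Ethernet header with CRC32 is an assumption made for brevity; real VTEPs usually hash more of the inner flow in hardware.

```python
import struct
import zlib

VXLAN_UDP_PORT = 4789        # IANA-assigned UDP destination port for VXLAN
VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field carries a valid value


def vxlan_header(vni: int) -> bytes:
    """Return the 8-byte VXLAN header for a 24-bit VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Layout: flags (1 byte), reserved (3), VNI (3), reserved (1).
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def parse_vni(header: bytes) -> int:
    """Extract the VNI from the first 8 bytes of a VXLAN header."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    return vni_word >> 8


def entropy_source_port(inner_frame: bytes) -> int:
    """Pick a UDP source port in 49152-65535 from a hash of the inner frame.

    Assumption: only the 14-byte inner Ethernet header is hashed, so all
    traffic between one pair of MAC addresses stays on one underlay path.
    """
    digest = zlib.crc32(inner_frame[:14])
    return 49152 + digest % 16384
```

For example, `vxlan_header(5000)` yields the bytes `08 00 00 00 00 13 88 00`, and `parse_vni` recovers 5000 from them. The outer UDP, IP, and Ethernet headers would be prepended by the sending stack and are left out here.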

[Figure: VXLAN VTEP encapsulator, an X-ray view of a Layer 2 frame being wrapped for Layer 3 transport (header stack growth of +50 bytes). Phase 1, original frame: the VM sends a standard Layer 2 Ethernet frame. Key parameters shown: UDP port 4789; required MTU >= 1550 bytes; ECMP support via source UDP port hash; underlay: L3 backbone.]

Key Components: VTEPs and VNIs

VTEP (VXLAN Tunnel Endpoint): The device (usually a switch or server) that performs the encapsulation and de-encapsulation.
VNI (VXLAN Network Identifier): The 24-bit ID that designates which virtual network the traffic belongs to. 24 bits allow roughly 16 million segments, compared with 4,094 usable VLAN IDs.
The Underlay: The physical L3 network that moves the UDP packets.
The Overlay: The virtual L2 network seen by the servers.

Encapsulation Overhead

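Adding these headers increases the packet size by $50\,\text{bytes}$. Because the standard MTU is $1500\,\text{bytes}$, a full-size frame no longer fits once it is encapsulated. Unless the underlay MTU is raised to at least $1550\,\text{bytes}$ (or the overlay MTU is lowered to $1450\,\text{bytes}$), packets are fragmented or dropped, and throughput suffers badly.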
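The MTU arithmetic can be written down as a short Python sketch. It assumes an IPv4 underlay and untagged Ethernet on both the inner and outer frame; the constant and function names are illustrative.

```python
# Header sizes in bytes (IPv4 underlay, no 802.1Q tags assumed).
OUTER_ETHERNET = 14
OUTER_IPV4 = 20
OUTER_UDP = 8
VXLAN_HEADER = 8
INNER_ETHERNET = 14

# Bytes added to the frame on the wire.
FRAME_OVERHEAD = OUTER_ETHERNET + OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER

# MTU is measured at the IP layer. The outer IP packet carries the inner
# Ethernet header as payload, so the IP-level growth is also 50 bytes.
IP_MTU_OVERHEAD = OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET


def required_underlay_mtu(overlay_mtu: int) -> int:
    """Smallest underlay MTU that carries overlay packets unfragmented."""
    return overlay_mtu + IP_MTU_OVERHEAD


def max_overlay_mtu(underlay_mtu: int) -> int:
    """Largest overlay MTU that fits within a given underlay MTU."""
    return underlay_mtu - IP_MTU_OVERHEAD
```

With these definitions, `required_underlay_mtu(1500)` is 1550 and `max_overlay_mtu(1500)` is 1450; with a 9000-byte jumbo underlay, the overlay can use up to 8950 bytes.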

Conclusion

VXLAN is one of the foundational standards of modern cloud networking. It decouples the virtual network from the physical hardware, allowing operators to build complex topologies that span an entire data center or even multiple geographical regions.

Engineering Knowledge Expansion


VXLAN Control Plane: BGP-EVPN

VXLAN provides the data plane (encapsulation and forwarding), but without a control plane to distribute MAC-to-VTEP mappings, the network is limited to static configuration or flood-and-learn. BGP-EVPN (defined in RFC 7432 and applied to VXLAN overlays in RFC 8365) is the control plane that makes VXLAN dynamic and scalable to tens of thousands of overlay networks.

In a VXLAN-BGP-EVPN fabric, each VTEP (VXLAN Tunnel Endpoint) typically runs an MP-BGP session with one or more route reflectors. The VTEP advertises its locally learned MAC addresses, and optionally the host IP bound to each MAC, as EVPN NLRI (Network Layer Reachability Information) Type-2 (MAC/IP Advertisement) routes; IP prefixes are carried separately in Type-5 routes. A Type-2 route contains the MAC address, the IP address, the VNI (VXLAN Network Identifier), and, crucially, the advertising VTEP's tunnel endpoint IP as the BGP next hop. When another VTEP receives this route, it installs the MAC-to-VTEP mapping into its forwarding table, so it can tunnel traffic directly to the correct leaf switch without flooding to discover it.
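The effect of receiving such a route can be sketched as a small Python model. This is not a BGP implementation: the `MacIpRoute` and `VtepTable` names, and the choice to key the table by (VNI, MAC), are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class MacIpRoute:
    """Simplified stand-in for an EVPN Type-2 (MAC/IP Advertisement) route."""
    mac: str
    ip: Optional[str]   # the host IP is optional in a Type-2 route
    vni: int
    next_hop_vtep: str  # tunnel endpoint IP of the advertising VTEP


class VtepTable:
    """Remote MAC table of one VTEP, keyed by (VNI, MAC)."""

    def __init__(self, local_vtep_ip: str) -> None:
        self.local_vtep_ip = local_vtep_ip
        self.remote_macs: Dict[Tuple[int, str], str] = {}

    def install(self, route: MacIpRoute) -> None:
        # Ignore our own advertisements if they are reflected back to us.
        if route.next_hop_vtep == self.local_vtep_ip:
            return
        self.remote_macs[(route.vni, route.mac.lower())] = route.next_hop_vtep

    def withdraw(self, route: MacIpRoute) -> None:
        self.remote_macs.pop((route.vni, route.mac.lower()), None)

    def lookup(self, vni: int, mac: str) -> Optional[str]:
        """Return the remote VTEP for a MAC, or None if it is unknown."""
        return self.remote_macs.get((vni, mac.lower()))
```

A lookup that returns `None` corresponds to unknown unicast, which the VTEP would have to flood as BUM traffic.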

A key metric of an EVPN control plane is **Route Convergence Time**: how quickly all VTEPs learn about a new endpoint after it appears. At scale, with 10,000+ MAC addresses across 100+ leaf switches, BGP update churn can become the bottleneck. Modern implementations use **Route Target Constraint (RTC)** filtering (RFC 4684), where each leaf advertises which route targets it imports, so it receives EVPN routes only for the VNIs configured on that leaf. Depending on how sparsely VNIs are spread across leaves, this can shrink the route table on each VTEP by as much as 90% and keeps the leaf's control-plane CPU from being overwhelmed by irrelevant MAC advertisements.
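The filtering idea can be illustrated with a few lines of Python. This models only the outcome (which routes a route reflector would send to a given leaf), not the BGP machinery; the route-target strings and the tuple layout are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# A route is modelled as (route_target, mac). Values are illustrative only.
Route = Tuple[str, str]


def routes_for_leaf(all_routes: Iterable[Route],
                    subscribed_targets: Set[str]) -> List[Route]:
    """Return only the routes whose route target the leaf has subscribed to."""
    return [route for route in all_routes if route[0] in subscribed_targets]


def table_reduction(all_routes: List[Route],
                    subscribed_targets: Set[str]) -> float:
    """Fraction of routes the leaf no longer has to store."""
    if not all_routes:
        return 0.0
    kept = len(routes_for_leaf(all_routes, subscribed_targets))
    return 1.0 - kept / len(all_routes)
```

If a fabric carries ten equally sized VNIs and a leaf hosts endpoints in only one of them, `table_reduction` returns 0.9 for that leaf.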

The EVPN control plane also enables **Active-Active Multi-homing** using an ESI (Ethernet Segment Identifier), where a tenant endpoint is dual-attached to two different leaf switches. The leaves use a Designated Forwarder (DF) election to ensure that only one of them forwards BUM (Broadcast, Unknown Unicast, Multicast) traffic onto the shared segment at a time, preventing packet duplication. Failover time depends on the implementation: RFC 7432 specifies a default DF election timer of 3 seconds, so the sub-100 ms failover often quoted for stateful workloads relies on fast failure detection and mass withdrawal of the failed leaf's routes rather than on the election itself.
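As an illustration of how a DF can be chosen deterministically, the sketch below follows the default modulo-based procedure described in RFC 7432: order the candidate leaves by IP address and pick the one whose position equals the tag modulo the number of candidates. Using the VNI as the tag is an assumption here, and implementations may use other election algorithms.

```python
import ipaddress
from typing import Iterable


def elect_df(vtep_ips: Iterable[str], ethernet_tag: int) -> str:
    """Return the Designated Forwarder for one tag on one Ethernet segment.

    Candidates are sorted by numeric IP value and numbered from 0; the DF
    is the candidate at position (ethernet_tag mod number_of_candidates).
    """
    ordered = sorted(vtep_ips, key=lambda ip: int(ipaddress.ip_address(ip)))
    if not ordered:
        raise ValueError("at least one candidate VTEP is required")
    return ordered[ethernet_tag % len(ordered)]
```

With candidates `10.0.0.1` and `10.0.0.2`, even tags elect `10.0.0.1` and odd tags elect `10.0.0.2`, which spreads DF duty across the pair. If `10.0.0.1` fails and is removed from the list, every tag re-elects `10.0.0.2`.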

Multicast Groups and BUM Traffic in VXLAN Fabrics

VXLAN overlays rely on the underlay network to handle Broadcast, Unknown Unicast, and Multicast (BUM) traffic. In the original VXLAN specification (RFC 7348), BUM traffic was handled using IP multicast groups in the underlay. Each VXLAN Network Identifier (VNI) is mapped to a specific multicast group address in the underlay, and VTEPs join these groups using IGMP or PIM-SM to receive broadcast and multicast frames for that VNI.

The multicast-based approach scales poorly beyond a few hundred VNIs because the underlay must hold multicast state for every group in use, and hardware multicast tables are limited. Operators often respond by mapping several VNIs to one group, which creates a different problem: when a spine switch receives a multicast packet, it replicates the packet to every leaf switch that has joined the group, whether or not that leaf has an endpoint in the originating VNI. This replication overhead creates a "BUM Tax" that grows linearly with the number of leaf switches in the fabric. In a Clos topology with 256 leaf switches sharing a group, a single broadcast ARP request from a tenant VM can be replicated to all 255 other leaves, consuming spine link bandwidth even toward leaf switches that have no active endpoints for that tenant VNI.

Modern VXLAN fabrics have largely moved away from multicast-based BUM handling toward Head-End Replication (HER). In the HER model, the ingress VTEP (the leaf switch connected to the source server) is responsible for replicating the BUM packet to every remote VTEP that has an active endpoint in the same VNI. The underlay network only sees unicast VXLAN-encapsulated packets between VTEPs, eliminating the need for multicast routing in the underlay. The tradeoff is that HER shifts the replication burden from the network switches to the ingress VTEP's CPU and ASIC, which must maintain a list of all remote VTEPs for each VNI and generate a separate encapsulated packet for each destination.
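The per-VNI list that the ingress VTEP maintains can be sketched as follows. The `FloodList` class and its method names are hypothetical, and the "encapsulation" is reduced to pairing the frame with a destination VTEP address.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple


class FloodList:
    """Per-VNI list of remote VTEPs used for head-end replication."""

    def __init__(self, local_vtep: str) -> None:
        self.local_vtep = local_vtep
        self.members: DefaultDict[int, Set[str]] = defaultdict(set)

    def add_remote(self, vni: int, vtep_ip: str) -> None:
        if vtep_ip != self.local_vtep:   # never replicate to ourselves
            self.members[vni].add(vtep_ip)

    def remove_remote(self, vni: int, vtep_ip: str) -> None:
        self.members[vni].discard(vtep_ip)

    def replicate(self, vni: int, frame: bytes) -> List[Tuple[str, bytes]]:
        """Return one (destination VTEP, frame) unicast copy per remote VTEP."""
        return [(vtep, frame) for vtep in sorted(self.members[vni])]
```

One BUM frame entering a VNI with N remote VTEPs leaves the ingress VTEP as N unicast copies, which is the cost discussed next.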

HER performance depends heavily on the VTEP's ability to replicate packets efficiently in hardware. Switch ASICs such as Broadcom Jericho2 and Tomahawk4 include dedicated replication engines; a figure of up to 64 copies of a packet per pipeline pass is cited, which would sustain line-rate replication for fabrics of up to 64 leaf switches. Larger fabrics may need hierarchical replication, where the spine switches assist with a second stage. The scaling limit of HER is determined by the product of the BUM packet rate and the number of remote VTEPs per VNI: at 100,000 BUM packets per second with 128 remote VTEPs, the ingress VTEP must generate 12.8 million encapsulated packets per second just for BUM traffic, a load estimated at up to 40% of a Tomahawk4 ASIC's packet processing capacity.
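The scaling product is simple enough to check directly. In this sketch the function names are illustrative, and the 114-byte copy size used in the usage note is an example value (a 64-byte frame plus 50 bytes of encapsulation), not a figure from a measurement.

```python
def her_egress_pps(bum_pps: int, remote_vteps: int) -> int:
    """Encapsulated packets per second the ingress VTEP must emit."""
    return bum_pps * remote_vteps


def her_egress_bps(bum_pps: int, remote_vteps: int,
                   wire_bytes_per_copy: int) -> int:
    """Bits per second of replicated BUM traffic leaving the ingress VTEP."""
    return her_egress_pps(bum_pps, remote_vteps) * wire_bytes_per_copy * 8
```

`her_egress_pps(100_000, 128)` gives 12,800,000 packets per second. If each copy is 114 bytes, `her_egress_bps(100_000, 128, 114)` comes to about 11.7 Gbps of replicated traffic.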

ARP suppression is the primary optimization technique to reduce BUM traffic in VXLAN fabrics. When a VTEP learns the MAC-to-IP binding of an endpoint (either through data-plane snooping or through the EVPN control plane), it can respond to ARP requests locally without flooding the request to remote VTEPs. The EVPN Type-2 route (MAC/IP Advertisement) carries both the MAC address and the IP address of each endpoint, enabling every VTEP to build a complete ARP table for all active endpoints in each VNI. Because ARP requests are the dominant source of broadcast traffic in most virtualized environments, full ARP suppression typically reduces BUM traffic in a data center VXLAN fabric by 95% or more.
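The decision a VTEP makes for each ARP request can be sketched in Python. The class name, the string action labels, and the (VNI, IP) key are assumptions made for illustration; a real VTEP would also age entries and build the actual ARP reply frame.

```python
from typing import Dict, Optional, Tuple


class ArpSuppressionCache:
    """IP-to-MAC bindings per VNI, learned from Type-2 routes or snooping."""

    def __init__(self) -> None:
        self.bindings: Dict[Tuple[int, str], str] = {}

    def learn(self, vni: int, ip: str, mac: str) -> None:
        self.bindings[(vni, ip)] = mac

    def forget(self, vni: int, ip: str) -> None:
        self.bindings.pop((vni, ip), None)

    def handle_arp_request(self, vni: int,
                           target_ip: str) -> Tuple[str, Optional[str]]:
        """Return ("reply", mac) if answerable locally, else ("flood", None)."""
        mac = self.bindings.get((vni, target_ip))
        if mac is not None:
            return ("reply", mac)
        return ("flood", None)
```

Only requests for unknown targets fall through to `"flood"`, so the share of ARP traffic that still becomes BUM depends on how complete the cache is.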

VXLAN Performance: Throughput and Latency Under Encapsulation

VXLAN encapsulation imposes a 50-byte overhead on every packet (14 bytes outer Ethernet + 20 bytes outer IP + 8 bytes outer UDP + 8 bytes VXLAN header). This overhead has direct consequences on effective throughput, path MTU discovery, and CPU utilization in software VTEP implementations. Understanding the performance characteristics of VXLAN is essential for capacity planning and for diagnosing throughput anomalies in overlay networks.

The throughput efficiency of VXLAN can be expressed as the ratio of useful bytes to total wire bytes. For a full-size 1500-byte IP packet carrying a TCP payload of 1460 bytes (after IP and TCP headers), the VXLAN-encapsulated outer IP packet is 1550 bytes, so the payload efficiency is 1460/1550 = 94.2%. Efficiency drops significantly for smaller packets: a minimum-size 64-byte frame, such as a TCP ACK, becomes 114 bytes on the wire after VXLAN encapsulation, so the original frame accounts for only 64/114 = 56.1% of the bytes sent. (The two ratios are not measured identically: the first counts TCP payload only, the second counts the whole original frame.) In environments with a high proportion of small packets (such as financial trading or Redis workloads), the effective throughput of the overlay can therefore fall to little more than half of the underlay capacity, even though the packet-per-second rate is the same.
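The two ratios quoted above can be reproduced with a short Python sketch. The function names are illustrative, and the 50-byte constant assumes an IPv4 underlay without VLAN tags.

```python
VXLAN_OVERHEAD = 50  # bytes added by VXLAN encapsulation (IPv4 underlay)


def encapsulated_size(original_bytes: int) -> int:
    """Size of a packet or frame after VXLAN encapsulation."""
    return original_bytes + VXLAN_OVERHEAD


def efficiency(useful_bytes: int, wire_bytes: int) -> float:
    """Fraction of the transmitted bytes that are useful."""
    return useful_bytes / wire_bytes


# 1460 bytes of TCP payload inside a 1500-byte packet, 1550 bytes once wrapped.
large = efficiency(1460, encapsulated_size(1500))

# A 64-byte frame counted whole against its 114-byte encapsulated form.
small = efficiency(64, encapsulated_size(64))
```

Here `large` evaluates to about 0.942 and `small` to about 0.561, matching the 94.2% and 56.1% figures.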

MTU mismatch is the most common source of VXLAN performance issues. If the underlay network uses the standard 1500-byte MTU and a server sends a 1500-byte packet into the overlay, the VTEP adds 50 bytes of encapsulation, creating a 1550-byte packet that exceeds the underlay MTU. The packet then has to be either fragmented or dropped. RFC 7348 states that VTEPs must not fragment VXLAN packets; an intermediate router may fragment them, but the destination VTEP is allowed to discard fragments silently, and reassembly adds CPU overhead where it is supported at all. The alternative is to drop the packet and send an ICMP Fragmentation Needed message (Type 3, Code 4) back to the sender. That approach relies on Path MTU Discovery (PMTUD) to reduce the TCP MSS on the endpoint, but PMTUD often fails because firewalls filter ICMP traffic, resulting in silent packet drops and TCP connection stalls.
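The oversize-packet decision can be sketched as a small Python function. The action labels are made up for illustration, and the sketch assumes an IPv4 underlay and TCP segments without options; it models the outcome only, not any particular VTEP's behaviour.

```python
from typing import Optional, Tuple

VXLAN_IP_OVERHEAD = 50  # growth of the IP packet after encapsulation
IPV4_HEADER = 20
TCP_HEADER = 20         # without TCP options


def classify(inner_packet_len: int, underlay_mtu: int,
             dont_fragment: bool) -> Tuple[str, Optional[int]]:
    """Decide what happens to an overlay packet of the given IP length.

    Returns (action, advertised_mtu). The MTU is set only when an ICMP
    Fragmentation Needed (Type 3, Code 4) message would be generated.
    """
    if inner_packet_len + VXLAN_IP_OVERHEAD <= underlay_mtu:
        return ("forward", None)
    if dont_fragment:
        return ("icmp_frag_needed", underlay_mtu - VXLAN_IP_OVERHEAD)
    return ("fragment_or_drop", None)


def clamped_mss(underlay_mtu: int) -> int:
    """TCP MSS that keeps full-size segments within the underlay MTU."""
    return underlay_mtu - VXLAN_IP_OVERHEAD - IPV4_HEADER - TCP_HEADER
```

For a 1500-byte underlay, `classify(1500, 1500, True)` returns `("icmp_frag_needed", 1450)` and `clamped_mss(1500)` returns 1410. Clamping the MSS is one way to avoid depending on ICMP delivery.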

Software VTEP implementations (such as Open vSwitch or the Linux VXLAN kernel module) face additional performance constraints. In the Linux kernel, VXLAN encapsulation and decapsulation occur in the kernel networking stack, which involves a full traversal of the transmit and receive paths including checksum computation, skb allocation, and netfilter hook processing. Benchmarks show that a single CPU core in a modern x86 server can sustain approximately 1.2 million VXLAN-encapsulated packets per second for 64-byte packets, dropping to approximately 800,000 PPS for packets whose checksums must be computed by the CPU. For a 10Gbps link running at line rate (approximately 14.88 million PPS for 64-byte packets), this means that software VXLAN processing requires 12-18 CPU cores just for encapsulation overhead, severely limiting the server's ability to run application workloads. A 40Gbps link, at roughly 59.5 million PPS, would need about four times as many.
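A rough sizing calculation shows where the core counts come from. On the wire, each Ethernet frame also occupies 20 bytes of preamble and inter-frame gap, so a 64-byte frame takes 84 byte-times. The per-core rates passed in below are the benchmark figures quoted in the text; the function names are illustrative.

```python
import math

ETHERNET_WIRE_OVERHEAD = 20  # 8 bytes preamble/SFD + 12 bytes inter-frame gap


def line_rate_pps(link_bps: int, frame_bytes: int = 64) -> float:
    """Maximum frames per second on an Ethernet link for a given frame size."""
    return link_bps / ((frame_bytes + ETHERNET_WIRE_OVERHEAD) * 8)


def cores_needed(link_bps: int, per_core_pps: int,
                 frame_bytes: int = 64) -> int:
    """Whole CPU cores required to process a link at line rate."""
    return math.ceil(line_rate_pps(link_bps, frame_bytes) / per_core_pps)
```

`line_rate_pps(10_000_000_000)` is about 14.88 million and `line_rate_pps(40_000_000_000)` about 59.52 million. At 1.2 million PPS per core a 10 Gbps link needs `cores_needed(10_000_000_000, 1_200_000)` = 13 cores, and at 800,000 PPS per core it needs 19.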

Hardware offload of VXLAN encapsulation (usually called VXLAN offload; NVGRE offload is the equivalent feature for the separate NVGRE encapsulation) moves the encapsulation processing to the NIC's hardware. Modern NICs (such as NVIDIA ConnectX-7 or Intel E810) implement VXLAN-aware offloads in the ASIC, handling checksum offload, TSO (TCP Segmentation Offload), and RSS (Receive Side Scaling) for encapsulated traffic directly in hardware, and in some configurations the encapsulation itself. With full hardware offload enabled, the CPU sees only the inner packet, and the VXLAN header processing adds less than 1 microsecond of latency while consuming almost no host CPU cycles. In testing, a server with VXLAN hardware offload achieves 95% of the raw wire throughput compared to the non-VXLAN baseline, with the 5% loss attributed entirely to the 50-byte encapsulation overhead reducing the effective payload per frame.