FHRP: HSRP & VRRP Engineering Guide

The Virtualization of the Gateway

In a standard TCP/IP configuration, a host is assigned a static Default Gateway IP. If that router hardware fails, every host on the subnet loses all off-link connectivity — a fundamental single point of failure in the architecture. FHRP solves this by creating a Virtual IP (VIP) and a corresponding Virtual MAC (VMAC) that are shared between a group of physical routers. Hosts point to the VIP as their gateway; the FHRP protocol elects which physical router currently "owns" that VIP and responds to ARP queries with the VMAC.

Master Election

R1 (Pri 110) is the Master. If R1 fails, R2 (Pri 100) waits for the "Hold Timer" to expire before declaring itself Master.

Gratuitous ARP

Upon failover, R2 sends a gratuitous ARP (GARP). This updates the MAC tables of the Layer 2 switches so that frames addressed to the VMAC are steered to R2's physical port.

The State Machine Calculus

An FHRP router does not simply become "Active." It traverses a rigorous state machine to ensure no two routers claim the VIP simultaneously, which would cause a MAC address conflict and massive packet loss.

Learn & Listen

The router waits for hellos from the current Active. If it hears one, it learns the VIP and stays in the Listen state.

Speak

If no hellos are heard, the router starts sending its own hellos to challenge for the role, participating in an election.

Active / Master

The router wins the election, assumes the VIP, and begins responding to ARP requests using the VMAC.
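
The three stages above can be sketched as a small transition function. This is a deliberately simplified model, not the full HSRP or VRRP state machine: the names `State`, `next_state`, `heard_active_hello`, and `won_election` are illustrative, and real implementations have additional states (such as Standby) and timers.

```python
from enum import Enum


class State(Enum):
    LISTEN = "Listen"
    SPEAK = "Speak"
    ACTIVE = "Active"


def next_state(state, heard_active_hello, won_election=False):
    """Return the next state after one hold-timer period (simplified model)."""
    if heard_active_hello and state is not State.ACTIVE:
        # Another router already owns the VIP: fall back and keep listening.
        return State.LISTEN
    if state is State.LISTEN:
        # Silence from the Active: start sending hellos and contend.
        return State.SPEAK
    if state is State.SPEAK:
        # Remain in Speak until the election is decided in our favour.
        return State.ACTIVE if won_election else State.SPEAK
    # Simplification: an Active router never steps down in this sketch.
    return State.ACTIVE
```

The point of the ordering is that a router only reaches Active by passing through Speak, which gives every peer a chance to object before the VIP changes hands.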

Timer Optimization & Convergence Math

The speed of failover is governed by the Hold Timer (HSRP) or Master Down Interval (VRRP). In VRRP, the interval is calculated using a formula that includes the router's priority to prevent simultaneous transitions:

\text{Master Down Interval} = (3 \times \text{Advertisement Interval}) + \text{Skew Time}

\text{Skew Time} = \frac{256 - \text{Priority}}{256} \times \text{Advertisement Interval}

This "Skew Time" ensures that the router with the highest priority (lowest skew) transitions to Master slightly faster than its peers, minimizing the window of time where multiple routers might attempt to transition.
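
As a quick numeric check, the two formulas can be evaluated for the priorities used earlier (110 and 100). This is a minimal sketch: the function names are illustrative, and it assumes the Skew Time scales with the advertisement interval as in VRRPv3, which gives the same result as the formula above at the 1-second default.

```python
def skew_time(priority, advert_interval_s=1.0):
    """Skew Time in seconds: smaller for higher priorities."""
    return (256 - priority) / 256 * advert_interval_s


def master_down_interval(priority, advert_interval_s=1.0):
    """Seconds a Backup waits without advertisements before taking over."""
    return 3 * advert_interval_s + skew_time(priority, advert_interval_s)


for prio in (110, 100):
    print(prio, master_down_interval(prio))
```

With a 1-second interval, priority 110 yields about 3.570 s and priority 100 about 3.609 s, so the higher-priority Backup moves first by roughly 39 ms.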

HSRP vs. VRRP: A Comparative Analysis

While both protocols achieve the same gateway virtualization goal, their implementation, multicast groups, and terminology differ in important ways:

HSRP (Hot Standby Router Protocol): Cisco-proprietary; version 1 is documented in the informational RFC 2281. Uses 'Active' and 'Standby' roles. Hello messages are sent every 3 seconds by default with a 10-second hold timer, to 224.0.0.2 in HSRPv1 and to 224.0.0.102 in HSRPv2. HSRPv2 extends group numbers to 0-4095.
VRRP (Virtual Router Redundancy Protocol): Open standard (RFC 3768 for VRRPv2, RFC 5798 for VRRPv3). Uses 'Master' and 'Backup' roles. Advertisements are sent to 224.0.0.18 every 1 second by default, with a master-down interval of just over 3 seconds, giving inherently faster default failover than HSRP.

\text{Priority Range} = 0 \text{ to } 255

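The router with the highest configured priority becomes the Active/Master and owns the VIP. In the event of a priority tie, the router with the highest physical interface IP address wins the election. Default priority is 100 in both HSRP and VRRP. In VRRP the configurable range is 1 to 254: priority 255 is reserved for the router that owns the virtual IP address, and priority 0 is sent by a Master that is giving up the role.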
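
The election rule (highest priority, then highest interface IP) can be sketched in a few lines. The `Router` type and `elect` function are illustrative names, not part of any vendor API; addresses are compared numerically via the standard `ipaddress` module so that 192.168.1.10 correctly outranks 192.168.1.9.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Router:
    name: str
    priority: int
    interface_ip: str


def elect(routers):
    """Pick the winner: highest priority, ties broken by highest IP."""
    return max(
        routers,
        key=lambda r: (r.priority, ipaddress.ip_address(r.interface_ip)),
    )


r1 = Router("R1", 110, "192.168.1.2")
r2 = Router("R2", 100, "192.168.1.3")
print(elect([r1, r2]).name)
```

Comparing the addresses as strings would be a subtle bug here, which is why the sketch converts them before sorting.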

Virtual MAC Mechanics

To prevent hosts from needing to clear their ARP caches during a failover (which would cause a brief traffic interruption during ARP refresh), FHRP uses a pre-defined Virtual MAC address that never changes, regardless of which physical router currently holds the Active role:

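HSRPv1 VMAC: 0000.0c07.acXX (where XX is the group number in hex).
HSRPv2 VMAC: 0000.0c9f.fXXX (where XXX is the group number in hex).
VRRP VMAC: 0000.5e00.01XX (where XX is the VRID in hex).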

FHRP Packet Deconstruction

To understand how an FHRP failover works at the wire level, we must look at the encapsulation of the "Hello" or "Advertisement" packet. These frames carry the vital signs of the Active router.

HSRPv2 Frame Anatomy (UDP 1985)

Version (8-bit): 0x02

OpCode (8-bit): 0 = Hello, 1 = Coup, 2 = Resign

State (8-bit): 0 = Disabled, 1 = Init ... 6 = Active (HSRPv1 uses a different encoding, 0 = Initial ... 16 = Active)

Group (16-bit): 0 to 4095

Note: HSRPv2 uses a Type-Length-Value (TLV) format, which lets a hello carry optional elements such as authentication data (on Cisco IOS, MD5 key strings of up to 64 characters). If the authentication data does not match across all members, the routers reject each other's hellos and the group never converges, leading to multiple Active routers.

BFD: The Catalyst for Sub-Second Failover

Standard FHRP timers (3s/10s) are too slow for modern converged networks. While tuning them to 100ms/300ms is possible, it places a significant load on the CPU. **Bidirectional Forwarding Detection (BFD)** provides a lightweight, asynchronous heartbeat mechanism that can detect a link failure in under **50 milliseconds**.

\text{Total Failover Time} = \text{BFD Detection Time} + \text{FHRP State Transition}

By offloading failure detection to the BFD engine (often running in the hardware ASIC/NPU), the router can achieve carrier-grade failover speeds without risking control-plane instability.
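
The failover budget above can be worked through numerically. In BFD asynchronous mode the detection time is the negotiated transmit interval multiplied by the detect multiplier; the specific values below (15 ms, multiplier 3, and a 10 ms FHRP transition) are illustrative assumptions, not measurements from any platform.

```python
def bfd_detection_time_ms(tx_interval_ms, detect_multiplier):
    """Time without BFD packets before the session is declared down."""
    return tx_interval_ms * detect_multiplier


def total_failover_ms(tx_interval_ms, detect_multiplier, fhrp_transition_ms):
    """BFD detection time plus the FHRP state transition."""
    detection = bfd_detection_time_ms(tx_interval_ms, detect_multiplier)
    return detection + fhrp_transition_ms


# Assumed values: 15 ms interval x 3 = 45 ms detection, 10 ms transition.
print(total_failover_ms(15, 3, 10))
```

With these assumed numbers, detection stays under the 50 ms mark and the total comes to 55 ms; the transition term depends heavily on the platform.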

The Multicast Heartbeat & VMAC Mapping

FHRP relies on precise Layer 2 multicast mechanics. If the multicast frames are blocked or delayed, the standby routers will assume the Active has failed, leading to a "Split-Brain" scenario where multiple routers claim the VIP.

Virtual MAC Schema

Protocol	Multicast IP	Virtual MAC Base	Max Groups
HSRPv1	224.0.0.2 (UDP 1985)	0000.0C07.ACxx	256
HSRPv2	224.0.0.102 (UDP 1985)	0000.0C9F.Fxxx	4096
VRRPv2	224.0.0.18 (IP 112)	0000.5E00.01xx	255
GLBP	224.0.0.102 (UDP 3222)	0007.B4xx.xxxx	1024
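
The base addresses in the table can be turned into concrete VMACs by appending the group number in hex. The sketch below covers the HSRPv1, HSRPv2, and VRRP (IPv4) rows only; the function name and the protocol keys are illustrative.

```python
# prefix (hex digits), width of the group field, lowest and highest group
_SCHEMES = {
    "hsrpv1": ("00000c07ac", 2, 0, 255),
    "hsrpv2": ("00000c9ff", 3, 0, 4095),
    "vrrp": ("00005e0001", 2, 1, 255),
}


def virtual_mac(protocol, group):
    """Return the virtual MAC in dotted notation, e.g. 0000.0c07.ac0a."""
    prefix, width, low, high = _SCHEMES[protocol]
    if not low <= group <= high:
        raise ValueError(f"group {group} out of range for {protocol}")
    digits = prefix + format(group, f"0{width}x")
    return ".".join(digits[i:i + 4] for i in range(0, 12, 4))


print(virtual_mac("hsrpv1", 10))
print(virtual_mac("hsrpv2", 4095))
print(virtual_mac("vrrp", 1))
```

Because the VMAC is a pure function of protocol and group number, two unrelated groups with the same number on the same VLAN will collide, which is worth checking before reusing group IDs.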

GLBP: True Gateway Load Balancing

HSRP and VRRP are essentially "Active/Standby" protocols: within a group, only one router forwards traffic for the VIP. You can spread load by configuring multiple groups (Multi-Group HSRP) and pointing different hosts or VLANs at different VIPs, but that split is static and must be managed by hand. **GLBP (Gateway Load Balancing Protocol)** solves this by allowing a single group to utilize up to four physical routers simultaneously for traffic forwarding.

The AVG/AVF Architecture

Active Virtual Gateway (AVG)

The AVG is the control-plane brains of the operation. It answers all ARP requests for the VIP. Instead of always giving the same MAC address, it rotates through the Virtual MAC addresses of the available forwarders.

Active Virtual Forwarder (AVF)

Up to four routers per group are designated as AVFs. Each AVF is assigned a unique VMAC. When a host sends traffic to its assigned VMAC, that specific physical router handles the traffic.
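
The division of labour between AVG and AVFs can be sketched as a round-robin ARP responder. This is a toy model of the rotation only: the class name, method name, and the forwarder VMAC values are illustrative, and other GLBP load-balancing modes are not modelled.

```python
from itertools import cycle


class ActiveVirtualGateway:
    """Toy AVG: answers ARP for the VIP with rotating forwarder VMACs."""

    def __init__(self, forwarder_vmacs):
        if not 1 <= len(forwarder_vmacs) <= 4:
            raise ValueError("a group has between one and four forwarders")
        self._rotation = cycle(forwarder_vmacs)
        self.assignments = {}  # host IP -> VMAC handed out

    def answer_arp(self, host_ip):
        vmac = next(self._rotation)
        self.assignments[host_ip] = vmac
        return vmac


# Illustrative VMACs for three forwarders in one group.
avg = ActiveVirtualGateway(["0007.b400.0101", "0007.b400.0102", "0007.b400.0103"])
for host in ("10.0.0.11", "10.0.0.12", "10.0.0.13", "10.0.0.14"):
    print(host, avg.answer_arp(host))
```

The fourth host wraps around to the first forwarder, which shows why the balance is per host rather than per flow: a single heavy host still lands on one router.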

GLBP's primary advantage is that it provides automated load balancing without requiring the administrative overhead of managing multiple subnets or complex client-side configuration. However, it is a Cisco-proprietary protocol and relies heavily on correct ARP behavior from clients.

Design Patterns: FHRP vs. vPC and MLAG

In traditional designs, FHRP was the only way to provide gateway redundancy. However, the rise of **Multi-Chassis Link Aggregation (MLAG)** and technologies like Cisco's **vPC (virtual Port Channel)** or **StackWise Virtual** has changed the landscape.

The FHRP Approach

Each router keeps its own control plane and data plane; the FHRP only coordinates which of them owns the VIP. Standby routers do not forward traffic sent to the VIP (except in GLBP), leading to inefficient use of uplinks. Spanning Tree (STP) must still block one path to prevent loops.

The MLAG / vPC Approach

Creates a single logical switch across two physical chassis. This allows all uplinks to be Active/Forwarding simultaneously. In this world, FHRP is often configured in "Active-Active" mode where both switches forward traffic for the same VMAC locally.

Redundancy in the IPv6 Era

IPv6 introduces fundamental changes to first-hop redundancy. While IPv4 relies on ARP, IPv6 uses **Neighbor Discovery Protocol (NDP)** and **ICMPv6 Router Advertisements (RA)**. This impacts how FHRP protocols must function.

The VRRPv3 Paradigm

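VRRPv3 (RFC 5798) is the open standard for IPv6 gateway redundancy. Unlike VRRPv2, which is IPv4-only and specifies the advertisement interval in whole seconds, VRRPv3 supports both IPv4 and IPv6 in a single protocol framework and expresses the advertisement interval in centiseconds, which allows sub-second timers.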

IPv6 Multicast Address: FF02::12

Virtual MAC Range: 0000.5E00.02xx

Critical: In IPv6, the gateway is usually the **Link-Local Address (FE80::/10)**. FHRP protocols in IPv6 must ensure that they virtualize both the Global Unicast Address and the Link-Local address to ensure seamless failover for hosts using SLAAC.
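
A configuration check for this requirement can be sketched with the standard `ipaddress` module. RFC 5798 requires the first address of an IPv6 virtual router to be its link-local address; the function name and the sample addresses below are illustrative.

```python
import ipaddress


def first_address_is_link_local(virtual_addresses):
    """True if the list is non-empty and starts with an FE80::/10 address."""
    if not virtual_addresses:
        return False
    return ipaddress.ip_address(virtual_addresses[0]).is_link_local


print(first_address_is_link_local(["fe80::1", "2001:db8:1::1"]))
print(first_address_is_link_local(["2001:db8:1::1", "fe80::1"]))
```

A check like this is cheap to run in a pre-deployment lint step, before a misordered address list reaches a router.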

FHRP and Hardware Load Balancers

When deploying high-performance Application Delivery Controllers (ADCs) like F5 BIG-IP or Citrix ADC, the interaction with FHRP becomes a critical architectural decision. There are two primary deployment models: **Inline** and **SNAT-based**.

The Inline (Gateway) Model

In the inline model, the ADC itself acts as the default gateway for the server subnet. To provide redundancy for the ADC, it must participate in an FHRP (usually VRRP).

// The Traffic Flow

Client -> Core Switch -> ADC VIP -> Server -> ADC (Gateway) -> Core Switch

Failure to synchronize the FHRP state with the ADC's session table can lead to "Zombied Connections" — where a failover occurs but the new Master has no record of the existing TCP flows, forcing a reset.

FHRP Security: Authentication Forensics

Because FHRP protocols control the default gateway, they are a high-value target for **Man-in-the-Middle (MitM)** attacks. If an attacker can inject a high-priority hello packet into the network, they can become the "Active" gateway and intercept all outbound traffic.

HSRP MD5 Authentication

Unlike the default "plain-text" authentication (which is easily sniffed), **MD5 authentication** uses a secret key to generate a keyed hash for every hello packet. The key itself is never sent over the wire, so an attacker who lacks the key cannot forge a valid hello.

VRRPv3 Security (or lack thereof)

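Interestingly, VRRPv3 (RFC 5798) defines **no** authentication at all. The authentication methods of the original VRRP specification had already been dropped in RFC 3768, on the grounds that they did not stop an attacker on the local segment and provided a false sense of security while complicating the implementation. Protection is expected to come from outside the protocol, for example from 802.1X at the access layer.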

The Evolution: Anycast Gateways

In modern **EVPN-VXLAN** data center fabrics, traditional FHRP is becoming obsolete. Instead of electing one "Active" router among a pair, we use an **Anycast Gateway**.

Every Leaf switch in the fabric is configured with the **same** Virtual IP and the **same** Virtual MAC address for a given VLAN. When a host sends a packet to its gateway, it is handled by the immediate upstream Leaf, regardless of whether that Leaf is part of a pair.

- **Zero Failover Time:** There is no "Standby" router – every router is active.
- **Optimized Traffic Flow:** Traffic never crosses an inter-chassis link to reach a gateway; it is always routed locally at the first hop.
- **Infinite Scalability:** You can add as many Leaf switches as needed without complicating the redundancy logic.

While Anycast Gateways are the gold standard for data centers, HSRP and VRRP remain the workhorses of the Campus and Branch environments where the underlying fabric is still traditional Layer 2.

Forensic Troubleshooting: The Split-Brain

A "Split-Brain" or "Dual-Active" scenario is the most catastrophic failure in an FHRP deployment. It occurs when two routers lose their heartbeat link but their LAN interfaces remain active. Both routers assume the Active role, creating a MAC address conflict and causing intermittent packet loss.

Root Causes of Split-Brain

VLAN Pruning: If the FHRP control-plane VLAN is pruned from a trunk but the data VLAN is allowed, hellos stop but traffic remains, a recipe for disaster.
CPU Exhaustion: During a DoS attack, the CPU may be too busy to process FHRP hellos, causing a standby router to falsely assume a failure.
MTU Mismatch: If the MTU on the FHRP interface is smaller than the hello packet (rare but possible with GRE/IPsec), hellos may be fragmented and dropped.
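
Whatever the root cause, the symptom is the same: more than one router reports Active for a group. A monitoring check for this can be sketched as below, assuming the per-router states have already been collected (for example by polling each device) into `(group, router, state)` tuples; the function name and that input format are assumptions for illustration.

```python
from collections import defaultdict


def find_dual_active(observations):
    """Return {group: [routers]} for groups with more than one Active."""
    active = defaultdict(set)
    for group, router, state in observations:
        if state.lower() == "active":
            active[group].add(router)
    return {g: sorted(r) for g, r in active.items() if len(r) > 1}


polled = [
    (10, "R1", "Active"),
    (10, "R2", "Active"),   # split-brain: both claim group 10
    (20, "R1", "Active"),
    (20, "R2", "Standby"),
]
print(find_dual_active(polled))
```

Only group 10 is reported in this sample, since group 20 has a single Active router.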

The Scale of Redundancy: ASIC & TCAM

In a service provider or large enterprise environment, a single pair of core switches might manage 500+ VLANs, each with its own FHRP group. This creates a significant burden on both the CPU (for hellos) and the hardware ASIC (for VMAC forwarding).

TCAM and Entry Limits

The **TCAM (Ternary Content-Addressable Memory)** stores the lookup tables for Layer 2 and Layer 3 forwarding. Each VMAC used by HSRP/VRRP consumes a hardware entry.

CPU Interrupt Load

Processing hellos for 500 groups every 100ms could require 5,000 interrupts per second. On older supervisors, this can lead to "Control Plane Saturation," where the router becomes sluggish and starts dropping BGP or OSPF packets.
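
The arithmetic behind the 5,000 figure is simply groups divided by the hello interval. The helper below is an illustrative sketch that counts one hello per group per interval and ignores any batching a real platform might do.

```python
def hellos_per_second(groups, hello_interval_ms):
    """Hellos per second for one router sending one hello per group."""
    return groups * 1000 / hello_interval_ms


print(hellos_per_second(500, 100))    # aggressive 100 ms timers
print(hellos_per_second(500, 3000))   # default 3-second HSRP hello
```

At the default 3-second hello, the same 500 groups generate only about 167 hellos per second, a thirtyfold reduction compared with 100 ms timers.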

The Multi-Group Optimization

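Modern Cisco IOS-XE supports **HSRP multiple group optimization** (comparable grouping is available for VRRP on some platforms). You define one "Master" group that runs the full hello exchange and state machine, and then attach multiple client groups (historically called "Slave" groups) that follow the Master's state transitions using the standby follow command. Client groups take no part in the election and send only occasional hellos to keep their VMACs fresh in the switches' MAC tables, so hello processing scales with the number of Master groups rather than with the total number of groups.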

Case Study: The 'Hidden' MTU Failure

A global financial firm experienced intermittent HSRP failovers every 30 minutes. Forensic analysis revealed that a new security policy was generating 1500-byte ICMP Error packets. When HSRP hellos (usually small) were sent simultaneously with these large packets, the access switch's queue would overflow, dropping exactly ONE hello. Because their hold timer was tuned to 3 seconds (3 missed hellos), it took a specific burst pattern to trigger the failure. The solution was not to increase timers, but to implement **Control Plane Policing (CoPP)** to prioritize FHRP traffic over general ICMP.

Object Tracking and WAN-Awareness

A router can be 'healthy' internally — all interfaces up, FHRP hellos exchanged — but have a failed upstream WAN link, making it unable to reach the internet. Without WAN-awareness, FHRP would keep it as the Active gateway, silently blackholing all traffic. Advanced FHRP implementations use Object Tracking to solve this: if the tracked object (upstream interface, IP SLA reachability test) fails, the router automatically decrements its priority by a configured amount, triggering a graceful preemption to the peer that still has internet reach.

\text{Effective Priority} = \text{Configured Priority} - \text{Track Decrement}
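
The effect of a tracked failure can be sketched numerically using the priorities from earlier (R1 at 110, R2 at 100) and an assumed decrement of 20. The function names are illustrative, flooring the result at 0 is an assumption, and the takeover only happens if the peer has preemption enabled.

```python
def effective_priority(configured, track_decrement, tracked_object_up):
    """Configured priority, reduced while the tracked object is down."""
    if tracked_object_up:
        return configured
    return max(configured - track_decrement, 0)  # assumed floor of 0


def peer_takes_over(local_effective, peer_priority, peer_preempt=True):
    """A preempt-enabled peer takes over once it has the higher priority."""
    return peer_preempt and peer_priority > local_effective


r1 = effective_priority(110, 20, tracked_object_up=False)
print(r1, peer_takes_over(r1, 100))
```

The decrement must be larger than the priority gap between the routers (here, more than 10), otherwise the tracked failure changes nothing.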

Conclusion: The Architecture of Availability

First Hop Redundancy is the cornerstone of high-availability network design. Whether you are utilizing the venerable HSRP, the open-standard VRRP, or the sophisticated load-balancing powers of GLBP, the goal remains the same: the elimination of the single point of failure at the edge.

As we move toward more automated and distributed environments, the protocols we use to manage the gateway are evolving. The shift from Active/Standby to Anycast-based models reflects a broader trend in engineering: the move from reactive recovery to proactive availability.

In the modern network, redundancy is not just a feature — it is the baseline requirement. Precision tuning of timers, forensic understanding of multicast heartbeats, and awareness of the underlying hardware constraints are the marks of an elite infrastructure engineer.

FHRP Timer Convergence: The Mathematics of Hold Time, Hello Interval, and Skew Time

The failover time of any First Hop Redundancy Protocol is determined by the interaction of three timers: Hello Interval (how often the Active sends a hello), Hold Time (how long the Standby waits without hearing a hello before declaring the Active dead), and Skew Time (a small priority-derived offset, used by VRRP, that makes lower-priority backups wait slightly longer so that they do not all time out simultaneously). The mathematics governing these timer interactions determines whether a failover takes 3 seconds or 30 milliseconds.

HSRP Timer States

HSRP defines three timers: the Active Timer (which tracks hellos from the Active router; if it expires, the Standby becomes Active), the Standby Timer (which tracks hellos from the Standby router; if it expires, another router in the group can move up to take the Standby role), and the Hello Timer (which paces a router's own Hello messages, default 3 seconds). The Hold Time defaults to 10 seconds (3 × Hello plus a 1-second margin), meaning that after the Active router fails, up to 10 seconds can pass before the Standby takes over.

T_{\text{failover}} = T_{\text{hold}} + T_{\text{skew}} + T_{\text{hello-loss}} + T_{\text{preempt-delay}}

With defaults ( $T_{\text{hold}} = 10$ , $T_{\text{skew}} = 0$ for directly connected peers, $T_{\text{hello-loss}} = 0$ if the failure is detected via a Layer 1 carrier loss), the failover time is close to 10 seconds. Sub-second failover requires aggressive timer tuning: setting the Hello Interval to 100 ms and the Hold Time to 300 ms gives a maximum failover time of 300 ms plus the time required for the Standby to send a gratuitous ARP (one minimum-size Ethernet frame, which serializes in under a microsecond on a 1 Gbps link).
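
One detail worth making explicit is that the hold timer restarts on every received hello, so the delay measured from the moment of failure depends on when the last hello arrived. A minimal sketch of that window, with an illustrative function name and timers in milliseconds:

```python
def detection_window_ms(hello_ms, hold_ms):
    """Best and worst case delay between the failure and hold-timer expiry.

    Best case: the Active dies just before its next hello was due.
    Worst case: the Active dies immediately after sending a hello.
    """
    return hold_ms - hello_ms, hold_ms


print(detection_window_ms(3000, 10000))  # defaults
print(detection_window_ms(100, 300))     # tuned timers
```

With default timers the window is 7 to 10 seconds; with 100 ms and 300 ms timers it is 200 to 300 ms.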

VRRPv3 Skew Time and Priority Math

VRRPv3 (RFC 5798) uses a Skew Time formula that depends on the priority of the backup router. The Master sends advertisements every Advertisement Interval ( $\text{AdverInterval}$ , default 1 second). The Backup routers compute their Master Down Interval (MDI) as:

\text{MDI}_{\text{Backup}} = (3 \times \text{AdverInterval}) + \left(\frac{256 - \text{Priority}}{256}\right) \times \text{AdverInterval}

The second term is the Skew Time. A Backup with Priority 200 has a Skew Time of $((256-200)/256) \times 1 = 0.219 \, \text{s}$ , giving an MDI of $3 + 0.219 = 3.219 \, \text{s}$ . A Backup with Priority 100 has a Skew Time of $((256-100)/256) \times 1 = 0.609 \, \text{s}$ , giving an MDI of $3 + 0.609 = 3.609 \, \text{s}$ . The higher-priority Backup times out first, ensuring that the preferred standby takes over in a multi-backup topology. The Skew Time prevents the "Thundering Herd" problem where multiple Backups simultaneously assume the Master role.

Sub-Second Timer Risks: The Link Bounce Cascade

Setting the HSRP Hello Interval to 50 ms and Hold Time to 150 ms enables failover in under 200 ms. However, this aggressive timer setting creates a risk: if a flapping optical transceiver corrupts three consecutive hellos (a window of only 150 ms), the Standby declares a false failure and takes over as Active. When the original Active's hellos get through again, it preempts and reclaims the role, and the next burst of errors repeats the cycle. This cycle, the Link Bounce Cascade, causes continuous gateway flapping. The engineering solution is to implement BFD (Bidirectional Forwarding Detection) for link failure detection and use a more conservative FHRP Hold Time (500 ms to 1 second) as a "shock absorber" against transient bit errors. BFD detects the actual physical link failure in about 50 ms, while the longer Hold Time keeps short bursts of lost hellos from reaching the FHRP state machine as false-positive failures.

VRRPv3 for IPv6: The NDP Convergence and RA Intercept Problem

VRRPv3 was designed to support both IPv4 and IPv6 in a unified protocol specification. HSRP, by contrast, gained IPv6 support only in version 2 (HSRPv1 is IPv4-only). VRRPv3 uses a single packet format for both address families; the family of an advertisement is determined by whether it is carried in an IPv4 or an IPv6 packet. However, the IPv6 deployment of VRRP introduces unique challenges related to Neighbor Discovery Protocol (NDP) and Router Advertisement (RA) interception.

The IPv6 Virtual Router MAC

VRRPv3 for IPv6 uses a different Virtual MAC (VMAC) prefix from IPv4: $00:00:5E:00:02:\{VRID\}$ , compared with $00:00:5E:00:01:\{VRID\}$ for IPv4. The Virtual Router ID (VRID) occupies the last octet, allowing up to 255 VRRP groups per address family. However, IPv6 hosts do not use ARP; they use NDP. On taking over, the new Master must send an Unsolicited Neighbor Advertisement (UNA) for the Virtual IPv6 address (the VIP) with the Override (O) flag set to $1$ . This instructs all hosts on the segment to replace any existing NDP cache entry for the VIP with the VMAC. If the Master fails to send the UNA, stale neighbor and forwarding state can keep steering traffic for the VIP toward the failed router, and that traffic is black-holed until the state is refreshed (for hosts, the default NDP reachable time is 30 seconds).

The Router Advertisement Interception Mechanism

In an IPv6 network, hosts learn the default gateway address through Router Advertisements (RAs). The VRRPv3 Master takes over RA transmission for the virtual router: only the Master sends RAs with the Source Link-Layer Address option set to the VMAC. The Backup routers suppress their own RA transmission for the virtual router's prefix. This suppression is critical because if both Master and Backup sent RAs with different source MACs, hosts would create conflicting NDP entries.

The RA suppression is achieved through a State Machine Override: when a VRRPv3 router transitions to Backup state, it stops sending RAs for the VIP prefixes and sends a single Router Advertisement with Zero Lifetime to deprecate any existing entries pointing to the Backup's physical MAC. However, a known interoperability issue exists between VRRPv3 and radvd (the Router Advertisement Daemon) on Linux-based routers: if the radvd configuration advertises a prefix that overlaps with the VRRPv3 VIP, radvd and the VRRPv3 process may both send RAs, creating a "ping-pong" effect where hosts alternate between the two sources. The fix is to configure IPv6 RA Guard on the access-layer switch to filter RAs from non-VRRPv3 sources.

IPv6 Link-Local Address Role in VRRPv3

VRRPv3 uses the link-local address of the sending interface as the source IP for all IPv6 VRRP advertisements (with destination IP $FF02:0:0:0:0:0:0:12$ , the VRRP IPv6 multicast address). If the Master's link-local address changes (which should not normally happen, since link-local addresses are typically derived from the interface MAC and are stable), all peers must re-learn the source. For the same reason, the source is the router's own interface link-local address, not the virtual address associated with the VMAC. Because VRRPv3 carries no authentication field, a Backup cannot reject an advertisement on the basis of a key mismatch; the protocol's built-in protection is limited to checks such as requiring a Hop Limit of 255, which shows that the packet originated on the local link. Deployments that need cryptographic protection must supply it outside the protocol.
