Network Telemetry vs. SNMP
The Operational Divide in Modern Network Observability
The SNMP Architecture: A Polling-Based Legacy
SNMP was designed in 1988 (RFC 1067) when networks were small, devices were few, and management stations were powerful enough to query everything. Its architecture is fundamentally a request/response (pull) model: the Network Management System (NMS) periodically polls each device, requesting specific OIDs (Object Identifiers) from the device's MIB (Management Information Base).
The gRPC Transport Hydraulics
To understand why modern telemetry is superior, we must look at the **gRPC** transport layer (gRPC is Google's open-source Remote Procedure Call framework). Unlike SNMP, which typically runs over connectionless UDP with no delivery guarantee, gRPC is built on **HTTP/2** over **TCP**.
Binary Serialization (Protobuf)
While SNMP uses ASN.1 BER (Basic Encoding Rules), a tag-length-value encoding that is comparatively expensive to parse, gRPC uses **Protocol Buffers (Protobuf)**. Protobuf is a compact binary format, commonly cited as 3x to 10x smaller than equivalent XML/JSON and significantly faster to serialize/deserialize, which reduces the CPU burden on the device's control-plane processor.
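To make the size argument concrete, here is a minimal sketch of Protobuf's base-128 varint encoding for unsigned integers, compared with the same counter rendered as JSON text. It encodes only the integer payload (a real Protobuf message also carries a field tag), and the `in-octets` key and the sample counter value are illustrative assumptions.

```python
import json


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a Protobuf base-128 varint."""
    if n < 0:
        raise ValueError("this sketch handles unsigned values only")
    out = bytearray()
    while True:
        low7 = n & 0x7F              # take the lowest 7 bits
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # more bytes follow: set the continuation bit
        else:
            out.append(low7)         # last byte: continuation bit stays clear
            return bytes(out)


if __name__ == "__main__":
    counter = 1_234_567_890  # illustrative counter value (assumption)
    as_varint = encode_varint(counter)
    as_json = json.dumps({"in-octets": counter}).encode()
    print(len(as_varint), "bytes as varint payload")
    print(len(as_json), "bytes as JSON text")
```

The comparison is deliberately narrow: it shows why numeric counters shrink in a binary encoding, not a full benchmark of Protobuf against BER or JSON.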
HTTP/2 Multiplexing
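SNMP requires a new UDP request for every set of OIDs (or a series of requests for large tables). gRPC leverages HTTP/2 multiplexing, allowing multiple telemetry streams to coexist on a single long-lived TCP connection. This removes the application-layer "head-of-line blocking" of HTTP/1.x, where one slow response stalls every request queued behind it. Packet loss on the shared TCP connection can still delay all streams, because TCP itself delivers bytes in order.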
Bi-Directional Streaming
Once the subscription is established, the device can push data indefinitely. The collector doesn't need to re-authenticate or re-request data; it simply listens to a continuous "firehose" of binary-encoded network state.
The Scaling Problem: Why SNMP Breaks at Hyperscale
The core problem with SNMP is its polling latency. If you poll a device every 5 minutes, the state you see can be up to 5 minutes old. A link that flaps 50 times between polls shows up as at most one change in your monitoring system: the final state. This is a critical observability gap for modern networks where failures propagate in milliseconds.
The Polling Cost Function
The total polling load on an NMS grows linearly with both devices and metrics:
Polls/hour = Devices × Metrics per Device × (60 / Poll Interval in minutes)
For 5,000 devices with 50 metrics each, polled every 1 minute: 15,000,000 GET requests per hour. The NMS becomes a significant load generator on the network itself.
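The arithmetic above can be checked with a short sketch. It assumes one GET request per metric per poll, as the example in the text does; bulk requests that fetch several OIDs at once would lower the count. The function name is illustrative.

```python
def polls_per_hour(devices: int, metrics_per_device: int, poll_interval_min: float) -> float:
    """SNMP GET requests per hour, assuming one GET per metric per poll."""
    if poll_interval_min <= 0:
        raise ValueError("poll interval must be positive")
    polls_per_metric_per_hour = 60 / poll_interval_min
    return devices * metrics_per_device * polls_per_metric_per_hour


if __name__ == "__main__":
    print(polls_per_hour(5000, 50, 1))  # 15000000.0
    print(polls_per_hour(5000, 50, 5))  # 3000000.0
```

Relaxing the interval from 1 minute to 5 minutes cuts the load by a factor of five, at the cost of the staleness described earlier.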
Streaming Telemetry: The Push-Based Alternative
Streaming telemetry inverts the model. Instead of the management system requesting data, the network device pushes data to a collector at a pre-configured interval or on change. The modern standard for this is gNMI (gRPC Network Management Interface), developed by the OpenConfig working group.
YANG Data Models: The Structured Schema
At the heart of modern telemetry is YANG (Yet Another Next Generation), a data modeling language defined in RFC 7950. Unlike SNMP MIBs, which expose data as numeric OID tables and often depend on vendor-specific extensions, YANG models define structured, hierarchical schemas that can be vendor-neutral (OpenConfig models) or vendor-specific (Cisco YANG, Juniper YANG).
A gNMI path to interface statistics on an OpenConfig model looks like a filesystem path:
# OpenConfig gNMI path for interface counters
/interfaces/interface[name=GigabitEthernet0/0]/state/counters/in-octets
This path-based approach is immediately human-readable and programmatically navigable, unlike an SNMP OID such as 1.3.6.1.2.1.2.2.1.10.
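As an illustration of "programmatically navigable", the sketch below splits a gNMI-style path string into element names and list keys. Note that the key value `GigabitEthernet0/0` itself contains a slash, so a plain `split("/")` would break it. The helper names are hypothetical, and the parser ignores escaped brackets, which real path handling needs to cover.

```python
import re
from typing import Dict, List, Tuple

PathElem = Tuple[str, Dict[str, str]]  # (element name, list keys)

KEY_RE = re.compile(r"\[([^=\]]+)=([^\]]*)\]")


def split_path(path: str) -> List[str]:
    """Split on '/' but keep slashes that appear inside [key=value] brackets."""
    parts: List[str] = []
    current: List[str] = []
    depth = 0
    for ch in path:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "/" and depth == 0:
            if current:
                parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def parse_path(path: str) -> List[PathElem]:
    """Turn a path string into (name, keys) pairs, one per path element."""
    elems: List[PathElem] = []
    for part in split_path(path):
        name = part.split("[", 1)[0]
        keys = dict(KEY_RE.findall(part))
        elems.append((name, keys))
    return elems


if __name__ == "__main__":
    p = "/interfaces/interface[name=GigabitEthernet0/0]/state/counters/in-octets"
    for name, keys in parse_path(p):
        print(name, keys)
```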
MDT vs. Event-Driven Telemetry
A common misconception is that all streaming telemetry is the same. In reality, we distinguish between **Model-Driven Telemetry (MDT)** and **Event-Driven Telemetry (EDT)**.
Model-Driven (MDT)
MDT maps internal device data structures (counters, FIB entries, temperature sensors) to a YANG model. The device kernel pushes this raw data directly to the line card or supervisor CPU for transmission. This is highly efficient and provides granular "snapshots" of performance.
- In-Kernel Extraction
- Periodic interval (SAMPLE)
Event-Driven (EDT)
EDT triggers only when a specific threshold is met or a state change occurs (e.g., optical power drops below -15 dBm). This is the "smarter" cousin of the SNMP Trap, but with reliable transport and rich YANG context.
- Threshold Violation
- State Change (ON_CHANGE)
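The difference between periodic and event-driven delivery can be simulated in a few lines. The sketch below is a toy model, not device behavior: it replays a list of timestamped link states and shows which updates a periodic sampler and an on-change subscription would each emit. All function names and the one-transition-per-second flap pattern are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, observed value)


def on_change_updates(events: List[Event]) -> List[Event]:
    """Emit an update every time the value differs from the previous one."""
    updates: List[Event] = []
    last: Optional[str] = None
    for ts, value in events:
        if value != last:
            updates.append((ts, value))
            last = value
    return updates


def sample_updates(events: List[Event], interval_s: float, end_s: float) -> List[Event]:
    """Emit the most recent value at t = 0, interval_s, 2*interval_s, ... up to end_s."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    updates: List[Event] = []
    current: Optional[str] = None
    idx = 0
    tick = 0
    while tick * interval_s <= end_s:
        t = tick * interval_s
        while idx < len(events) and events[idx][0] <= t:
            current = events[idx][1]
            idx += 1
        if current is not None:
            updates.append((t, current))
        tick += 1
    return updates


def flapping_link(flaps: int) -> List[Event]:
    """Link is UP at t=0, then goes DOWN and back UP `flaps` times, one transition per second."""
    events: List[Event] = [(0, "UP")]
    for i in range(1, 2 * flaps + 1):
        events.append((i, "DOWN" if i % 2 else "UP"))
    return events


if __name__ == "__main__":
    events = flapping_link(50)
    print(len(on_change_updates(events)), "on-change updates")
    print(sample_updates(events, interval_s=300, end_s=300))
```

With a 300-second interval the sampler sees the link UP at both ticks, so the 50 flaps are invisible, while the on-change stream records every transition.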
The Modern Telemetry Pipeline Architecture
A production-grade streaming telemetry deployment follows a pipeline architecture with four distinct stages:
1. Collectors
Agents (Telegraf, gNMIc) that establish gRPC connections to network devices, receive streamed data, and normalize it into a standard format (InfluxDB line protocol, Protobuf).
2. Message Bus
Kafka or NATS provides a high-throughput, durable message queue between collectors and the processing layer, decoupling ingestion from analysis.
3. Time-Series Database
InfluxDB, TimescaleDB, or VictoriaMetrics stores the time-indexed telemetry data for querying and long-term analysis.
4. Visualization & Alerting
Grafana dashboards and alert rules consume the time-series data, providing sub-second observability metrics that are impossible with SNMP polling intervals.
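The normalization step in the collector stage can be sketched as a function that renders one telemetry update as an InfluxDB line protocol string. This is a simplified illustration: it handles integer fields only, and the measurement name, tag names, and values in the example are hypothetical.

```python
from typing import Dict


def _escape_key_or_tag(text: str) -> str:
    """Escape the delimiters line protocol reserves in tag keys, tag values, and field keys."""
    return text.replace(",", r"\,").replace("=", r"\=").replace(" ", r"\ ")


def _escape_measurement(text: str) -> str:
    """Measurement names only need commas and spaces escaped."""
    return text.replace(",", r"\,").replace(" ", r"\ ")


def to_line_protocol(measurement: str, tags: Dict[str, str],
                     int_fields: Dict[str, int], timestamp_ns: int) -> str:
    """Render one point; this sketch supports integer fields only."""
    if not int_fields:
        raise ValueError("line protocol requires at least one field")
    tag_part = "".join(
        f",{_escape_key_or_tag(k)}={_escape_key_or_tag(v)}"
        for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape_key_or_tag(k)}={int(v)}i" for k, v in int_fields.items()
    )
    return f"{_escape_measurement(measurement)}{tag_part} {field_part} {timestamp_ns}"


if __name__ == "__main__":
    line = to_line_protocol(
        "interface_counters",                                   # hypothetical measurement
        {"device": "leaf1", "name": "GigabitEthernet0/0"},      # hypothetical tags
        {"in_octets": 123456},
        1700000000000000000,
    )
    print(line)
```

Putting the device and interface names in tags rather than fields keeps them indexed in the time-series database, which is what makes per-interface queries cheap.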
Closed-Loop Automation Integration
The true power of streaming telemetry is realized when it is coupled with **Closed-Loop Automation (CLA)**. In this architecture, telemetry data serves as the feedback signal in a PID-like control system for the network.
The device streams interface counters via gNMI every 100ms. The collector detects a rapid spike in egress drops on a specific port.
An automation engine (e.g., Ansible EDA or StackStorm) evaluates the telemetry against an SLO. The drop rate exceeds 0.01%.
The engine executes a NETCONF or gNMI `SET` operation to modify the traffic shaper or reroute high-priority flows to a different path.
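The evaluation step of this loop can be sketched as a drop-rate check between two consecutive counter samples. The sketch assumes that `out_pkts` counts transmitted packets and `out_discards` counts packets dropped on egress, so attempted packets are the sum of both; the class and function names are hypothetical, and the remediation itself (the NETCONF or gNMI `SET` call) is left out.

```python
from dataclasses import dataclass

SLO_DROP_RATE = 0.0001  # 0.01 %, the threshold used in the text


@dataclass
class CounterSample:
    out_pkts: int      # packets transmitted (assumed counter semantics)
    out_discards: int  # packets dropped on egress (assumed counter semantics)


def drop_rate(prev: CounterSample, curr: CounterSample) -> float:
    """Fraction of attempted packets that were dropped between two samples."""
    sent = curr.out_pkts - prev.out_pkts
    dropped = curr.out_discards - prev.out_discards
    attempted = sent + dropped
    return dropped / attempted if attempted > 0 else 0.0


def needs_remediation(prev: CounterSample, curr: CounterSample,
                      slo: float = SLO_DROP_RATE) -> bool:
    """True when the observed drop rate violates the SLO."""
    return drop_rate(prev, curr) > slo


if __name__ == "__main__":
    before = CounterSample(out_pkts=1_000_000, out_discards=10)
    after = CounterSample(out_pkts=1_010_000, out_discards=15)
    print(needs_remediation(before, after))  # True: 5 drops out of 10,005 attempts
```

A production engine would also debounce the decision over several samples so that a single noisy interval does not trigger a reroute.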
"Telemetery makes the network self-healing. Without the low-latency push model, automation is always chasing the ghost of a past state."
The Collector Complexity: Protobuf & High Cardinality
Implementing a telemetry collector is significantly more complex than writing an SNMP polling script. The collector must manage long-lived gRPC channels, handle certificate-based mutual authentication (mTLS), and—most importantly—perform high-speed Protobuf decoding.
Because gRPC is binary, the collector needs the exact `.proto` and YANG files used by the device to deserialize the message. If the vendor updates their software version and changes the schema, the collector may fail to parse the data unless its local proto definitions are updated. This creates a "schema versioning" challenge that was less severe in the MIB world.
Furthermore, the volume and **High Cardinality** of telemetry data (1,000s of metrics per second across 1,000s of devices, each metric carrying its own combination of device, interface, and path labels) require a highly performant downstream storage engine. Traditional relational databases often crumble under this write-heavy load, necessitating the use of specialized time-series databases (TSDBs).
Operational Comparison Summary
| Criterion | SNMP Polling | Streaming Telemetry (gNMI) |
|---|---|---|
| Latency to detect event | Poll interval (typically 5 min) | Seconds or immediate (ON_CHANGE) |
| Transport protocol | UDP (unreliable) | TCP/gRPC (reliable, encrypted) |
| Data model | MIBs (standard and vendor-specific) | YANG (OpenConfig or vendor-native) |
| Security | Cleartext community strings (v1/v2c); USM authentication and privacy (v3) | mTLS, certificate-based auth |
| NMS CPU load | High (NMS drives all requests) | Low (device drives pushes) |
SNMP is a reliable workhorse for small, static networks where 5-minute polling latency is acceptable. For networks with more than a few hundred devices, high-velocity events (link flapping, microbursts), or a security posture that disallows cleartext community strings, streaming telemetry via gNMI is the architecture best placed to deliver the sub-minute observability that modern operations require. The migration investment in YANG model familiarity and pipeline tooling pays for itself the first time you detect and remediate a cascading failure before users open a ticket.
The Telemetry Encyclopedia
gNMI
gRPC Network Management Interface. The industry standard protocol for streaming telemetry and configuration management.
YANG
Yet Another Next Generation. A data modeling language used to define the hierarchical schema of the data being streamed.
Protobuf
Protocol Buffers. Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data.
MIB
Management Information Base. A hierarchical database used by SNMP to define manageable objects.
OID
Object Identifier. A numeric string (e.g., 1.3.6.1...) used to uniquely identify an object in a MIB.
Shadowing
A telemetry technique where the device maintains a 'shadow copy' of the operational state to minimize kernel-to-user space context switching.
Dial-In
A connection mode where the collector initiates the gRPC session to the network device.
Dial-Out
A connection mode where the network device initiates the gRPC session to a pre-configured collector.
Encoding: JSON_IETF
A standardized JSON representation of YANG-modeled data (RFC 7951), also used by RESTCONF.
Encoding: BYTES
A raw binary encoding often used for high-throughput performance metrics.
Leaf
A terminal node in a YANG tree that contains a specific value (e.g., an interface description).
Container
A structural node in a YANG tree used to group related nodes (e.g., all interfaces).
XPath
A query language used to identify specific paths within a YANG schema (e.g., /interfaces/interface/state).
gNOI
gRPC Network Operations Interface. A set of gRPC services for operational actions like rebooting, pinging, or file transfer.
gRIBI
gRPC Routing Information Base Interface. A protocol for external controllers to program the RIB of a device.
Flow Labels
In IPv6, a field used to identify packets of a specific flow for consistent telemetry tracking.
NetFlow/IPFIX
A specialized form of telemetry focused on traffic flows rather than device counters.
sFlow
A multi-vendor standard for packet sampling, strictly push-based and high-velocity.
mTLS
Mutual TLS. A security requirement for gNMI where both the device and collector must present valid certificates.
Telegraf
A popular open-source collector agent that supports gNMI and SNMP inputs.
Related Engineering Resources
gNMI Protocol Deep Dive: The Model-Driven Telemetry Standard
The gRPC Network Management Interface (gNMI) has emerged as the industry standard for streaming telemetry from network devices, and understanding its architecture is essential for any network engineer deploying modern observability infrastructure. gNMI is built on three foundational technologies: gRPC as the RPC framework, Protocol Buffers (protobuf) as the data serialization format, and YANG (Yet Another Next Generation) as the data modeling language. The gNMI client (typically a telemetry collector such as Telegraf, gNMIc, or a vendor-specific collector) establishes a gRPC connection to the gNMI server (running on the network device) and can query the encodings and models the device supports through the Capabilities RPC. gRPC runs over HTTP/2, which provides multiplexing (multiple gNMI subscriptions can share the same TCP connection) and flow control, and the session is encrypted with TLS (TLS 1.2 or higher is the usual baseline for production deployments). The connection is long-lived: once established, it persists indefinitely, with client and server exchanging keepalive pings (commonly every 30-60 seconds, depending on configuration) to verify that it is still healthy. This persistent connection model is fundamentally different from SNMP's connectionless UDP model, and it provides the reliability guarantees (TCP delivery, TLS encryption, gRPC error handling) that are essential for production telemetry systems.
The gNMI subscription mechanism is the core of the protocol's telemetry capability. The client sends a Subscribe RPC (Remote Procedure Call) that specifies the YANG path of the data to be streamed, the subscription mode (ON_CHANGE, SAMPLE, or TARGET_DEFINED), and the sampling interval (for SAMPLE mode). The ON_CHANGE mode is the most efficient for data that changes infrequently (such as interface administrative status or BGP session state), as the device sends an update only when the value changes, eliminating the periodic polling overhead. The SAMPLE mode is used for data that changes continuously (such as interface byte counters or CPU utilization), where the device samples the value at the specified interval (typically 10-60 seconds) and sends the sampled value to the collector. The TARGET_DEFINED mode allows the device to determine the optimal subscription mode and interval for each data path, which simplifies the client configuration but requires the device to implement intelligent subscription management. The gNMI specification also supports "heartbeat" intervals that ensure the collector receives periodic updates even for unchanged ON_CHANGE subscriptions, preventing the collector from incorrectly inferring that the device has failed when no updates are received within the expected timeframe.
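The interaction between ON_CHANGE and a heartbeat interval can be sketched from both ends of the connection. The first function models the device-side decision to send an update; the second models the collector-side check that flags a path as stale. Both are simplified illustrations with hypothetical names, and the grace factor of 2 is an assumption, not a value from the gNMI specification.

```python
from typing import Optional


def should_send(value: str, last_sent_value: Optional[str],
                last_sent_at: Optional[float], now: float,
                heartbeat_s: float) -> bool:
    """Device side: send on first sync, on any change, or when the heartbeat expires."""
    if last_sent_at is None:
        return True                      # initial synchronization
    if value != last_sent_value:
        return True                      # ON_CHANGE trigger
    return now - last_sent_at >= heartbeat_s  # heartbeat for unchanged values


def is_stale(last_update_at: float, now: float, heartbeat_s: float,
             grace_factor: float = 2.0) -> bool:
    """Collector side: no update within grace_factor heartbeats suggests a problem."""
    return (now - last_update_at) > heartbeat_s * grace_factor


if __name__ == "__main__":
    # Unchanged BGP session state, 60-second heartbeat.
    print(should_send("ESTABLISHED", "ESTABLISHED", 0.0, 30.0, 60.0))  # False
    print(should_send("ESTABLISHED", "ESTABLISHED", 0.0, 60.0, 60.0))  # True
    print(is_stale(last_update_at=0.0, now=121.0, heartbeat_s=60.0))   # True
```

Without the heartbeat, the collector could not distinguish "nothing changed" from "the device stopped sending", which is the failure mode the heartbeat exists to rule out.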
The YANG data model that underlies gNMI telemetry provides a standardized, self-documenting schema for the telemetry data that the device exposes. Each YANG model defines the hierarchy of data paths, the data types of each leaf, and the constraints on the values (valid ranges, mandatory fields, uniqueness constraints). The standardization of YANG models through the IETF (RFC 8344 for IP, RFC 8343 for interfaces, RFC 8349 for routing) and the OpenConfig working group provides multi-vendor interoperability: an OpenConfig-interfaces YANG model that defines interface counters will use the same data path (e.g., /interfaces/interface/state/counters/in-octets) across all vendors that support the model. This multi-vendor standardization is the most significant advantage of gNMI-based telemetry over SNMP-based monitoring, where each vendor uses different OID hierarchies and proprietary MIBs that require vendor-specific knowledge to interpret. The deployment of gNMI telemetry starts with the selection of the YANG models to use (OpenConfig for multi-vendor deployments, vendor-native for single-vendor deployments), followed by the configuration of the gNMI server on the network devices (including TLS certificates for authentication and encryption), and finally the configuration of the telemetry collector to subscribe to the desired data paths from each device.
The operational management of a gNMI telemetry deployment introduces new challenges that the network engineering team must address. The first challenge is TLS certificate management: each gNMI server requires a valid TLS certificate, and the certificate must be renewed before it expires to maintain the gRPC connection. The recommended approach is to deploy a Certificate Authority (CA) that issues short-lived certificates (1-3 months) to each network device, combined with automated certificate renewal using the Automated Certificate Management Environment (ACME) protocol or a vendor-specific certificate enrollment process. The second challenge is gNMI subscription capacity planning: each gNMI subscription consumes CPU and memory on the device, and a device with a large number of subscriptions (100+ per device) may experience performance degradation. A reasonable starting limit is 20-30 subscriptions per device for mainstream network devices (Cisco Nexus 9000, Arista 7280, Juniper MX), with each subscription streaming a focused set of data paths rather than subscribing to the entire YANG tree in a single subscription. The third challenge is data volume management: a single gNMI subscription streaming one counter per interface at 10-second intervals from a device with 100 interfaces generates 864,000 data points per day (100 interfaces × 8,640 samples), and a deployment monitoring 1,000 such devices generates 864 million data points per day. Each additional counter per interface multiplies that figure, so time-series database capacity and data retention policies need careful planning.
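The data volume estimate is simple arithmetic and worth scripting for capacity planning. The sketch below assumes one data point per leaf per interface per sample; with a single counter per interface, 100 interfaces, and a 10-second interval, that works out to 864,000 points per device per day. The function and parameter names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def points_per_day(interfaces: int, leaves_per_interface: int,
                   interval_s: float, devices: int = 1) -> float:
    """Data points per day, assuming one point per leaf per interface per sample."""
    if interval_s <= 0:
        raise ValueError("sampling interval must be positive")
    samples_per_day = SECONDS_PER_DAY / interval_s
    return devices * interfaces * leaves_per_interface * samples_per_day


if __name__ == "__main__":
    print(points_per_day(100, 1, 10))                # 864000.0
    print(points_per_day(100, 1, 10, devices=1000))  # 864000000.0
    # Streaming 8 counters per interface instead of 1 (illustrative):
    print(points_per_day(100, 8, 10, devices=1000))
```

The last line shows how quickly the total grows once a subscription covers a whole counters container rather than a single leaf.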
The evolution of gNMI is moving toward the integration of analytics and alerting directly into the telemetry pipeline. Many vendor implementations additionally support a "dial-out" mode, where the device initiates the gRPC connection to the collector, so the collector does not have to open and track connections to thousands of devices. The OpenConfig working group is developing "on-change streaming with precomputed analytics," where the device computes derived metrics (such as the rate of change of interface counters, or the percentage utilization of a link) and streams only the derived metrics rather than the raw counters, which could reduce the data volume by 10-100x. The "gNMI Gateway" concept, where a centralized gateway terminates the gNMI connections from devices and provides a single northbound interface to the monitoring systems, simplifies the deployment of gNMI in large networks by abstracting the device-level connection management from the monitoring systems. These developments, combined with growing support for gNMI across major network vendors, are positioning gNMI as the telemetry protocol most likely to replace SNMP-based polling in modern network monitoring deployments. The network engineer who invests in learning gNMI today will be well-prepared for the telemetry-driven network operations model that is likely to shape the profession over the next decade.
Migration Strategy: Transitioning from SNMP-Polled Monitoring to Streaming Telemetry
The migration from SNMP-based monitoring to streaming telemetry is one of the most significant infrastructure projects that a network engineering organization can undertake, and it requires careful planning to avoid disrupting existing monitoring coverage during the transition. The recommended migration strategy follows a phased approach that spans 6-12 months. Phase 1 (Months 1-2) is the "Discovery and Planning" phase: the team inventories all existing SNMP monitoring configurations (which devices, which OIDs, which polling intervals), identifies the OpenConfig or vendor-native YANG paths that correspond to each monitored data point, and creates a migration plan that prioritizes the most critical monitoring streams (interface utilization, CPU/memory utilization, BGP session state, environmental sensors). Phase 2 (Months 3-4) is the "Pilot Deployment" phase: the team enables gNMI telemetry on a small subset of devices (typically 10-20 devices in a single data center or regional office), configures the telemetry collector to subscribe to the priority data streams, and validates that the streaming telemetry data matches the SNMP-polled data within acceptable accuracy margins (typically within 1% for counter-based metrics like interface octets, and within 5% for utilization-based metrics).
The data validation step in Phase 2 is critical for ensuring a successful migration. The team runs both SNMP polling and gNMI streaming simultaneously on the pilot devices, collecting data from both sources and comparing the results. The comparison must account for the different measurement methodologies: SNMP polls the current value of a counter at a fixed interval (e.g., every 60 seconds), while gNMI streams the value at the sampling interval (e.g., every 10 seconds). Utilization is derived from both sources the same way, as (counter_delta / time_delta) converted to bits per second and divided by the link speed; some devices can also stream utilization directly as a precomputed metric. The comparison should identify any systematic differences in the data: for example, gNMI might report a higher peak utilization than SNMP because a 10-second sample resolves a short burst of traffic that a 60-second SNMP average smooths out, so both series should be aggregated to the same window before they are compared. The team must establish a data validation threshold (typically ±2% for utilization, ±1% for counter-based metrics) and pause the migration if the validation metrics exceed these thresholds. Any validation failures must be investigated and resolved before proceeding to the wider rollout.
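The (counter_delta / time_delta) calculation and the tolerance check can be sketched as follows. The sketch assumes octet counters, handles a single counter wrap through modular arithmetic, and interprets the ±2% threshold as absolute percentage points of utilization, which is one possible reading of the text. All names are illustrative.

```python
def utilization_pct(prev_octets: int, curr_octets: int, elapsed_s: float,
                    link_bps: float, counter_bits: int = 64) -> float:
    """Average link utilization in percent between two octet counter readings."""
    if elapsed_s <= 0 or link_bps <= 0:
        raise ValueError("elapsed time and link speed must be positive")
    # Modular subtraction gives the right delta even if the counter wrapped once.
    delta_octets = (curr_octets - prev_octets) % (1 << counter_bits)
    bits_per_second = delta_octets * 8 / elapsed_s
    return bits_per_second / link_bps * 100


def within_tolerance(snmp_pct: float, gnmi_pct: float,
                     tolerance_points: float = 2.0) -> bool:
    """Compare two utilization figures computed over the SAME time window."""
    return abs(snmp_pct - gnmi_pct) <= tolerance_points


if __name__ == "__main__":
    # 750 MB transferred in 60 s on a 1 Gbit/s link.
    snmp = utilization_pct(0, 750_000_000, 60, 1_000_000_000)
    print(round(snmp, 3))
    # A 32-bit counter that wrapped between two polls.
    wrapped = utilization_pct(2**32 - 100, 100, 60, 1_000_000_000, counter_bits=32)
    print(wrapped > 0)
```

Because the check compares averages, the 10-second gNMI samples need to be rolled up to the 60-second SNMP window first; comparing a raw 10-second peak against a 60-second average would flag differences that are not data errors.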
Phase 3 (Months 5-6) is the "Wide Rollout" phase, where gNMI telemetry is enabled on all devices in the network, starting with the core devices (routers and switches in the data center backbone) and expanding to the edge devices (campus switches, branch routers, firewall clusters). The rollout is performed in batches of 50-100 devices per week, with each batch following the same validation process that was established in Phase 2. The SNMP polling is maintained throughout Phase 3, but the polling interval is gradually increased from 60 seconds to 300 seconds as confidence in the gNMI data grows. The SNMP polling infrastructure is kept as a fallback for the duration of Phase 3, in case a device exhibits gNMI reliability issues that require reverting to SNMP polling. The team must establish a "gNMI dashboard" that displays the health of the gNMI telemetry infrastructure: the number of active gNMI subscriptions per device, the subscription error rate, the data delivery latency (the time between data generation on the device and data availability in the time-series database), and the gNMI connection status for each device. Any device with a gNMI error rate exceeding 1% or a data delivery latency exceeding 2x the sampling interval is flagged for immediate investigation and potential rollback to SNMP polling.
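The flagging rule for the gNMI dashboard translates directly into code. The sketch below applies the two thresholds from the text (error rate above 1%, delivery latency above 2x the sampling interval) to a per-device health record. The record layout and names are hypothetical; a real dashboard would pull these figures from the collector's own metrics.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeviceHealth:
    name: str
    updates_ok: int              # updates received and parsed successfully
    updates_failed: int          # subscription or decode errors
    delivery_latency_s: float    # device timestamp to TSDB availability
    sample_interval_s: float


def needs_investigation(h: DeviceHealth, max_error_rate: float = 0.01,
                        latency_factor: float = 2.0) -> bool:
    """Apply the rollout thresholds: >1% errors or latency above 2x the sampling interval."""
    total = h.updates_ok + h.updates_failed
    error_rate = h.updates_failed / total if total else 0.0
    too_slow = h.delivery_latency_s > latency_factor * h.sample_interval_s
    return error_rate > max_error_rate or too_slow


def flagged(devices: List[DeviceHealth]) -> List[str]:
    """Names of devices that should be investigated or rolled back to SNMP."""
    return [d.name for d in devices if needs_investigation(d)]


if __name__ == "__main__":
    fleet = [
        DeviceHealth("core-1", 9_990, 10, 4.0, 10.0),    # 0.1% errors, fast
        DeviceHealth("edge-7", 9_700, 300, 5.0, 10.0),   # 3% errors
        DeviceHealth("edge-9", 10_000, 0, 25.0, 10.0),   # latency above 2x interval
    ]
    print(flagged(fleet))  # ['edge-7', 'edge-9']
```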
Phase 4 (Months 7-12) is the "Optimization and Decommissioning" phase, where the SNMP polling infrastructure is gradually decommissioned as gNMI telemetry is proven to be reliable. The team establishes a "SNMP retirement schedule" that removes SNMP polling for each data type in the order that was established in the migration plan: first to be retired are the basic interface statistics (which are well-covered by standard OpenConfig YANG models), followed by the routing protocol statistics (BGP, OSPF states), followed by the environmental data (temperature, fan speed, power supply status), and finally the vendor-specific data that may require custom YANG models or vendor-native telemetry paths. The retirement schedule includes a "canary period" of 30 days for each data type, where SNMP polling is suspended but the configuration is retained in case a rollback is required. After the 30-day canary period passes without any data quality issues, the SNMP configuration for the retired data type is removed from the monitoring system. The SNMP polling infrastructure itself (the SNMP poller servers, the SNMP community strings, the MIB compilation database) is maintained for an additional 6 months as a contingency, after which it is decommissioned and the associated hardware is repurposed or retired.
The challenges of the SNMP-to-telemetry migration extend beyond the technical implementation to include team training, cultural change, and vendor management. The network engineering team must be trained on the gNMI protocol, the YANG data modeling language, and the new telemetry collector configuration syntax. The monitoring team must adapt their dashboard designs and alerting logic to the streaming telemetry data model, which provides higher resolution data but may require different aggregation and thresholding approaches than the lower-resolution SNMP data. The vendor relationships must be managed to ensure that all network devices in the deployment support the required gNMI capabilities, and that the vendor's software release schedule aligns with the migration timeline. The total cost of the migration must be tracked, including the software licensing costs for the gNMI telemetry features (which may require premium license tiers on some vendor platforms), the hardware costs for the telemetry collector infrastructure, and the labor costs for the migration project team. Despite these challenges, the benefits of the migration—higher resolution data, lower monitoring infrastructure overhead, multi-vendor standardization through OpenConfig YANG models, and the ability to implement real-time network analytics and automation—justify the investment for any organization that is serious about network observability and reliability engineering.