Network Telemetry vs. SNMP
The Operational Divide in Modern Network Observability
The SNMP Architecture: A Polling-Based Legacy
SNMP was designed in 1988 (RFC 1067) when networks were small, devices were few, and management stations were powerful enough to query everything. Its architecture is fundamentally a request/response (pull) model: the Network Management System (NMS) periodically polls each device, requesting specific OIDs (Object Identifiers) from the device's MIB (Management Information Base).
The gRPC Transport Hydraulics
To understand why modern telemetry is superior, we must look at the transport layer: **gRPC**, an open-source remote procedure call framework originally developed at Google. Unlike SNMP, which typically runs over stateless, unreliable UDP, gRPC is built on **HTTP/2** over **TCP**.
Binary Serialization (Protobuf)
While SNMP uses ASN.1 BER (Basic Encoding Rules), a tag-length-value encoding that is comparatively expensive to parse, gRPC uses **Protocol Buffers (Protobuf)**. Protobuf is a binary format that is typically 3x to 10x smaller than XML/JSON and significantly faster to serialize/deserialize, reducing the CPU burden on the device's control-plane processor.
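To make the two wire formats concrete, the sketch below hand-encodes one SNMP OID in BER and one integer as a Protobuf varint. It is an illustration of the byte layouts only, not a size or speed benchmark; the function names are hypothetical and the BER encoder handles only short-form lengths.

```python
def encode_base128(value: int) -> bytes:
    """Encode one OID arc as BER base-128: big-endian 7-bit groups,
    with the high bit set on every byte except the last."""
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)
        value >>= 7
    return bytes(reversed(groups))


def ber_encode_oid(dotted: str) -> bytes:
    """Encode a dotted OID as a BER tag-length-value triple (tag 0x06)."""
    arcs = [int(part) for part in dotted.split(".")]
    # BER packs the first two arcs into a single value: 40 * first + second.
    body = encode_base128(40 * arcs[0] + arcs[1])
    for arc in arcs[2:]:
        body += encode_base128(arc)
    if len(body) > 127:
        raise ValueError("long-form lengths are not handled in this sketch")
    return bytes([0x06, len(body)]) + body


def protobuf_varint(value: int) -> bytes:
    """Encode a non-negative integer as a Protobuf varint: little-endian
    7-bit groups, with the high bit set on every byte except the last."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


oid_bytes = ber_encode_oid("1.3.6.1.2.1.2.2.1.10")
varint_bytes = protobuf_varint(300)
```

Here `oid_bytes` is 11 bytes (tag, length, and nine content bytes) and `varint_bytes` is 2 bytes. In a real Protobuf message the varint would be preceded by a field tag derived from the `.proto` schema, which is why the decoder needs that schema.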
HTTP/2 Multiplexing
SNMP requires a new UDP request for every set of OIDs (or a series of requests for large tables). gRPC leverages HTTP/2 multiplexing, allowing multiple telemetry streams to coexist on a single long-lived TCP connection. This removes the request-level "head-of-line blocking" of HTTP/1.1, where one slow response stalls everything queued behind it, although a lost TCP segment can still delay all streams sharing the connection.
Bi-Directional Streaming
Once the subscription is established, the device can push data indefinitely. The collector doesn't need to re-authenticate or re-request data; it simply listens to a continuous "firehose" of binary-encoded network state.
The Scaling Problem: Why SNMP Breaks at Hyperscale
The core problem with SNMP is its polling latency. If you poll a device every 5 minutes, you only know the state of the network as it was at the last poll, up to 5 minutes ago. A link that flaps 50 times between polls shows up in the polled status as at most one change: the state observed at the next poll. This is a critical observability gap for modern networks where failures propagate in milliseconds.
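The gap can be shown with a small simulation. The sketch below assumes a hypothetical link that starts up and toggles 51 times, 2 seconds apart, then counts how many of those transitions a status poller would notice. All names and numbers are illustrative.

```python
from bisect import bisect_right


def state_at(transitions, t, initial="up"):
    """Return the link state at time t, given sorted (time, state) transitions."""
    times = [ts for ts, _ in transitions]
    idx = bisect_right(times, t)
    return initial if idx == 0 else transitions[idx - 1][1]


def polled_changes(transitions, poll_interval, horizon, initial="up"):
    """Return the state changes a poller sees when sampling at
    t = 0, poll_interval, 2 * poll_interval, ... up to horizon (seconds)."""
    seen = []
    previous = initial
    t = 0
    while t <= horizon:
        current = state_at(transitions, t, initial)
        if current != previous:
            seen.append((t, current))
            previous = current
        t += poll_interval
    return seen


# Hypothetical flap: 51 transitions starting at t=10 s, ending in "down".
flaps = [(10 + 2 * i, "down" if i % 2 == 0 else "up") for i in range(51)]

visible_to_poller = polled_changes(flaps, poll_interval=300, horizon=300)
```

With a 300-second interval the poller records a single change, `(300, "down")`, while an on-change subscription would have delivered all 51 transitions as they happened.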
The Polling Cost Function
The total polling load on an NMS grows linearly with both devices and metrics:
Polls/hour = Devices × Metrics per Device × (60 / Poll Interval in minutes)
For 5,000 devices with 50 metrics each, polled every 1 minute: 15,000,000 GET requests per hour. The NMS becomes a significant load generator on the network itself.
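The same arithmetic as a small function, useful for trying other fleet sizes. It assumes, as the figure above does, that every metric is fetched with its own GET request; the function name is hypothetical.

```python
def polls_per_hour(devices: int, metrics_per_device: int, poll_interval_min: float) -> float:
    """Requests per hour if every metric on every device is polled once per interval."""
    if poll_interval_min <= 0:
        raise ValueError("poll interval must be positive")
    return devices * metrics_per_device * 60 / poll_interval_min


hourly = polls_per_hour(devices=5_000, metrics_per_device=50, poll_interval_min=1)
per_second = hourly / 3600
```

For the example in the text, `hourly` is 15,000,000, which works out to roughly 4,167 requests per second sustained. Batching several OIDs per request or using GETBULK would reduce the request count, but not the amount of data the NMS has to pull.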
Streaming Telemetry: The Push-Based Alternative
Streaming telemetry inverts the model. Instead of the management system requesting data, the network device pushes data to a collector at a pre-configured interval or on change. The modern standard for this is gNMI (gRPC Network Management Interface), developed by the OpenConfig working group.
YANG Data Models: The Structured Schema
At the heart of modern telemetry is YANG (Yet Another Next Generation), a data modeling language defined in RFC 7950. Unlike SNMP MIBs, which model data as scalars and flat tables and are frequently vendor-specific, YANG models define structured, hierarchical schemas that can be vendor-neutral (OpenConfig models) or vendor-specific (Cisco YANG, Juniper YANG).
A gNMI path to interface statistics on an OpenConfig model looks like a filesystem path:
# OpenConfig gNMI path for interface counters
/interfaces/interface[name=GigabitEthernet0/0]/state/counters/in-octets
This path-based approach is immediately human-readable and programmatically navigable, unlike an SNMP OID such as 1.3.6.1.2.1.2.2.1.10 (ifInOctets in IF-MIB).
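One practical wrinkle is that list keys can themselves contain a slash, as `GigabitEthernet0/0` does, so a plain `split("/")` breaks the path. The following is a minimal sketch of a parser that treats slashes inside `[...]` as part of the key value. The function names are hypothetical, and escaped brackets inside key values are not handled.

```python
def _split_keys(elem: str):
    """Split 'interface[name=Gi0/0]' into ('interface', {'name': 'Gi0/0'})."""
    name, _, rest = elem.partition("[")
    keys = {}
    while rest:
        pair, _, rest = rest.partition("]")
        key, _, value = pair.partition("=")
        keys[key] = value
        rest = rest.lstrip("[")
    return name, keys


def parse_gnmi_path(path: str):
    """Split an XPath-style gNMI path into a list of (name, keys) elements."""
    elems = []
    current = []
    depth = 0  # greater than zero while inside a [key=value] predicate
    for ch in path:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced brackets in path")
        if ch == "/" and depth == 0:
            if current:
                elems.append("".join(current))
            current = []
        else:
            current.append(ch)
    if depth != 0:
        raise ValueError("unbalanced brackets in path")
    if current:
        elems.append("".join(current))
    return [_split_keys(elem) for elem in elems]


parsed = parse_gnmi_path(
    "/interfaces/interface[name=GigabitEthernet0/0]/state/counters/in-octets"
)
```

The result is five elements, with the second one carrying the key `{"name": "GigabitEthernet0/0"}`. This mirrors how a gNMI path is structured on the wire: an ordered list of elements, each with an optional key map.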
MDT vs. Event-Driven Telemetry
A common misconception is that all streaming telemetry is the same. In reality, we distinguish between **Model-Driven Telemetry (MDT)** and **Event-Driven Telemetry (EDT)**.
Model-Driven (MDT)
MDT maps internal device data structures (counters, FIB entries, temperature sensors) to a YANG model. The device kernel pushes this raw data directly to the line card or supervisor CPU for transmission. This is highly efficient and provides granular "snapshots" of performance.
- In-Kernel Extraction
- Periodic interval (SAMPLE)
Event-Driven (EDT)
EDT triggers only when a specific threshold is met or a state change occurs (e.g., optical power drops below -15 dBm). This is the "smarter" cousin of the SNMP Trap, but with reliable transport and rich YANG context.
- Threshold Violation
- State Change (ON_CHANGE)
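The difference between the two modes can be sketched with a series of optical power readings. This is a toy model of the emission logic only, not a gNMI client; the readings, the function names, and the one-reading-per-second assumption are all hypothetical.

```python
def sample_updates(readings, every_n):
    """Periodic (SAMPLE-style) emission: forward every n-th reading regardless of value."""
    return [reading for i, reading in enumerate(readings) if i % every_n == 0]


def threshold_events(readings, threshold_dbm):
    """Event-driven emission: forward a reading only when it crosses the threshold."""
    events = []
    was_below = None  # unknown until the first reading arrives
    for t, value in readings:
        is_below = value < threshold_dbm
        if was_below is not None and is_below != was_below:
            events.append((t, value, "below" if is_below else "recovered"))
        was_below = is_below
    return events


# Hypothetical optical receive power in dBm, one reading per second.
readings = [(0, -12.0), (1, -12.1), (2, -15.4), (3, -16.0), (4, -14.2), (5, -12.0)]

periodic = sample_updates(readings, every_n=2)
events = threshold_events(readings, threshold_dbm=-15.0)
```

The periodic stream forwards three of the six readings whether or not anything happened, while the event-driven stream emits exactly two messages: the drop below -15 dBm at t=2 and the recovery at t=4.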
The Modern Telemetry Pipeline Architecture
A production-grade streaming telemetry deployment follows a pipeline architecture with four distinct stages:
1. Collectors
Agents (Telegraf, gNMIc) that establish gRPC connections to network devices, receive streamed data, and normalize it into a standard format (InfluxDB line protocol, Protobuf).
2. Message Bus
Kafka or NATS provides a high-throughput, durable message queue between collectors and the processing layer, decoupling ingestion from analysis.
3. Time-Series Database
InfluxDB, TimescaleDB, or VictoriaMetrics stores the time-indexed telemetry data for querying and long-term analysis.
4. Visualization & Alerting
Grafana dashboards and alert rules consume the time-series data, providing sub-second observability metrics that are impossible with SNMP polling intervals.
Closed-Loop Automation Integration
The true power of streaming telemetry is realized when it is coupled with **Closed-Loop Automation (CLA)**. In this architecture, telemetry data serves as the feedback signal in a PID-like control system for the network.
1. The device streams interface counters via gNMI every 100ms. The collector detects a rapid spike in egress drops on a specific port.
2. An automation engine (e.g., Ansible EDA or StackStorm) evaluates the telemetry against an SLO. The drop rate exceeds 0.01%.
3. The engine executes a NETCONF or gNMI `SET` operation to modify the traffic shaper or reroute high-priority flows to a different path.
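The evaluation step in this loop can be sketched as a comparison of two consecutive counter samples against the 0.01% SLO. This is a simplified illustration: the class and function names are hypothetical, the drop rate is computed as discards over attempted packets (sent plus discarded), counter wrap is ignored, and the remediation itself is represented only by a returned label rather than an actual NETCONF or gNMI `SET` call.

```python
from dataclasses import dataclass

SLO_MAX_DROP_RATE = 0.0001  # 0.01% expressed as a fraction


@dataclass
class CounterSample:
    """Cumulative egress counters for one port at one point in time."""
    out_pkts: int
    out_discards: int


def drop_rate(prev: CounterSample, curr: CounterSample) -> float:
    """Fraction of attempted packets discarded between two samples."""
    sent = curr.out_pkts - prev.out_pkts
    dropped = curr.out_discards - prev.out_discards
    attempted = sent + dropped
    return dropped / attempted if attempted > 0 else 0.0


def evaluate(prev: CounterSample, curr: CounterSample, slo: float = SLO_MAX_DROP_RATE):
    """Return ('remediate' or 'ok', measured drop rate) for one interval."""
    rate = drop_rate(prev, curr)
    return ("remediate" if rate > slo else "ok", rate)


previous = CounterSample(out_pkts=1_000_000, out_discards=10)
spike = CounterSample(out_pkts=1_100_000, out_discards=60)
healthy = CounterSample(out_pkts=1_100_000, out_discards=15)

spike_decision, spike_rate = evaluate(previous, spike)
healthy_decision, healthy_rate = evaluate(previous, healthy)
```

In the spike case, 50 discards against 100,050 attempted packets is about 0.05%, which breaches the SLO and yields `"remediate"`; the healthy case stays near 0.005% and yields `"ok"`. A production engine would also want hysteresis or a minimum breach duration so that a single noisy interval does not trigger a reroute.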
"Telemetery makes the network self-healing. Without the low-latency push model, automation is always chasing the ghost of a past state."
The Collector Complexity: Protobuf & High Cardinality
Implementing a telemetry collector is significantly more complex than writing an SNMP polling script. The collector must manage long-lived gRPC channels, handle certificate-based mutual authentication (mTLS), and—most importantly—perform high-speed Protobuf decoding.
Because gRPC is binary, the collector needs the exact `.proto` and YANG files used by the device to deserialize the message. If the vendor updates their software version and changes the schema, the collector may fail to parse the data unless its local proto definitions are updated. This creates a "schema versioning" challenge that was less severe in the MIB world.
Furthermore, the **High Cardinality** of telemetry data (every device, interface, and counter combination is a distinct time series, and thousands of devices can each emit thousands of metrics per second) requires a highly performant downstream storage engine. Traditional relational databases often struggle under this write-heavy load, which is why specialized time-series databases (TSDBs) are typically used.
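A back-of-the-envelope estimate helps when sizing the storage tier. The sketch below multiplies out the number of distinct series and the resulting write rate; the fleet dimensions are hypothetical and the function names are illustrative.

```python
def series_cardinality(devices: int, interfaces_per_device: int, counters_per_interface: int) -> int:
    """Distinct time series if each device/interface/counter combination is one series."""
    return devices * interfaces_per_device * counters_per_interface


def writes_per_second(series: int, sample_interval_s: float) -> float:
    """Data points written per second if every series is sampled once per interval."""
    if sample_interval_s <= 0:
        raise ValueError("sample interval must be positive")
    return series / sample_interval_s


# Hypothetical fleet: 2,000 devices, 48 interfaces each, 20 counters per interface.
series = series_cardinality(2_000, 48, 20)
rate = writes_per_second(series, sample_interval_s=10)
```

For these assumed numbers, `series` is 1,920,000 and `rate` is 192,000 points per second at a 10-second sample interval. Halving the interval doubles the write rate but leaves cardinality unchanged, which is why the two need to be budgeted separately.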
Operational Comparison Summary
| Criterion | SNMP Polling | Streaming Telemetry (gNMI) |
|---|---|---|
| Latency to detect event | Poll interval (typically 5 min) | Seconds or immediate (ON_CHANGE) |
| Transport protocol | UDP (unreliable) | TCP/gRPC (reliable, encrypted) |
| Data model | MIBs (standard and vendor-specific) | YANG (OpenConfig or vendor-native) |
| Security | Community strings (v2c), USM auth/privacy (v3) | mTLS, certificate-based auth |
| NMS CPU load | High (NMS drives all requests) | Low (device drives pushes) |
SNMP is a reliable workhorse for small, static networks where 5-minute polling latency is acceptable. For any network with more than a few hundred devices, high-velocity events (link flapping, microbursts), or a security posture that disallows cleartext community strings, streaming telemetry via gNMI is the architecture that can deliver the sub-minute observability that modern operations require. The migration investment in YANG model familiarity and pipeline tooling pays for itself the first time you detect and remediate a cascading failure before users open a ticket.
The Telemetry Encyclopedia
gNMI
gRPC Network Management Interface. The industry standard protocol for streaming telemetry and configuration management.
YANG
Yet Another Next Generation. A data modeling language used to define the hierarchical schema of the data being streamed.
Protobuf
Protocol Buffers. Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data.
MIB
Management Information Base. A hierarchical database used by SNMP to define manageable objects.
OID
Object Identifier. A numeric string (e.g., 1.3.6.1...) used to uniquely identify an object in a MIB.
Shadowing
A telemetry technique where the device maintains a 'shadow copy' of the operational state to minimize kernel-to-user space context switching.
Dial-In
A connection mode where the collector initiates the gRPC session to the network device.
Dial-Out
A connection mode where the network device initiates the gRPC session to a pre-configured collector.
Encoding: JSON_IETF
A standardized JSON representation of YANG-modeled data (RFC 7951), also used by RESTCONF.
Encoding: BYTES
A raw binary encoding often used for high-throughput performance metrics.
Leaf
A terminal node in a YANG tree that contains a specific value (e.g., an interface description).
Container
A structural node in a YANG tree used to group related nodes (e.g., all interfaces).
XPath
A query language used to identify specific paths within a YANG schema (e.g., /interfaces/interface/state).
gNOI
gRPC Network Operations Interface. A set of gRPC services for operational actions like rebooting, pinging, or file transfer.
gRIBI
gRPC Routing Information Base Interface. A protocol for external controllers to program the RIB of a device.
Flow Labels
In IPv6, a field used to identify packets of a specific flow for consistent telemetry tracking.
NetFlow/IPFIX
A specialized form of telemetry focused on traffic flows rather than device counters.
sFlow
A multi-vendor standard for packet sampling, strictly push-based and high-velocity.
mTLS
Mutual TLS. A common security requirement in gNMI deployments where both the device and collector must present valid certificates.
Telegraf
A popular open-source collector agent that supports gNMI and SNMP inputs.