SRE for Networks
From CLI Configuration to Software Engineering
1. The Foundation: SLI, SLO, and SLA
In SRE, you cannot manage what you do not measure mathematically. Traditional operations teams treat availability as a binary: the link is either up or down. SRE breaks this down into a hierarchy of precision that allows for data-driven decision-making about where to invest engineering effort and when to halt risky changes.
- SLI (Service Level Indicator): A specific, measurable metric, e.g., "The percentage of successful TCP connections through the gateway in any 5-minute window."
- SLO (Service Level Objective): The target for the SLI, e.g., "99.99% of connections must be successful over a 30-day rolling window."
- Error Budget: The amount of failure you can tolerate. A 99.9% SLO gives you about 43.2 minutes of downtime in a 30-day window (0.1% of 43,200 minutes). While budget remains, you can ship changes faster. Once it is exhausted, all risky changes freeze until the rolling window has recovered enough budget.
- SLA (Service Level Agreement): The external, contractual commitment to customers, typically more lenient than the internal SLO so that engineering has headroom.
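The arithmetic behind the SLI and the error budget can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names sli, ErrorBudget, and changes_frozen are hypothetical, and treating a window with no traffic as fully successful is an assumption. For a 30-day window a 99.9% target works out to 43.2 minutes (an average calendar month of about 30.44 days gives roughly 43.8).

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


def sli(successful: int, total: int) -> float:
    """Fraction of successful connections in one measurement window."""
    if total == 0:
        return 1.0  # assumption: no traffic is counted as no failures
    return successful / total


@dataclass
class ErrorBudget:
    slo: float             # e.g. 0.999 for a 99.9% target
    window_days: int = 30  # rolling window length

    @property
    def allowed_minutes(self) -> float:
        """Total downtime the SLO tolerates over the window."""
        return (1.0 - self.slo) * self.window_days * MINUTES_PER_DAY

    def remaining_minutes(self, downtime_minutes: float) -> float:
        """Budget left after the downtime already observed in the window."""
        return self.allowed_minutes - downtime_minutes

    def changes_frozen(self, downtime_minutes: float) -> bool:
        """Risky changes freeze once the budget is exhausted."""
        return self.remaining_minutes(downtime_minutes) <= 0


budget = ErrorBudget(slo=0.999)
print(round(budget.allowed_minutes, 1))            # 43.2
print(budget.changes_frozen(downtime_minutes=50))  # True
```

The same class with slo=0.9999 shows why each extra nine is expensive: the 30-day budget shrinks to about 4.3 minutes.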
2. Infrastructure as Code (IaC)
A Network SRE never types config t directly on a live device. All changes are made in a Git repository using declarative tools such as Terraform or Ansible, with NetBox serving as the source of truth. The proposed configuration is validated against schemas, analyzed with a tool such as Batfish, and tested in a virtual lab (GNS3, Containerlab) by an automated CI/CD pipeline before it is deployed to production.
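The schema-validation stage of such a pipeline might look roughly like the sketch below. It is an illustration under stated assumptions: the record layout (name, vlan, address) and the function names validate_interface and validate_change are hypothetical, and the checks use only the Python standard library rather than the API of Batfish or any other real tool.

```python
import ipaddress


def validate_interface(intent: dict) -> list[str]:
    """Return a list of problems found in one intended-state record."""
    errors = []
    name = intent.get("name", "")
    if not name:
        errors.append("interface name is required")

    # 802.1Q VLAN IDs usable for traffic are 1 through 4094.
    vlan = intent.get("vlan")
    if not isinstance(vlan, int) or not 1 <= vlan <= 4094:
        errors.append(f"{name}: vlan must be an integer in 1..4094")

    # ip_interface raises ValueError for a malformed address or prefix.
    try:
        ipaddress.ip_interface(str(intent.get("address", "")))
    except ValueError:
        errors.append(f"{name}: address is not a valid IP/prefix")
    return errors


def validate_change(intents: list[dict]) -> list[str]:
    """Collect problems across every record in a proposed change."""
    return [err for intent in intents for err in validate_interface(intent)]


# Hypothetical change set as it might be read from the Git repository.
proposed = [
    {"name": "Ethernet1", "vlan": 100, "address": "10.0.0.1/31"},
    {"name": "Ethernet2", "vlan": 5000, "address": "10.0.0.300/24"},
]
problems = validate_change(proposed)
for problem in problems:
    print(problem)
```

A CI job would fail the build whenever problems is non-empty, so a malformed change never reaches the lab or production stages.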
This GitOps approach means that any configuration on any device can be directly traced back to a specific commit, authored by a specific engineer, reviewed by a peer, and deployed at a specific time. "Who changed this route policy?" becomes a git blame command rather than an investigation.
3. Automated Remediation and Closed-Loop Automation
If a link goes down at 3 AM, an SRE's goal is not a pager alert that wakes an on-call engineer — it is a script that resolves the issue before a human is needed. The progression from manual to fully automated is a maturity ladder:
- Read-Only Automation: Scripts that observe and report but do not act. Builds confidence in monitoring accuracy.
- Write-After-Approval: Scripts that detect a failure and propose a fix, which a human approves by clicking a button in a Slack bot or ChatOps tool.
- Closed-Loop Automation: Scripts that detect, remediate, log the action, and create a follow-up ticket — entirely without human intervention. Reserved for well-understood failure modes with high-confidence remediation runbooks.
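The three rungs of this ladder can be expressed as a single handler whose behavior depends on a mode setting. This is a minimal sketch: Mode, handle_link_down, and the injected callables (remediate, approve, open_ticket, log) are hypothetical names, standing in for whatever device API, ChatOps bot, and ticketing system a team actually uses.

```python
from enum import Enum
from typing import Callable


class Mode(Enum):
    READ_ONLY = "read-only"
    WRITE_AFTER_APPROVAL = "write-after-approval"
    CLOSED_LOOP = "closed-loop"


def handle_link_down(
    link: str,
    mode: Mode,
    remediate: Callable[[str], None],
    approve: Callable[[str], bool],
    open_ticket: Callable[[str], None],
    log: Callable[[str], None],
) -> bool:
    """React to a link-down event; return True if a fix was applied."""
    log(f"detected: {link} is down")

    if mode is Mode.READ_ONLY:
        return False  # observe and report only

    if mode is Mode.WRITE_AFTER_APPROVAL and not approve(link):
        log(f"fix for {link} proposed but not approved")
        return False

    # Reached with human approval, or directly in closed-loop mode.
    remediate(link)
    log(f"remediated: {link}")
    open_ticket(f"follow-up for automated fix on {link}")
    return True


# Usage with in-memory stand-ins for the real integrations.
events: list[str] = []
tickets: list[str] = []
applied = handle_link_down(
    "edge1:Ethernet1",
    Mode.CLOSED_LOOP,
    remediate=lambda l: events.append(f"rerouted around {l}"),
    approve=lambda l: False,  # never consulted in closed-loop mode
    open_ticket=tickets.append,
    log=events.append,
)
print(applied, len(tickets))  # True 1
```

Keeping the mode as an explicit parameter lets a team promote one failure type at a time from read-only to closed-loop as confidence in its runbook grows.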
Conclusion
SRE transforms networking from a reactive craft into a proactive engineering discipline. By measuring reliability mathematically through SLIs and SLOs, managing risk through error budgets, and automating both deployment and remediation through IaC and closed-loop systems, network engineers can operate infrastructure at a scale and speed no human team could match manually. The network stops being a liability and becomes a competitive advantage.