In a Nutshell

Kubernetes networking is notoriously complex because it is built on a flat, IP-per-Pod model: every Pod must be able to reach every other Pod without NAT. The Container Network Interface (CNI) is the standard that lets different networking providers (Calico, Flannel, Cilium) plug into Kubernetes to handle pod-to-pod and pod-to-service communication. This article explores overlay networks, eBPF-based acceleration, MTU trade-offs, and the evolution from iptables to eBPF-based data planes.

The Pod-to-Pod Mandate

In Kubernetes, every Pod gets its own unique IP address. The fundamental networking requirement is that any Pod must be able to communicate with any other Pod on any Node without using Network Address Translation (NAT). This 'flat network' model is deceptively simple to state but complex to implement across physical servers spanning multiple subnets and data centers.

The challenge is that the underlying physical infrastructure was never designed for this requirement. A typical data center uses a traditional routed or VLAN-based network in which each host or VM has a single IP address, and routing between subnets requires explicit configuration. Kubernetes needs to layer a virtual pod network transparently on top of this existing physical fabric.

How CNI Works

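When a Pod is created, the Kubernetes node agent (the kubelet) asks the container runtime to set up the Pod's sandbox, and the runtime invokes a CNI plugin. The plugin is a binary executable that implements the CNI specification: it receives its parameters through environment variables and a JSON network configuration on standard input, and returns a JSON result. The plugin, together with any node agent the provider ships alongside it, is responsible for: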

  1. Assigning an IP address to the Pod from a pre-allocated CIDR block (e.g., 10.244.0.0/16).
  2. Creating a virtual ethernet pair (veth): one end lives inside the Pod's network namespace, the other on the host node.
  3. Updating the routing table on the host node so traffic destined for this Pod's IP routes to its veth interface.
  4. Establishing the tunnel (if using an overlay) to other nodes so cross-node Pod traffic can be forwarded.
  5. Programming any NetworkPolicy rules (iptables rules or eBPF maps) that restrict traffic based on pod labels.
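
To make the invocation concrete, here is a minimal sketch of what a runtime might hand to a plugin for an ADD operation. The `CNI_*` environment variable names and the top-level config keys (`cniVersion`, `name`, `type`, `ipam`) come from the CNI specification; the network name, container ID, namespace path, and the choice of the reference `bridge` and `host-local` plugins are assumptions made for illustration. The sketch only builds the request and does not execute any plugin.

```python
import json

def build_cni_add_request(container_id, netns_path, ifname="eth0"):
    """Return (env, stdin_json) for a hypothetical CNI ADD invocation."""
    # Parameters the runtime passes through the environment.
    env = {
        "CNI_COMMAND": "ADD",
        "CNI_CONTAINERID": container_id,
        "CNI_NETNS": netns_path,
        "CNI_IFNAME": ifname,
        "CNI_PATH": "/opt/cni/bin",
    }
    # Network configuration the plugin reads from standard input.
    config = {
        "cniVersion": "1.0.0",
        "name": "example-pod-network",  # hypothetical network name
        "type": "bridge",               # plugin binary to execute
        "bridge": "cbr0",
        "ipam": {"type": "host-local", "subnet": "10.244.1.0/24"},
    }
    return env, json.dumps(config)

# Hypothetical container ID and namespace path.
env, stdin_payload = build_cni_add_request("abc123", "/var/run/netns/pod-a")
print(env["CNI_COMMAND"], json.loads(stdin_payload)["ipam"]["subnet"])
```

A real runtime would execute the binary named by `type` from `CNI_PATH`, pass `stdin_payload` on standard input, and parse the JSON result for the assigned IP.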

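Figure: Pod Networking Visualizer (CNI data plane and encapsulation logic). Worker Node A (node IP 10.1.0.10) hosts Pod A (10.244.1.2) and Pod B (10.244.1.3); each Pod's eth0 attaches through a veth pair to the cbr0 bridge, which connects to the node's eth0. Worker Node B (node IP 10.1.0.11) hosts Pod C (10.244.2.2) behind its own cbr0 and eth0. The 10.244.x.x addresses belong to the Pod IP space; the 10.1.0.x addresses belong to the node IP space (the underlay).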
Local Communication: When Pod A talks to Pod B on the same node, the traffic never leaves the Linux internal bridge (cbr0). It's purely virtual switching via veth pairs.
Overlay Networking (VXLAN): To cross nodes, the pod packet is encapsulated inside a regular UDP packet sent from Node A to Node B. This is why a packet capture on the underlay shows the Pod IPs nested inside a packet addressed between the Node IPs.
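
The two cases can be sketched as a single forwarding decision. This is an illustrative model, not any plugin's actual code: the node and Pod addresses are taken from the figure, while the /24 per-node Pod CIDRs and the `next_hop` helper are assumptions.

```python
import ipaddress

# Node IP -> Pod CIDR owned by that node (the /24 split is an assumption).
NODE_POD_CIDRS = {
    "10.1.0.10": ipaddress.ip_network("10.244.1.0/24"),  # Worker Node A
    "10.1.0.11": ipaddress.ip_network("10.244.2.0/24"),  # Worker Node B
}

def next_hop(local_node_ip, dst_pod_ip):
    """Decide how a packet leaving a local Pod should be forwarded."""
    dst = ipaddress.ip_address(dst_pod_ip)
    if dst in NODE_POD_CIDRS[local_node_ip]:
        # Same node: switched on the local bridge, never leaves the host.
        return ("bridge", None)
    for node_ip, cidr in NODE_POD_CIDRS.items():
        if dst in cidr:
            # Remote node: encapsulate and address the outer packet to it.
            return ("vxlan", node_ip)
    return ("drop", None)

print(next_hop("10.1.0.10", "10.244.1.3"))  # Pod A -> Pod B
print(next_hop("10.1.0.10", "10.244.2.2"))  # Pod A -> Pod C
```

The first call stays on the bridge; the second returns Node B's address as the outer destination of the encapsulated packet.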

Service Networking & Kube-Proxy

Pods are ephemeral: they die and are replaced by new Pods with new IPs all the time. Services provide a stable virtual IP (ClusterIP) that persists regardless of which Pod instances are running behind it. The mapping from a Service IP to a set of Pod IPs is handled by kube-proxy, which runs on every node and installs iptables or IPVS rules that perform load-balanced DNAT (Destination NAT) when traffic hits the Service ClusterIP.
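
The effect of those rules can be modeled in a few lines. This is a conceptual sketch only: kube-proxy programs the kernel rather than running code like this per packet, and the ClusterIP, ports, and backend list below are hypothetical values.

```python
import random

# (ClusterIP, port) -> list of (Pod IP, port) endpoints. Hypothetical values.
SERVICE_ENDPOINTS = {
    ("10.96.0.20", 80): [("10.244.1.2", 8080), ("10.244.2.2", 8080)],
}

def dnat(dst_ip, dst_port, rng=random):
    """Rewrite a Service destination to one backend Pod, chosen at random."""
    backends = SERVICE_ENDPOINTS.get((dst_ip, dst_port))
    if not backends:
        # Not a Service address: leave the destination untouched.
        return (dst_ip, dst_port)
    return rng.choice(backends)

print(dnat("10.96.0.20", 80))    # one of the two Pod endpoints
print(dnat("10.244.1.2", 8080))  # direct Pod traffic passes through unchanged
```

In the real data path the choice is made once per connection and remembered by connection tracking, so later packets of the same flow reach the same Pod.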

The MTU Tax: A Hidden Performance Trap

When using overlay encapsulation (VXLAN), every packet gains additional headers: 8 bytes VXLAN, 8 bytes UDP, 20 bytes outer IPv4, and 14 bytes for the encapsulated inner Ethernet header, for a total of 50 bytes of overhead per packet. On a standard 1500-byte MTU network, this leaves only 1450 bytes for the pod's own IP packet. If the Pod MTU is not explicitly reduced to 1450, a Pod can emit 1500-byte packets that grow to 1550 bytes after encapsulation and exceed the physical MTU, causing fragmentation or, when the Don't Fragment bit is set, silent drops.
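
The arithmetic is easy to check. The sketch below uses the header sizes quoted above for VXLAN over IPv4; the function names are illustrative, and other encapsulations or an IPv6 underlay would need different numbers.

```python
# Per-packet overhead of VXLAN over IPv4, in bytes.
VXLAN_OVERHEAD = {"inner_ethernet": 14, "vxlan": 8, "udp": 8, "outer_ip": 20}

def pod_mtu(underlay_mtu):
    """Largest Pod IP packet that still fits after encapsulation."""
    return underlay_mtu - sum(VXLAN_OVERHEAD.values())

def fits_without_fragmentation(pod_packet_size, underlay_mtu):
    """True if the encapsulated packet stays within the underlay MTU."""
    return pod_packet_size + sum(VXLAN_OVERHEAD.values()) <= underlay_mtu

print(pod_mtu(1500))                           # 1450
print(pod_mtu(9000))                           # 8950 with jumbo frames
print(fits_without_fragmentation(1500, 1500))  # False: 1550 > 1500
print(fits_without_fragmentation(1450, 1500))  # True
```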

Conclusion

Kubernetes networking is the ultimate abstraction layer. It hides the complexity of physical routing from the application developer, but it requires the platform engineer to deeply understand the tunnels, veth interfaces, iptables chains, and eBPF maps that make that abstraction possible. The evolution from Flannel to Calico to Cilium mirrors the broader industry shift from rule-based kernel networking to programmable, kernel-native data planes, a shift that will define cloud-native infrastructure for the next decade.

CNI Packet Flow: The Kernel Data Path

Understanding the kernel data path of a CNI plugin is essential for troubleshooting performance issues. When a pod sends a packet, it traverses a carefully orchestrated chain of kernel subsystems before reaching the physical NIC.

The packet originates in the pod's network namespace and exits via the **veth pair**, a virtual Ethernet cable that connects the pod namespace to the host namespace. On the host side, the packet enters the **tc (traffic control)** layer. Calico's eBPF mode attaches a BPF program here that makes routing decisions without traversing the iptables chains. The BPF program checks the destination IP against a local FIB (Forwarding Information Base) that is populated by Felix, Calico's per-node agent. If the destination is a pod on another node, the BPF program performs an **encap**, wrapping the original packet in an IPIP or VXLAN header addressed to the destination node IP, and then hands it to the host's routing stack.

If using iptables mode (standard Calico), the packet instead traverses the **PREROUTING → FORWARD → POSTROUTING** chains. Each chain applies policy rules based on the packet's source and destination IPs, port numbers, and Kubernetes NetworkPolicy selectors. The performance cost is significant: rules within a chain are evaluated in order, so a packet may be compared against hundreds of iptables rules before one matches. At 100Gbps line rate with 64-byte packets (roughly 148 million packets per second), the iptables rule scan alone can consume 4-6 CPU cores per node just for firewall enforcement.
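
The packet-rate figure follows from Ethernet framing. The sketch below assumes standard Ethernet, where each frame also occupies 8 bytes of preamble and start-of-frame delimiter plus a 12-byte inter-frame gap on the wire; the function name is illustrative.

```python
def max_packets_per_second(link_bps, frame_bytes):
    """Theoretical maximum frame rate on an Ethernet link."""
    # Each frame also costs 8 bytes of preamble/SFD and a 12-byte gap.
    wire_bytes = frame_bytes + 8 + 12
    return link_bps / (wire_bytes * 8)

pps = max_packets_per_second(100e9, 64)
print(f"{pps / 1e6:.1f} Mpps")              # 148.8 Mpps
print(f"{1e9 / pps:.2f} ns between frames")  # 6.72 ns
```

At that rate a new minimum-size frame arrives every 6.72 nanoseconds, which is why any per-packet work has to be spread across many cores.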

Cilium's eBPF path improves this by using **BPF Map Lookups** instead of rule scanning. The packet's 5-tuple (src IP, dst IP, protocol, src port, dst port) is hashed and looked up in a pre-computed BPF hash map. This is an O(1) operation on average, so the lookup cost no longer grows with the number of rules. The result is far more predictable forwarding latency, with per-packet processing overhead reported to drop from roughly 400 nanoseconds (iptables) to roughly 50 nanoseconds (eBPF). This 8x reduction in per-packet overhead is a large part of what lets Kubernetes nodes approach 100Gbps NIC line rate without dropping packets.
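
The difference between the two lookup strategies can be sketched in plain Python. This is a toy model under stated assumptions: the `FiveTuple` type, the 200 example rules, and both functions are illustrative, and a Python `dict` merely stands in for a BPF hash map.

```python
from typing import NamedTuple

class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    proto: str
    src_port: int
    dst_port: int

def linear_verdict(rules, pkt):
    """iptables-style: walk the rules in order until one matches."""
    for checked, (match, verdict) in enumerate(rules, start=1):
        if match == pkt:
            return verdict, checked
    return "DROP", len(rules)

def map_verdict(policy_map, pkt):
    """Map-style: one hash lookup, independent of the number of entries."""
    return policy_map.get(pkt, "DROP")

# 200 hypothetical allow rules, one per destination Pod.
rules = [
    (FiveTuple("10.244.1.2", f"10.244.2.{i}", "tcp", 40000, 80), "ACCEPT")
    for i in range(2, 202)
]
policy_map = dict(rules)

pkt = FiveTuple("10.244.1.2", "10.244.2.201", "tcp", 40000, 80)
print(linear_verdict(rules, pkt))    # ('ACCEPT', 200): matched the last rule
print(map_verdict(policy_map, pkt))  # ACCEPT after a single lookup
```

The linear version does work proportional to the rule's position in the list, while the map version does the same amount of work whether there are 200 entries or 200,000.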

IPAM Contention: CIDR Allocation and Pod Density Limits

IP Address Management (IPAM) is one of the most frequently underestimated scaling bottlenecks in Kubernetes networking. Each pod requires a unique IP address within the cluster's CIDR range, and the CNI plugin is responsible for allocating and releasing these addresses as pods are created and destroyed. The IPAM module's performance directly determines the cluster's pod launch latency, node density limits, and the risk of IP exhaustion under churn.

CNI plugins follow two broad IPAM models: **HostLocal** and **Cluster-Wide**. HostLocal IPAM (the model behind the `host-local` plugin commonly used with Flannel, and conceptually close to the per-node pod CIDRs handed out by Cilium's default cluster-pool mode) pre-allocates a CIDR block to each node. When a pod is created on node A, the CNI plugin picks an IP from node A's pre-allocated block without communicating with any central controller. This keeps allocation latency low and independent of cluster size, because the allocation is a purely local operation on the node's own record of used addresses. The tradeoff is IP fragmentation: if node A has a /24 block (256 IPs) but only runs 10 pods, 246 IPs are stranded and cannot be used by pods on node B, even if node B is running 200 pods and approaching its own block's limit.
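
A toy per-node allocator shows the fragmentation effect. The `NodeBlockAllocator` class is hypothetical and counts all 256 addresses of a /24 as usable to match the numbers above; real plugins typically reserve a few addresses, for example for the gateway.

```python
import ipaddress

class NodeBlockAllocator:
    """Hands out IPs from one node's block with no central coordination."""

    def __init__(self, cidr):
        # Simplification: every address in the block is treated as usable.
        self.free = list(ipaddress.ip_network(cidr))
        self.used = set()

    def allocate(self):
        if not self.free:
            raise RuntimeError("node block exhausted")
        ip = self.free.pop(0)
        self.used.add(ip)
        return ip

    def release(self, ip):
        self.used.remove(ip)
        self.free.append(ip)

node_a = NodeBlockAllocator("10.244.1.0/24")
node_b = NodeBlockAllocator("10.244.2.0/24")
for _ in range(10):
    node_a.allocate()
for _ in range(256):
    node_b.allocate()

try:
    node_b.allocate()
except RuntimeError as exc:
    # Node B is full even though node A has plenty of unused addresses.
    print("node B:", exc, "| node A still has", len(node_a.free), "free")
```

Node B fails on its 257th allocation while node A holds 246 addresses it cannot lend out.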

Cluster-Wide IPAM uses a central IP allocation store, typically the Kubernetes API server or an etcd instance, to allocate IPs from a shared pool. Some custom CNIs work this way, and Calico's default `calico-ipam` is a hybrid that claims small blocks from shared IP pools on demand rather than fixing one large block per node. Allocating from a shared pool eliminates fragmentation: all 65,536 IPs in a /16 cluster CIDR are available to all pods regardless of which node they run on. However, every pod creation now requires an API call to the central store, which adds 10-50ms of latency per pod and creates a potential bottleneck when launching pods in parallel. In a cluster burst-scaling 1,000 pods simultaneously, the API server must handle 1,000 IP allocation requests in rapid succession, which can trigger API server throttling and cause pod startup delays of 30-60 seconds.

The IP reuse cycle is equally important. When a pod is deleted, its IP is returned to the pool. However, peers that talked to the old pod, and connection-tracking tables on the nodes along the path, may still hold state for that IP and port combination, such as sockets in TIME_WAIT. If the IP is immediately reassigned to a new pod, the new pod may receive TCP segments intended for the old pod's connection, which is a security and data corruption risk. Some CNI plugins address this with a **Graceful IP Release** mechanism that holds the IP in a quarantine state for a configurable period (typically 30-120 seconds, in line with the Linux kernel's fixed TIME_WAIT duration of 60 seconds). During this quarantine, the IP is unavailable for allocation, reducing the effective IP pool size in high-churn environments. For clusters running 10,000 pods with an average pod lifetime of 5 minutes, the churn rate is approximately 33 pod creations per second, and depending on how small the pool is relative to that churn, the quarantine mechanism can reduce the effective IP pool by up to 25%.
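
The churn arithmetic can be checked with a steady-state estimate: addresses in quarantine equal the release rate multiplied by the quarantine period. The function names are illustrative, and the /18 pool used for the percentage is an assumption chosen to show how a figure near 25% can arise.

```python
def churn_rate(pods, avg_lifetime_s):
    """Steady-state pod creations (and deletions) per second."""
    return pods / avg_lifetime_s

def quarantined_ips(pods, avg_lifetime_s, quarantine_s):
    """Average number of IPs held in quarantine at any moment."""
    return churn_rate(pods, avg_lifetime_s) * quarantine_s

rate = churn_rate(10_000, 5 * 60)
held = quarantined_ips(10_000, 5 * 60, 120)
pool_size = 2 ** (32 - 18)  # a /18 pool, assumed for illustration

print(f"{rate:.1f} pods/s")                            # 33.3 pods/s
print(f"{held:.0f} IPs in quarantine")                 # 4000
print(f"{held / pool_size:.0%} of a /18 pool")         # 24%
```

With a 120-second quarantine, about 4,000 addresses are unavailable at any time; against a full /16 pool the same 4,000 addresses would be closer to 6%.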

The scaling limit of HostLocal IPAM is reached when the per-node CIDR block is too small to accommodate the node's pod density. In a standard cluster with a /16 global CIDR (65,536 IPs) and 100 nodes, each node receives a /24 block (256 IPs). If a node runs 200 pods (feasible on modern 16-32 vCPU nodes once the kubelet's default limit of 110 pods per node is raised), the node has only 56 spare IPs, which is thin headroom for rolling updates and pod churn. The solution is to give each node a larger block, such as a /23 (512 IPs), growing the global CIDR to a /14 (262,144 IPs) if more room is needed for additional nodes, or to switch to Cluster-Wide IPAM. When choosing between the two models, the decision hinges on pod density variance across nodes: if some nodes consistently run 200 pods while others run 10, Cluster-Wide IPAM is essential; if pod density is uniform, HostLocal IPAM's lower latency is preferable.
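
The block-sizing trade-off is simple prefix arithmetic for IPv4. The `node_blocks` helper below is illustrative and uses only the standard `ipaddress` module.

```python
import ipaddress

def node_blocks(cluster_cidr, node_prefix):
    """Return (number of per-node blocks, IPs per block) for an IPv4 CIDR."""
    net = ipaddress.ip_network(cluster_cidr)
    count = 2 ** (node_prefix - net.prefixlen)
    size = 2 ** (32 - node_prefix)
    return count, size

print(node_blocks("10.244.0.0/16", 24))  # (256, 256)
print(node_blocks("10.244.0.0/16", 23))  # (128, 512)
print(node_blocks("10.240.0.0/14", 23))  # (512, 512)
```

A /16 split into /23 blocks doubles per-node capacity but halves the number of nodes the cluster CIDR can serve, which is why larger blocks and a larger global CIDR tend to go together.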

NetworkPolicy Enforcement: iptables vs. eBPF at Scale

Kubernetes NetworkPolicy is the primary mechanism for enforcing micro-segmentation between pods. When a NetworkPolicy object is created, the CNI plugin translates its declarative rules (pod selectors, namespace selectors, IP blocks, and port ranges) into kernel-level enforcement rules. The enforcement technology, whether iptables, nftables, or eBPF, has a dramatic impact on cluster performance, scalability, and operational complexity.

In an iptables-based CNI (Calico's default mode), each NetworkPolicy rule is translated into multiple iptables chains and rules that are reached from the `FORWARD` chain. A simple policy that allows ingress on TCP port 80 from pods with label `app: web` generates 6-8 iptables rules. In a cluster with 1,000 NetworkPolicy objects, many of which carry several rules, the total iptables rule count can exceed 10,000. A packet traversing the FORWARD chain may have to be compared against thousands of these rules, one after another, before a match is found. At 100Gbps line rate with 64-byte packets (roughly 148 million packets per second), the iptables rule scan consumes 6-8 CPU cores per node just for firewall enforcement. The `iptables-save` command, which lists all rules, takes 5-15 seconds to complete on a node with 10,000+ rules, making effective debugging nearly impossible.

eBPF-based CNIs (Cilium, Calico's eBPF mode) fundamentally change this architecture. Instead of linear rule chains, eBPF maps store policy rules in hash tables keyed by a tuple of security identity, destination port, and protocol. When a packet arrives, the eBPF program performs a single hash lookup, which takes O(1) time, to determine whether the packet is allowed or denied. This eliminates the linear scan entirely. In Cilium's implementation, the security identity is a 32-bit number derived from the pod's label set, not from the pod's IP address. This means that a pod's identity persists across rescheduling, and policy lookups remain efficient even as pods churn.
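
The idea of keying policy on identity rather than IP can be sketched as follows. This is an illustration, not Cilium's implementation: Cilium assigns identities through its own allocator, and the CRC32 used here is only a stand-in for "a 32-bit number determined by the label set". The `security_identity` and `verdict` helpers and the example labels are hypothetical.

```python
import zlib

def security_identity(labels):
    """Map a label set to a 32-bit number (CRC32 is a stand-in allocator)."""
    canonical = ";".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return zlib.crc32(canonical.encode("utf-8"))

# Policy map keyed by (source identity, destination port, protocol).
web_id = security_identity({"app": "web"})
policy_map = {(web_id, 80, "tcp"): "ALLOW"}

def verdict(src_labels, dst_port, proto):
    """Single map lookup; the source Pod's IP never enters the key."""
    key = (security_identity(src_labels), dst_port, proto)
    return policy_map.get(key, "DENY")

print(verdict({"app": "web"}, 80, "tcp"))   # ALLOW
print(verdict({"app": "web"}, 443, "tcp"))  # DENY
print(verdict({"app": "db"}, 80, "tcp"))    # DENY
```

Because the key depends only on labels, a rescheduled `app: web` Pod with a new IP matches the same map entry without any policy update.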

The performance difference is stark at scale. In a benchmark with 5,000 NetworkPolicy rules, iptables-based enforcement adds 150-300 microseconds of per-packet latency at P99, while eBPF-based enforcement adds 5-15 microseconds. The CPU consumption per 1 million packets per second is 0.8 core for iptables versus 0.05 core for eBPF. In a large cluster running at 10 million PPS, this translates to 8 dedicated CPU cores for iptables versus 0.5 core for eBPF — a 16x efficiency improvement. The memory footprint is also dramatically different: iptables chains consume approximately 10KB per rule (100MB for 10,000 rules), while eBPF maps consume approximately 50 bytes per rule entry (500KB for 10,000 rules).

The one remaining advantage of iptables-based enforcement is debugging transparency. Every iptables rule can be inspected with standard Linux tools (`iptables -L -n -v`), and the packet counters per rule provide immediate visibility into which policies are being hit. eBPF maps require specialized tooling (bpftool, Cilium's CLI) to inspect, which has a steeper learning curve for operations teams. However, Cilium's Hubble observability layer largely closes this gap by providing rich per-flow telemetry with policy verdict annotations. The industry trend is clear: eBPF is becoming the default choice for NetworkPolicy enforcement in new clusters, with iptables retained only for compatibility with older kernels (pre-5.10) or specialized security audit requirements that mandate rule-level packet counters.
