VoIP Performance: RTP & SIP Analysis

1. SIP vs. RTP: The Control and the Media

Every VoIP call consists of two entirely different types of traffic, each with its own protocol and transport requirements. Understanding this separation is the first step to effective VoIP troubleshooting: a problem in call setup is a SIP issue; a problem in call quality during a conversation is an RTP/network issue.

SIP (Session Initiation Protocol): The "Control Plane". Like a telephone operator, it locates the callee (via DNS SRV records to a SIP proxy), negotiates the media parameters (codecs, sample rates) through SDP (Session Description Protocol), and establishes the session. Typically runs over TCP or UDP port 5060, or TLS on 5061.
RTP (Real-time Transport Protocol): The "Data Plane". Once SIP completes the handshake and both endpoints agree on a codec and port pair via SDP, SIP "gets out of the way" and the two phones exchange raw audio data directly using RTP over UDP on dynamically negotiated high ports (typically 16384–32768).

2. Codecs & The Math of Packetization

The codec negotiated during SIP/SDP determines the packet size, bandwidth consumption, and the Algorithmic Delay.

Bandwidth Calculation (L2 Overhead)

A common engineering mistake is only accounting for the codec's payload bitrate. In reality, the Encapsulation Overhead (L2/L3/L4) is significant.

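BW_{Total} = \frac{8 \times (PayloadSize + Headers(40B + L2))}{PacketTime}

Sizes are in bytes and PacketTime is in seconds, so the factor of 8 converts the result to bits per second. The 40 B covers IP (20 B), UDP (8 B), and RTP (12 B).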

G.711 (20 ms): Total BW $\approx 87.2$ Kbps (160 B payload; includes Ethernet 18B, IP 20B, UDP 8B, RTP 12B)
G.729 (20 ms): Total BW $\approx 31.2$ Kbps (20 B payload against 58 B of headers. Despite the 8 Kbps payload, overhead is 290%!)
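The two figures above can be reproduced with a few lines of Python. This is a minimal sketch: the function name and its parameters are illustrative, and the payload sizes (160 B for G.711, 20 B for G.729 at 20 ms) follow from the codec bitrates of 64 Kbps and 8 Kbps.

```python
def voip_bandwidth_kbps(payload_bytes, packet_time_ms, l2_bytes=18, l3_l4_bytes=40):
    """Per-direction bandwidth of one RTP stream in Kbps.

    l3_l4_bytes = IP (20) + UDP (8) + RTP (12); l2_bytes = Ethernet (18).
    """
    packet_bits = 8 * (payload_bytes + l3_l4_bytes + l2_bytes)
    # bits per millisecond is numerically equal to kilobits per second
    return packet_bits / packet_time_ms


g711 = voip_bandwidth_kbps(payload_bytes=160, packet_time_ms=20)
g729 = voip_bandwidth_kbps(payload_bytes=20, packet_time_ms=20)

print(f"G.711: {g711:.1f} Kbps")  # 87.2
print(f"G.729: {g729:.1f} Kbps")  # 31.2
```

Changing `l2_bytes` shows how the same call costs more or less on other link types, which is why the L2 term is kept separate from the fixed 40 B.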

G.711: The gold standard for fidelity ( $R \approx 94$ ). Uses Pulse Code Modulation (PCM). Minimal CPU overhead but highest bandwidth.
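Opus: The "Swiss Army Knife" ( $R \le 100$ ). It uses Linear Prediction (LP) for speech and MDCT for music. Its in-band Forward Error Correction (FEC) mode lets each packet carry a lower-bitrate copy of the previous frame, so the decoder can rebuild an isolated lost packet from the one that follows it. This keeps speech usable under random loss approaching 20%, although recovery is not seamless when consecutive packets are lost.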

3. The E-Model (ITU G.107): Predicting MOS

The industry standard for estimating VoIP quality is the E-Model. It produces a transmission rating factor ($R$), which maps to a Mean Opinion Score (MOS).

R = R_o - I_s - I_d - I_{ef,ext} + A

R_o: Basic signal-to-noise ratio ( $\approx 94$ ).
I_s: Simultaneous impairment (loudness, quantizing).
I_d: Delay impairment (talker overlap, echo).
I_{ef,ext}: Effective equipment impairment (packet loss, codec choice). G.107 writes this term as Ie-eff.
A: Advantage factor (user's patience with mobile/satellite links).

The MOS Mapping Function

Once we have $R$ , we map it to the 1-5 MOS scale:

MOS = 1 + 0.035R + R(R - 60)(100 - R) \cdot 7 \times 10^{-6}
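As a rough sketch, the E-Model sum and the MOS mapping can be written as two small Python functions. The function names and the impairment values in the example are hypothetical, chosen only to exercise the formula. The clamping for R below 0 and above 100 is not stated above; it is taken from the conversion as defined in G.107.

```python
def r_factor(r_o=94.0, i_s=0.0, i_d=0.0, i_e_eff=0.0, a=0.0):
    """R = R_o - I_s - I_d - I_e_eff + A."""
    return r_o - i_s - i_d - i_e_eff + a


def r_to_mos(r):
    """Map the rating factor R to the 1-5 MOS scale."""
    if r <= 0:
        return 1.0   # assumption from G.107: MOS is floored at 1
    if r >= 100:
        return 4.5   # assumption from G.107: MOS is capped at 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6


# Hypothetical impairments: 10 points of delay, 11 points of codec/loss.
r = r_factor(r_o=94.0, i_d=10.0, i_e_eff=11.0)
print(f"R = {r:.1f}, MOS = {r_to_mos(r):.2f}")
```

The cubic term is what makes the curve flatten near the top: large changes in R above about 90 move the MOS only slightly.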

4. PLC: Healing the Stream

When an RTP packet is lost, a well-designed decoder does not simply play silence. It uses Packet Loss Concealment (PLC) to bridge the gap. Common strategies, from crudest to most sophisticated:

Zero Stuffing: (Low quality) Just fills the gap with silence. Causes a "pop" sound.
Waveform Substitution: Copies the previous 20ms of audio and fades it in. Works well for steady vowels but fails on percussive sounds (T, P, K).
LPC Interpolation: (High quality) Used in G.729 and Opus. The decoder uses the vocal tract model parameters from previous packets to synthesize a "plausible" next segment of speech.
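The first two strategies in the list above are simple enough to sketch in Python. This is an illustrative toy, not any codec's actual PLC: frames are plain lists of PCM samples, a lost frame is represented by `None`, and the per-repeat attenuation factor is an assumption added so that a long burst fades out instead of looping.

```python
FRAME_SAMPLES = 160  # 20 ms of audio at 8 kHz


def conceal(frames, attenuation=0.5):
    """Replace lost frames (None) using waveform substitution.

    Falls back to zero stuffing when no earlier frame is available.
    """
    out = []
    last_good = None
    gain = 1.0
    for frame in frames:
        if frame is not None:
            last_good = frame
            gain = 1.0                           # reset after a good frame
            out.append(list(frame))
        elif last_good is None:
            out.append([0] * FRAME_SAMPLES)      # zero stuffing
        else:
            gain *= attenuation                  # fade on consecutive losses
            out.append([int(s * gain) for s in last_good])
    return out


stream = [[100] * FRAME_SAMPLES, None, None, [50] * FRAME_SAMPLES]
repaired = conceal(stream)
print([f[0] for f in repaired])  # first sample of each output frame
```

A model-based PLC such as the LPC approach would replace the copy step with synthesis from the previous frame's parameters; that part is beyond a short sketch.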

[Interactive figure: RTP Jitter Buffer Simulation. Shows how network variance (jitter) affects voice packet delivery along the path Sender → WAN Transit → Jitter Buffer → Endpoint, with counters for RTP packets played and packets dropped or late.]
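The behavior that the jitter buffer simulation illustrates can be approximated offline. The sketch below assumes a fixed-depth playout buffer, packets sent every 20 ms, and jitter drawn uniformly between 0 and `jitter_ms`. All names and default values are illustrative assumptions, and real jitter buffers are usually adaptive.

```python
import random


def simulate_jitter_buffer(num_packets, jitter_ms, buffer_ms,
                           ptime_ms=20, base_delay_ms=40, seed=1):
    """Return (played, late) counts for a fixed-depth jitter buffer."""
    rng = random.Random(seed)          # seeded so runs are repeatable
    played = late = 0
    for seq in range(num_packets):
        send_time = seq * ptime_ms
        arrival = send_time + base_delay_ms + rng.uniform(0, jitter_ms)
        # The packet must arrive before its scheduled playout instant.
        deadline = send_time + base_delay_ms + buffer_ms
        if arrival <= deadline:
            played += 1
        else:
            late += 1                  # too late to play: treated as lost
    return played, late


for depth in (20, 40, 80):
    played, late = simulate_jitter_buffer(1000, jitter_ms=100, buffer_ms=depth)
    print(f"buffer {depth:>2} ms: played={played} late={late}")
```

A deeper buffer discards fewer packets but adds its full depth to the mouth-to-ear delay, which feeds back into the delay impairment term of the E-Model.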

5. Common Issues: NAT Traversal and One-Way Audio

The most common VoIP problem is "One-Way Audio" — User A can hear User B, but not vice versa. This symptom almost always points to a NAT/Firewall traversal failure.

The root cause: SIP's SDP body contains the sender's private IP address as the RTP destination. When the SIP packet crosses a NAT gateway, the outer IP header is translated, but the SDP body (application-layer payload) is not — it still contains the private RFC 1918 address. The remote party sends RTP to an unreachable private IP and the audio is lost in one direction.
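When troubleshooting, the quickest check is to read the SDP body from a captured INVITE or 200 OK and see which address it advertises. The sketch below uses only the Python standard library; the SDP text is a made-up example, and the helper names are hypothetical. Note that `ipaddress`'s `is_private` covers more than RFC 1918 (for example loopback and link-local ranges), which is acceptable for this purpose.

```python
import ipaddress

# Made-up SDP offer from a phone behind NAT.
EXAMPLE_SDP = """v=0
o=alice 2890844526 2890844526 IN IP4 192.168.1.20
s=-
c=IN IP4 192.168.1.20
t=0 0
m=audio 16384 RTP/AVP 0
a=rtpmap:0 PCMU/8000
"""


def rtp_target(sdp_text):
    """Return (address, port) where the sender asks to receive audio."""
    addr, port = None, None
    for line in sdp_text.splitlines():
        if line.startswith("c=IN IP4 "):
            addr = line.split()[2]       # c=IN IP4 <address>
        elif line.startswith("m=audio "):
            port = int(line.split()[1])  # m=audio <port> <proto> <fmt>
    return addr, port


def is_private_address(addr):
    """True if the address is not reachable from the public internet."""
    return ipaddress.ip_address(addr).is_private


addr, port = rtp_target(EXAMPLE_SDP)
if is_private_address(addr):
    print(f"SDP advertises private address {addr}:{port}; expect one-way audio")
```

If the advertised address is private while the packet's source address is public, the SDP was not rewritten on its way through the NAT.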

Solutions include STUN/TURN/ICE (WebRTC standard), SIP ALG on the firewall (often counterproductive and best disabled), or a Session Border Controller (SBC) that acts as a media relay with full NAT awareness.

6. Case Study: The WMM Priority Inversion

In a high-density Wi-Fi 6 environment, we encountered "choppy audio" even though voice packets were marked with DSCP 46 (EF).

The Forensic Discovery

We discovered QoS Priority Inversion. While the wired network respected the EF tag, the Wi-Fi AP's EDCA (Enhanced Distributed Channel Access) queue for Voice (AC_VO) was being overwhelmed by high-throughput clients using the Best Effort (AC_BE) queue.

Because the AC_BE clients were using 160MHz channels and long TXOPs (Transmit Opportunities), the VoIP clients (using 20MHz legacy modes) couldn't win the contention for the airwaves, causing "Channel Contention Jitter" that exceeded 100ms.

Resolution: Enabling Airtime Fairness and limiting the maximum TXOP for data clients ensured the small VoIP packets could "cut in line" as intended by the 802.11e standard.
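For reference, the DSCP 46 (EF) marking mentioned in this case study is applied by the sending host. The sketch below shows one way an application could do this on a UDP socket; the helper names are hypothetical, and whether the marking is honored depends on the operating system and on every device along the path, which is exactly what the case study illustrates.

```python
import socket

DSCP_EF = 46  # Expedited Forwarding


def dscp_to_tos(dscp):
    """DSCP occupies the upper six bits of the IPv4 TOS byte."""
    return dscp << 2


def open_marked_udp_socket(dscp=DSCP_EF):
    """Create a UDP socket whose outgoing packets carry the given DSCP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # IP_TOS is honored on Linux and macOS; some platforms ignore it.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock


print(hex(dscp_to_tos(DSCP_EF)))  # 0xb8
```

A packet capture filtered on the TOS value 0xb8 confirms whether the marking survives each hop between the phone and the access point.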

Opus Codec: Adaptive Bitrate Under Constrained Latency

The Opus codec (RFC 6716) is the dominant audio codec for modern VoIP and WebRTC, covering the full range from 6 kbps narrowband speech to 510 kbps full-band stereo music. What distinguishes Opus from legacy codecs (G.711, G.729, AMR) is its support for Adaptive Bitrate (ABR) operation: the encoding bitrate can be changed on the fly, frame by frame and in fine-grained steps, without renegotiating the session, so the sender can track real-time network conditions. Opus combines SILK (speech-optimized, LPC-based coding) and CELT (Constrained Energy Lapped Transform) to achieve its wide operating range. The encoder chooses between them based on bitrate and audio bandwidth: at low bitrates (roughly below 24 kbps) it favors SILK, which is optimized for intelligibility rather than fidelity; as the bitrate rises, CELT progressively takes over. A simple heuristic for choosing the target bitrate from measured throughput and loss is:

B_{opus} = \min\left(B_{max},\; \frac{1.2 \cdot R_{avg}}{\text{PLR} + 0.05}\right)

B_{max}: Maximum allowed bitrate
R_{avg}: Average throughput measured over 5 seconds
PLR: Packet Loss Rate (0 to 1)
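The heuristic above is easy to evaluate directly. The sketch below is a literal transcription of the formula, not the behavior of any particular Opus encoder or WebRTC stack; the function name and the 64 kbps default ceiling are assumptions for illustration.

```python
def target_bitrate_bps(r_avg_bps, plr, b_max_bps=64_000):
    """B_opus = min(B_max, 1.2 * R_avg / (PLR + 0.05))."""
    if not 0.0 <= plr <= 1.0:
        raise ValueError("plr must be between 0 and 1")
    return min(b_max_bps, 1.2 * r_avg_bps / (plr + 0.05))


for plr in (0.0, 0.25, 1.0):
    rate = target_bitrate_bps(r_avg_bps=10_000, plr=plr)
    print(f"PLR={plr:.2f}: target {rate:.0f} bps")
```

Because the denominator is small when loss is low, the result is capped by `b_max_bps` in most healthy conditions; the loss term only starts to bite as PLR grows.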

When packet loss exceeds 5%, the ABR logic lowers the bitrate spent on the primary audio to make room for more forward error correction (FEC) within the same payload budget. The Opus FEC layer transmits a redundant, lower-bitrate copy of the previous frame alongside each current frame, so the receiver can rebuild an isolated lost packet from the one that follows it. Longer gaps, such as a 100 ms loss burst, cannot be rebuilt from FEC alone and fall back on PLC. The trade-off is bandwidth: FEC consumes an additional 20-30% of the base bitrate. In WebRTC, the receiver's REMB (Receiver Estimated Maximum Bitrate) feedback message informs the sender of the network's current capacity, creating a closed-loop control system in which the codec bitrate tracks the available bandwidth, typically converging within 1-2 seconds.

Selective Forwarding Unit: The Scalability Architecture

Modern video conferencing systems (Zoom, Google Meet, Jitsi) do not use the full-mesh topology of early multiparty VoIP, in which every participant sends their stream to every other participant, so each sender needs N-1 uplink streams and the total grows on the order of N². Instead, they use a Selective Forwarding Unit (SFU), a centralized server that receives each participant's stream once and selectively forwards it to the other participants. The SFU decouples the sender's bitrate from each receiver's available bandwidth. It does this by forwarding rather than transcoding: the sender typically publishes several quality layers (simulcast or scalable video coding), and the SFU forwards, for example, a 720p layer to mobile users and a 1080p layer to desktop users, each on its own forward path. The aggregate bitrate the SFU must handle is:

R_{SFU} = \sum_{i=1}^{N} R_{s,i} + \sum_{i=1}^{N} \sum_{j \neq i} R_{f,ij}

R_{s,i}: Inbound bitrate from participant i
R_{f,ij}: Outbound bitrate from participant i to participant j
N: Number of participants
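To get a feel for how this sum scales, the sketch below evaluates it both in its general form and for the simplified case where every participant sends and is forwarded at the same rate. The function names and the example bitrates are illustrative assumptions.

```python
def sfu_total_bitrate(inbound, outbound):
    """R_SFU for inbound[i] = R_s,i and outbound[i][j] = R_f,ij.

    Entries with i == j are ignored: nobody receives their own stream.
    """
    n = len(inbound)
    total_out = sum(outbound[i][j]
                    for i in range(n) for j in range(n) if j != i)
    return sum(inbound) + total_out


def uniform_sfu_bitrate(n, r_send, r_forward):
    """Special case: identical send and forward rates for everyone."""
    return n * r_send + n * (n - 1) * r_forward


# Four participants, each sending 2 Mbps and forwarded at 1 Mbps.
n = 4
general = sfu_total_bitrate([2.0] * n, [[1.0] * n for _ in range(n)])
print(general, uniform_sfu_bitrate(n, 2.0, 1.0))  # both in Mbps
```

The inbound term grows linearly with N while the outbound term grows with N(N-1), so server egress, not ingress, is what limits large rooms.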

The SFU's critical performance bottleneck is the Packet Processing Pipeline: each incoming RTP packet must be decrypted (SRTP), inspected for sequence number continuity, placed into a per-participant output queue, and re-encrypted for each downstream receiver. At 50 participants with 1080p/30fps streams, the SFU processes approximately 150,000 RTP packets per second. The queuing delay within the SFU's forwarding logic—caused by head-of-line blocking in the decryption stage—is the dominant source of SFU-Induced Jitter. Modern SFU implementations use lock-free ring buffers and SIMD-accelerated AES-GCM decryption to keep the per-packet processing below 500 nanoseconds, ensuring that the SFU adds no more than 1 millisecond of jitter to the end-to-end path.

Conclusion

VoIP engineering is the discipline of minimizing jitter and protecting real-time flows from the inherent unpredictability of packet-switched networks. By understanding the separation of SIP and RTP, selecting the appropriate codec for the link budget, tuning QoS markings, and deploying proper NAT traversal infrastructure, engineers can deliver consistent, toll-quality voice on the same infrastructure that carries best-effort internet traffic.
