VoIP Performance Analysis
The Engineering of Real-Time Voice
1. SIP vs. RTP: The Control and the Media
Every VoIP call consists of two entirely different types of traffic operating at different layers with different protocol requirements. Understanding this separation is the first step to effective VoIP troubleshooting: a problem in call setup is a SIP issue; a problem in call quality during a conversation is an RTP/network issue.
- SIP (Session Initiation Protocol): The "Control Plane". Like a telephone operator, it locates the callee (via DNS SRV records to a SIP proxy), negotiates the media parameters (codecs, sample rates) through SDP (Session Description Protocol), and establishes the session. Typically runs over TCP or UDP port 5060, or TLS on 5061.
- RTP (Real-time Transport Protocol): The "Data Plane". Once SIP completes the handshake and both endpoints agree on a codec and port pair via SDP, SIP "gets out of the way" and the two phones exchange raw audio data directly using RTP over UDP on dynamically negotiated high ports (typically 16384–32768).
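To make the SIP/RTP hand-off concrete, here is a minimal sketch of pulling the RTP destination and offered codecs out of an SDP body. The sample offer, its addresses, and the function name `parse_sdp_audio` are hypothetical illustrations; a production parser would also handle media-level `c=` lines, multiple `m=` sections, and IPv6.

```python
# Hypothetical SDP offer; the address, port, and payload types are illustrative only.
SAMPLE_SDP = """\
v=0
o=alice 2890844526 2890844526 IN IP4 192.0.2.10
s=-
c=IN IP4 192.0.2.10
t=0 0
m=audio 16390 RTP/AVP 0 18
a=rtpmap:0 PCMU/8000
a=rtpmap:18 G729/8000
"""


def parse_sdp_audio(sdp: str) -> dict:
    """Extract the RTP address, port, and codec map from an SDP body."""
    info = {"address": None, "port": None, "codecs": {}}
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("c="):
            # c=<nettype> <addrtype> <connection-address>
            info["address"] = line[2:].split()[2]
        elif line.startswith("m=audio"):
            # m=<media> <port> <proto> <payload type list>
            info["port"] = int(line[2:].split()[1])
        elif line.startswith("a=rtpmap:"):
            # a=rtpmap:<payload type> <encoding name>/<clock rate>
            pt, encoding = line[len("a=rtpmap:"):].split(None, 1)
            info["codecs"][int(pt)] = encoding
    return info


offer = parse_sdp_audio(SAMPLE_SDP)
print(offer)
```

The `address` and `port` fields are where the remote party will send RTP, which is why a problem here shows up as a media issue even though it was introduced during SIP signaling.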
2. Codecs & The Math of Packetization
The codec negotiated during SIP/SDP determines the packet size, bandwidth consumption, and the Algorithmic Delay.
Bandwidth Calculation (L2 Overhead)
A common engineering mistake is only accounting for the codec's payload bitrate. In reality, the Encapsulation Overhead (L2/L3/L4) is significant.
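G.711 (20ms)
Total BW: 87.2 Kbps
(160B payload + 58B headers = 218B per packet at 50 packets/s; headers are Ethernet 18B, IP 20B, UDP 8B, RTP 12B)
G.729 (20ms)
Total BW: 31.2 Kbps
(20B payload + 58B headers = 78B per packet at 50 packets/s; despite the 8 Kbps payload, the headers add 290% on top of it)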
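The per-call bandwidth can be worked out from the header sizes listed above. The sketch below assumes the nominal codec bitrates (64 Kbps for G.711, 8 Kbps for G.729) and one RTP packet per packetization interval; the function name `voip_bandwidth_kbps` is a hypothetical helper, and the calculation ignores silence suppression, RTCP, and any VLAN or tunnel headers.

```python
# Header sizes in bytes, as listed in the text.
ETHERNET, IP, UDP, RTP = 18, 20, 8, 12


def voip_bandwidth_kbps(codec_kbps: float, ptime_ms: float = 20.0) -> float:
    """Return on-the-wire bandwidth for one direction of a call, in Kbps."""
    # Kbps multiplied by milliseconds gives bits; divide by 8 for bytes.
    payload_bytes = codec_kbps * ptime_ms / 8
    packet_bytes = payload_bytes + ETHERNET + IP + UDP + RTP
    packets_per_second = 1000 / ptime_ms
    return packet_bytes * 8 * packets_per_second / 1000


g711 = voip_bandwidth_kbps(64)  # 160-byte payload per 20 ms packet
g729 = voip_bandwidth_kbps(8)   # 20-byte payload per 20 ms packet
print(round(g711, 1), round(g729, 1))
```

Raising `ptime_ms` reduces the packet rate and therefore the header share, at the cost of added packetization delay.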
- G.711: The gold standard for fidelity. Uses Pulse Code Modulation (PCM). Minimal CPU overhead but highest bandwidth.
- Opus: The "Swiss Army Knife". It uses Linear Prediction (LP) for speech and MDCT for music. Its Forward Error Correction (FEC) mode lets each packet carry a lower-bitrate copy of the previous packet's audio, so the decoder can rebuild a lost packet from its successor and recover from random loss rates of up to about 20%.
3. The E-Model (ITU G.107): Predicting MOS
The industry standard for calculating VoIP quality is the E-Model. It produces a transmission rating factor ($R$), which maps to the MOS score:
$$R = R_o - I_s - I_d - I_e + A$$
- $R_o$: Basic signal-to-noise ratio.
- $I_s$: Simultaneous impairment (loudness, quantizing).
- $I_d$: Delay impairment (talker overlap, echo).
- $I_e$: Equipment impairment (packet loss/codec choice).
- $A$: Advantage factor (user's patience with mobile/satellite links).
The MOS Mapping Function
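Once we have $R$, we map it to the 1-5 MOS scale:
$$\text{MOS} = \begin{cases} 1 & R \le 0 \\ 1 + 0.035R + 7 \times 10^{-6}\, R(R - 60)(100 - R) & 0 < R < 100 \\ 4.5 & R \ge 100 \end{cases}$$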
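As a quick illustration, the G.107 mapping from the rating factor to MOS can be sketched in a few lines. The function name `r_to_mos` and the sample R values are illustrative choices, not part of the standard.

```python
def r_to_mos(r: float) -> float:
    """Map an E-Model rating factor R to an estimated MOS (ITU-T G.107 mapping)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6


# Illustrative R values: a near-ideal call, a mediocre one, and a poor one.
for r in (93.2, 70, 50):
    print(r, round(r_to_mos(r), 2))
```

Note that the mapping saturates at 4.5 rather than 5, so even an unimpaired call does not reach the top of the scale.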
4. PLC: Healing the Stream
When an RTP packet is lost, the decoder doesn't just play silence. It uses Packet Loss Concealment (PLC) to bridge the gap.
- Zero Stuffing: (Low quality) Just fills the gap with silence. Causes a "pop" sound.
- Waveform Substitution: Copies the previous 20ms of audio and fades it in. Works well for steady vowels but fails on percussive sounds (T, P, K).
- LPC Interpolation: (High quality) Used in G.729 and Opus. The decoder uses the vocal tract model parameters from previous packets to synthesize a "plausible" next segment of speech.
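The first two strategies can be sketched together: repeat the last good frame when a packet is missing, and fall back to silence when there is nothing to repeat. This is a simplified illustration, not a codec's actual concealer. The name `conceal_losses`, the `decay` attenuation applied to each repeated copy, and the 160-sample default frame (20 ms at 8 kHz) are assumptions; production concealers typically also align the repeated audio to the pitch period and cross-fade at the boundaries.

```python
from typing import List, Optional

Frame = List[float]


def conceal_losses(frames: List[Optional[Frame]],
                   frame_len: int = 160,
                   decay: float = 0.5) -> List[Frame]:
    """Replace lost frames (None) with attenuated copies of the last good frame."""
    output: List[Frame] = []
    last_good: Optional[Frame] = None
    gain = 1.0
    for frame in frames:
        if frame is not None:
            # Good frame: play it and reset the attenuation.
            last_good = list(frame)
            gain = 1.0
            output.append(last_good)
        elif last_good is None:
            # Nothing received yet: zero stuffing is the only option.
            output.append([0.0] * frame_len)
        else:
            # Waveform substitution: repeat the last frame, quieter each time.
            gain *= decay
            output.append([sample * gain for sample in last_good])
    return output


stream = [[1.0, -1.0], None, None, [0.5, 0.5]]
print(conceal_losses(stream, frame_len=2))
```

Attenuating consecutive repeats keeps a long burst of loss from sounding like a stuck, buzzing tone.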
RTP Jitter Buffer Simulation
[Interactive demo: shows how network variance (jitter) impacts voice packet delivery. While the jitter buffer holds enough packets, audio playback stays smooth.]
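For readers without the interactive view, jitter itself can be quantified with the running interarrival jitter estimate defined in RFC 3550, which receivers report in RTCP. The sketch below is a minimal version; the function name and the millisecond timestamps are illustrative, whereas real RTP timestamps are expressed in codec clock units.

```python
from typing import Sequence


def interarrival_jitter(send_ts: Sequence[float], recv_ts: Sequence[float]) -> float:
    """Running jitter estimate per RFC 3550: J += (|D| - J) / 16."""
    jitter = 0.0
    for i in range(1, len(send_ts)):
        # D compares the spacing at the receiver with the spacing at the sender.
        d = (recv_ts[i] - recv_ts[i - 1]) - (send_ts[i] - send_ts[i - 1])
        jitter += (abs(d) - jitter) / 16
    return jitter


# Packets sent every 20 ms; the third arrives 10 ms late.
sent = [0, 20, 40]
received = [5, 25, 55]
print(interarrival_jitter(sent, received))
```

A jitter buffer sized well above this estimate absorbs the variation; one sized below it underruns and produces audible gaps.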
5. Common Issues: NAT Traversal and One-Way Audio
The most common VoIP problem is "One-Way Audio" — User A can hear User B, but not vice versa. This symptom almost always points to a NAT/Firewall traversal failure.
The root cause: SIP's SDP body contains the sender's private IP address as the RTP destination. When the SIP packet crosses a NAT gateway, the outer IP header is translated, but the SDP body (application-layer payload) is not — it still contains the private RFC 1918 address. The remote party sends RTP to an unreachable private IP and the audio is lost in one direction.
Solutions include STUN/TURN/ICE (WebRTC standard), SIP ALG on the firewall (often counterproductive and best disabled), or a Session Border Controller (SBC) that acts as a media relay with full NAT awareness.
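When diagnosing one-way audio from a packet capture, a quick check is whether the SDP advertises an RFC 1918 address. The sketch below is one way to do that; the function names and the sample SDP fragments are hypothetical, and only IPv4 `c=` lines are handled. The RFC 1918 ranges are listed explicitly because Python's `is_private` attribute also covers other reserved ranges.

```python
import ipaddress
from typing import List

RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(addr: str) -> bool:
    """Return True if addr falls inside one of the RFC 1918 private ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in RFC1918_NETWORKS)


def private_sdp_addresses(sdp: str) -> List[str]:
    """List the private IPv4 addresses advertised in SDP c= lines."""
    found = []
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("c=IN IP4 "):
            addr = line.split()[2]
            if is_rfc1918(addr):
                found.append(addr)
    return found


# Hypothetical SDP fragments: one leaks a private address, one does not.
behind_nat = "v=0\nc=IN IP4 192.168.1.20\nm=audio 16390 RTP/AVP 0\n"
public = "v=0\nc=IN IP4 203.0.113.7\nm=audio 16390 RTP/AVP 0\n"
print(private_sdp_addresses(behind_nat), private_sdp_addresses(public))
```

A non-empty result on SDP that has already crossed the NAT boundary indicates that nothing on the path (ICE, an SBC, or an ALG) rewrote the media address.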
6. Case Study: The WMM Priority Inversion
In a high-density Wi-Fi 6 environment, we encountered "choppy audio" even though voice packets were marked with DSCP 46 (EF).
The Forensic Discovery
We discovered QoS Priority Inversion. While the wired network respected the EF tag, the Wi-Fi AP's EDCA (Enhanced Distributed Channel Access) queue for Voice (AC_VO) was being overwhelmed by high-throughput clients using the Best Effort (AC_BE) queue.
Because the AC_BE clients were using 160MHz channels and long TXOPs (Transmit Opportunities), the VoIP clients (using 20MHz legacy modes) couldn't win the contention for the airwaves, causing "Channel Contention Jitter" that exceeded 100ms.
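For reference, the DSCP 46 (EF) marking mentioned in this case study is applied by the sending endpoint. A minimal sketch of how an application might request it on a UDP socket follows; the helper names are hypothetical, `IP_TOS` behavior varies by operating system, and, as the case study shows, the marking alone does not guarantee priority on the wireless hop.

```python
import socket

DSCP_EF = 46  # Expedited Forwarding


def dscp_to_tos(dscp: int) -> int:
    """Shift a 6-bit DSCP value into the upper bits of the IP TOS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0-63")
    return dscp << 2


def open_marked_udp_socket(dscp: int = DSCP_EF) -> socket.socket:
    """Create a UDP socket whose outgoing packets request the given DSCP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Some platforms ignore or restrict this option; treat it as a request.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock


print(hex(dscp_to_tos(DSCP_EF)))  # 0xb8
```

Verifying the TOS byte in a capture on both the wired and wireless side helps confirm where along the path the marking stops being honored.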
Conclusion
VoIP engineering is the discipline of minimizing jitter and protecting real-time flows from the inherent unpredictability of packet-switched networks. By understanding the separation of SIP and RTP, selecting the appropriate codec for the link budget, tuning QoS markings, and deploying proper NAT traversal infrastructure, engineers can deliver consistent, toll-quality voice on the same infrastructure that carries best-effort internet traffic.