VoIP Performance Analysis
The Engineering of Real-Time Voice
1. SIP vs. RTP: The Control and the Media
Every VoIP call consists of two distinct kinds of traffic, signaling and media, each with its own protocol and its own requirements. Understanding this separation is the first step to effective VoIP troubleshooting: as a rule of thumb, a problem in call setup points to SIP, while a problem in call quality during a conversation points to RTP or the underlying network.
- SIP (Session Initiation Protocol): The "Control Plane". Like a telephone operator, it locates the callee (via DNS SRV records to a SIP proxy), negotiates the media parameters (codecs, sample rates) through SDP (Session Description Protocol), and establishes the session. Typically runs over TCP or UDP port 5060, or TLS on 5061.
- RTP (Real-time Transport Protocol): The "Data Plane". Once SIP completes the handshake and both endpoints agree on a codec and port pair via SDP, SIP "gets out of the way" and the two phones exchange encoded audio using RTP over UDP on dynamically negotiated high ports (commonly 16384–32767, though the exact range is vendor-specific). The media flows directly between the endpoints unless a relay sits in the path.
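To make the SIP/SDP hand-off concrete, here is a minimal sketch of pulling the RTP address, port, and offered codecs out of an SDP body. The SDP text, the addresses, and the function name `parse_audio_offer` are illustrative assumptions, not taken from a real capture, and the parser handles only the few line types shown.

```python
# Hypothetical SDP offer, as it might appear in the body of a SIP INVITE.
# All values (addresses, ports, session IDs) are made up for illustration.
SDP_OFFER = """\
v=0
o=alice 2890844526 2890844526 IN IP4 192.0.2.10
s=-
c=IN IP4 192.0.2.10
t=0 0
m=audio 16390 RTP/AVP 0 8 18
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:18 G729/8000
"""


def parse_audio_offer(sdp: str) -> dict:
    """Extract the RTP address, port, and offered codecs from an SDP body."""
    address = None
    port = None
    offered = []   # payload types in the order listed on the m= line
    rtpmap = {}    # payload type -> "encoding/clock rate"
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("c="):
            # c=<nettype> <addrtype> <connection-address>
            address = line[2:].split()[2]
        elif line.startswith("m=audio"):
            # m=<media> <port> <proto> <payload types...>
            fields = line[2:].split()
            port = int(fields[1])
            offered = fields[3:]
        elif line.startswith("a=rtpmap:"):
            # a=rtpmap:<payload type> <encoding name>/<clock rate>
            pt, encoding = line[len("a=rtpmap:"):].split(maxsplit=1)
            rtpmap[pt] = encoding
    codecs = {pt: rtpmap.get(pt, "unknown") for pt in offered}
    return {"address": address, "port": port, "codecs": codecs}


offer = parse_audio_offer(SDP_OFFER)
print(offer["address"], offer["port"], list(offer["codecs"].values()))
```

The address and port recovered here are exactly where the far end will send its RTP stream, which is why a wrong value in the SDP breaks audio even when call setup succeeds.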
2. Codecs: The Compression-Quality Trade-off
The codec negotiated during SIP/SDP determines the packet size, bandwidth consumption, and call quality ceiling:
- G.711 (PCMU/PCMA): 8-bit companded PCM sampled at 8 kHz, 64 Kbps, with no further compression. It offers the best narrowband quality at the lowest computational cost and is the standard for enterprise telephony over LAN. Each 20 ms RTP packet carries 160 bytes of payload plus 40 bytes of IP/UDP/RTP headers.
- G.729: Highly compressed, 8 Kbps, good quality for WAN links. Historically required a DSP or a licensed software codec (the patents have since expired). Its small bandwidth footprint makes it well suited to limited-bandwidth branches.
- Opus: Modern adaptive codec (6–510 Kbps), used in WebRTC. Dynamically adjusts bitrate based on packet loss and network conditions. The codec of choice for modern UCaaS platforms.
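The codec bitrate is not the bandwidth the call consumes on the wire, because every packet also carries headers. The following sketch works through that arithmetic for a fixed-rate codec, assuming IPv4, no header compression, and no Layer 2 overhead; the function name `ip_bandwidth_kbps` is a hypothetical helper.

```python
# IPv4 (20) + UDP (8) + RTP (12) header bytes, assuming no header compression.
IP_UDP_RTP_HEADER_BYTES = 20 + 8 + 12


def ip_bandwidth_kbps(codec_kbps: float, ptime_ms: int = 20) -> float:
    """Return the per-direction IP-layer bandwidth of one call in kbps."""
    packets_per_second = 1000 / ptime_ms
    # Bytes of encoded audio carried in each packet.
    payload_bytes = codec_kbps * 1000 / 8 / packets_per_second
    packet_bytes = payload_bytes + IP_UDP_RTP_HEADER_BYTES
    return packet_bytes * 8 * packets_per_second / 1000


g711 = ip_bandwidth_kbps(64)  # 160-byte payload + 40-byte headers, 50 packets/s
g729 = ip_bandwidth_kbps(8)   # 20-byte payload + 40-byte headers, 50 packets/s
print(f"G.711: {g711:.0f} kbps, G.729: {g729:.0f} kbps")
```

Under these assumptions G.711 needs about 80 kbps per direction and G.729 about 24 kbps, so the header overhead that is modest for G.711 dominates the G.729 stream.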
3. Metrics of Quality: MOS and the ITU G.107 E-Model
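We measure VoIP quality using the MOS (Mean Opinion Score), a perceptual scale from 1 (bad) to 5 (excellent); toll quality corresponds to roughly 4.0 and above. The ITU-T G.107 E-Model estimates MOS algorithmically: it combines measurable network parameters into a transmission rating factor R, which is then mapped to a MOS value. The key inputs are: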
- Latency (One-Way): Should stay below 150 ms, the ITU-T G.114 guideline. Above roughly 200 ms, conversational dynamics break down: the delay disrupts natural turn-taking, so people start talking over each other, and any echo present on the line becomes far more noticeable.
- Jitter: The variation in packet inter-arrival time. Must be below 30ms. Jitter caused by variable queue delays in the network is smoothed by the dejitter buffer at the cost of additional playback latency.
- Packet Loss: Anything above 1% causes noticeable audible gaps. At 5%, MOS typically falls below 3.5, the minimum acceptable threshold for enterprise business calls, though the exact figure depends on the codec and its packet loss concealment.
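The relationship between these parameters and MOS can be sketched in a few lines. The R-to-MOS mapping below follows the formula published with G.107. The R calculation itself is a simplification, not the full E-Model: it uses a commonly cited approximation (attributed to Cole and Rosenbluth) in which R = 94.2 minus a delay impairment and an equipment impairment. The `equipment_impairment` argument stands in for codec and loss effects and must be supplied by the caller; all function names are hypothetical.

```python
def delay_impairment(one_way_delay_ms: float) -> float:
    """Approximate delay impairment Id (simplified E-Model, an assumption)."""
    d = one_way_delay_ms
    # The penalty grows faster once delay exceeds about 177.3 ms.
    return 0.024 * d + 0.11 * max(0.0, d - 177.3)


def r_factor(one_way_delay_ms: float, equipment_impairment: float = 0.0) -> float:
    """Simplified rating factor: R = 94.2 - Id - Ie."""
    return 94.2 - delay_impairment(one_way_delay_ms) - equipment_impairment


def r_to_mos(r: float) -> float:
    """Map a rating factor R to an estimated MOS (G.107 conversion formula)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6


for delay_ms in (50, 150, 250, 400):
    mos = r_to_mos(r_factor(delay_ms))
    print(f"{delay_ms:>3} ms one-way delay -> estimated MOS {mos:.2f}")
```

Running the loop shows the estimate staying nearly flat up to about 150 ms and then falling off more steeply, which matches the latency guidance in the list above.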
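The effect of jitter on playback can be sketched with two small functions. The first implements the running interarrival jitter estimate defined in RFC 3550 (each new sample moves the estimate by one sixteenth of the difference). The second models a fixed-size dejitter buffer and counts packets that arrive after their scheduled playout time. The timestamps are invented for illustration, and the fixed buffer is an assumption; real endpoints usually adapt the buffer depth.

```python
def interarrival_jitter(send_ms, recv_ms):
    """Running jitter estimate per RFC 3550: J += (|D| - J) / 16."""
    jitter = 0.0
    for i in range(1, len(send_ms)):
        transit_prev = recv_ms[i - 1] - send_ms[i - 1]
        transit_cur = recv_ms[i] - send_ms[i]
        d = abs(transit_cur - transit_prev)
        jitter += (d - jitter) / 16
    return jitter


def late_packets(send_ms, recv_ms, buffer_ms):
    """Count packets that miss their playout slot with a fixed buffer depth."""
    # Playback starts buffer_ms after the first packet arrives; each later
    # packet is due at the same offset from its own send time.
    playout_start = recv_ms[0] + buffer_ms
    return sum(
        1
        for s, r in zip(send_ms, recv_ms)
        if r > playout_start + (s - send_ms[0])
    )


# Hypothetical timestamps in milliseconds: packets sent every 20 ms,
# with the last one delayed by a queueing spike.
send = [0, 20, 40, 60, 80]
recv = [50, 70, 95, 112, 155]

print("jitter estimate (ms):", round(interarrival_jitter(send, recv), 3))
print("late with 10 ms buffer:", late_packets(send, recv, 10))
print("late with 30 ms buffer:", late_packets(send, recv, 30))
```

The deeper buffer absorbs the delay spike at the cost of 20 ms more playback latency, which is the trade-off the jitter metric describes.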
4. Common Issues: NAT Traversal and One-Way Audio
The most common VoIP problem is "One-Way Audio" — User A can hear User B, but not vice versa. This symptom almost always points to a NAT/Firewall traversal failure.
The root cause: the SDP body carried in the SIP message advertises the sender's private IP address as the place to send RTP. When the SIP packet crosses a NAT gateway, the outer IP header is translated, but the SDP body (application-layer payload) is not, so it still contains the private RFC 1918 address. The remote party then sends RTP to an unreachable private IP, and the audio is lost in one direction.
Solutions include STUN/TURN/ICE (WebRTC standard), SIP ALG on the firewall (often counterproductive and best disabled), or a Session Border Controller (SBC) that acts as a media relay with full NAT awareness.
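When diagnosing one-way audio, a quick first check is whether the SDP advertises an RFC 1918 address that differs from the address the SIP packet actually arrived from. The sketch below illustrates that check; the SDP snippets, the source address, and the function names are hypothetical, and it only looks at the session-level IPv4 `c=` line.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def sdp_connection_address(sdp: str):
    """Return the IPv4 address from the first c= line, or None if absent."""
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("c=IN IP4 "):
            return line.split()[2]
    return None


def is_rfc1918(address: str) -> bool:
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in RFC1918_NETWORKS)


def likely_nat_problem(sdp: str, packet_source_ip: str) -> bool:
    """True if the SDP advertises a private address the sender is not seen at."""
    advertised = sdp_connection_address(sdp)
    if advertised is None:
        return False
    return is_rfc1918(advertised) and advertised != packet_source_ip


# Hypothetical case: the phone sits behind NAT, so the SIP packet arrives from
# the gateway's public address while the SDP still names the private one.
natted_sdp = "v=0\nc=IN IP4 192.168.1.20\nm=audio 16390 RTP/AVP 0\n"
public_sdp = "v=0\nc=IN IP4 198.51.100.7\nm=audio 16390 RTP/AVP 0\n"

print(likely_nat_problem(natted_sdp, "203.0.113.5"))
print(likely_nat_problem(public_sdp, "198.51.100.7"))
```

A positive result only flags a likely cause. Whether audio actually fails depends on whether an SBC, ICE, or another traversal mechanism corrects the address along the path.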
Conclusion
VoIP engineering is the discipline of minimizing jitter and protecting real-time flows from the inherent unpredictability of packet-switched networks. By understanding the separation of SIP and RTP, selecting the appropriate codec for the link budget, tuning QoS markings, and deploying proper NAT traversal infrastructure, engineers can deliver consistent, toll-quality voice on the same infrastructure that carries best-effort internet traffic.