In a Nutshell

VoIP is one of the most network-sensitive applications. Sustained packet loss, delay, or jitter quickly makes speech choppy or unintelligible, and because the audio is real-time, lost packets cannot usefully be retransmitted. VoIP splits the call into two distinct planes: SIP for locating and connecting the users, and RTP for carrying the actual voice data. This article analyzes how these protocols work together and how to debug call-quality problems.

1. SIP vs. RTP: The Control and the Media

Every VoIP call consists of two entirely different types of traffic, signaling and media, each with its own protocol and its own network requirements. Understanding this separation is the first step to effective VoIP troubleshooting: a problem in call setup is a SIP issue; a problem in call quality during a conversation is an RTP/network issue.

  • SIP (Session Initiation Protocol): The "Control Plane". Like a telephone operator, it locates the callee (via DNS SRV records to a SIP proxy), negotiates the media parameters (codecs, sample rates) through SDP (Session Description Protocol), and establishes the session. Typically runs over TCP or UDP port 5060, or TLS on 5061.
  • RTP (Real-time Transport Protocol): The "Data Plane". Once SIP completes the handshake and both endpoints agree on a codec and port pair via SDP, SIP "gets out of the way" and the two phones exchange codec-encoded audio using RTP over UDP on dynamically negotiated high ports (typically 16384 to 32768, though the range is vendor-dependent). The media flows directly between the endpoints unless a proxy or Session Border Controller is relaying it.
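
To make the SDP negotiation concrete, here is a minimal Python sketch that pulls the RTP address, port, and offered codecs out of an SDP body. The sample offer, the addresses, and the function name `parse_audio_offer` are illustrative assumptions rather than output from a real capture, and the parser deliberately ignores many SDP features (media-level `c=` lines, multiple streams).

```python
# Hypothetical SDP offer, as it might appear in the body of a SIP INVITE.
SAMPLE_SDP = """\
v=0
o=alice 2890844526 2890844526 IN IP4 192.0.2.10
s=-
c=IN IP4 192.0.2.10
t=0 0
m=audio 16390 RTP/AVP 0 8 18
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:18 G729/8000
"""


def parse_audio_offer(sdp):
    """Return the RTP address, port, and offered codecs from an SDP body."""
    address = None
    port = None
    payload_types = []
    rtpmap = {}
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("c=") and address is None:
            # c=<nettype> <addrtype> <connection-address>
            address = line[2:].split()[2]
        elif line.startswith("m=audio"):
            # m=<media> <port> <proto> <payload types...>
            fields = line[2:].split()
            port = int(fields[1])
            payload_types = fields[3:]
        elif line.startswith("a=rtpmap:"):
            # a=rtpmap:<payload type> <encoding name>/<clock rate>
            pt, encoding = line[len("a=rtpmap:"):].split(None, 1)
            rtpmap[pt] = encoding
    # Keep the offerer's preference order; fall back to the bare number
    # when a payload type has no rtpmap attribute.
    offered = [rtpmap.get(pt, pt) for pt in payload_types]
    return {"address": address, "port": port, "offered": offered}


offer = parse_audio_offer(SAMPLE_SDP)
print(offer["address"], offer["port"], offer["offered"])
```

The address and port returned here are where the remote phone will send its RTP stream, which is why a wrong value in the SDP shows up as missing audio rather than as a failed call setup.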

2. Codecs: The Compression-Quality Trade-off

The codec negotiated during SIP/SDP determines the packet size, bandwidth consumption, and call quality ceiling:

  • G.711 (PCMU/PCMA): Companded PCM with no further compression, 64 kbps, toll-quality narrowband audio, lowest computation. Standard for enterprise telephony over LAN. Each 20 ms RTP packet carries 160 bytes of payload plus 40 bytes of IP/UDP/RTP headers.
  • G.729: Highly compressed, 8 kbps, good quality for WAN links. Historically required a licensed DSP or software codec license, although the core patents have since expired. Its small bandwidth footprint makes it well suited to limited-bandwidth branches.
  • Opus: Modern adaptive codec (6 to 510 kbps), used in WebRTC. Its bitrate can be changed on the fly, so applications can adapt to packet loss and network conditions. The codec of choice for modern UCaaS platforms.
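
The header overhead matters more than the codec bitrate suggests. The following Python sketch works through the arithmetic for the fixed-rate codecs above. It assumes 20 ms packetization for both codecs, IPv4 with 40 bytes of IP/UDP/RTP headers, and no layer-2 overhead; the helper name `ip_bandwidth_kbps` is hypothetical.

```python
HEADER_BYTES = 40  # IPv4 (20) + UDP (8) + RTP (12); layer-2 framing not included


def ip_bandwidth_kbps(codec_kbps, ptime_ms=20):
    """Estimate per-direction IP bandwidth for a fixed-rate codec."""
    # kbps * ms gives bits per packet; divide by 8 for bytes.
    payload_bytes = codec_kbps * ptime_ms / 8
    packets_per_second = 1000 / ptime_ms
    return (payload_bytes + HEADER_BYTES) * 8 * packets_per_second / 1000


print("G.711:", ip_bandwidth_kbps(64), "kbps")
print("G.729:", ip_bandwidth_kbps(8), "kbps")
```

Under these assumptions G.711 needs 80 kbps per direction and G.729 needs 24 kbps, so the headers triple the cost of a G.729 stream. Ethernet, VLAN, or VPN encapsulation adds further overhead that this sketch does not model.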

3. Metrics of Quality: MOS and the ITU G.107 E-Model

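We measure VoIP quality using the MOS (Mean Opinion Score), a perceptual scale from 1 (bad) to 5 (excellent); toll quality corresponds to roughly 4.0 and above. The ITU-T G.107 E-Model estimates quality algorithmically: it computes a transmission rating factor R from measurable network parameters and maps R to an estimated MOS. The main inputs are: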

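  1. Latency (One-Way): Should stay below 150 ms, the target recommended in ITU-T G.114. Above roughly 200 ms, conversational dynamics break down: each party hears the other's response late, so people start talking over each other, and any residual echo becomes far more noticeable.
  2. Jitter: The variation in packet inter-arrival time. Should stay below 30 ms. Jitter caused by variable queuing delays in the network is smoothed by the de-jitter buffer at the cost of additional playback latency; packets that arrive after their playout time are discarded, which the listener experiences as loss.
  3. Packet Loss: Loss above about 1% typically causes audible gaps. At around 5%, MOS commonly falls below 3.5 (the exact figure depends on the codec and its loss concealment), which many enterprises treat as the minimum acceptable score for business calls.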
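
Two of these calculations are small enough to sketch in Python: the interarrival jitter estimator defined in RFC 3550 and the R-to-MOS mapping from G.107. This is an illustration under stated assumptions, not a full E-Model: it does not derive R from delay and loss, timestamps are taken in milliseconds rather than RTP clock units, and the function names and sample timestamps are hypothetical.

```python
def interarrival_jitter(send_ms, recv_ms):
    """Running jitter estimate in the style of RFC 3550 (gain of 1/16)."""
    jitter = 0.0
    for i in range(1, len(send_ms)):
        # D = change in transit time between consecutive packets.
        d = (recv_ms[i] - recv_ms[i - 1]) - (send_ms[i] - send_ms[i - 1])
        jitter += (abs(d) - jitter) / 16
    return jitter


def r_to_mos(r):
    """Map an E-Model rating factor R to an estimated MOS (G.107 mapping)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6


# Packets sent every 20 ms; the second one is delayed by an extra 16 ms.
sent = [0, 20, 40]
received = [50, 86, 90]
print("jitter estimate (ms):", interarrival_jitter(sent, received))
print("MOS at R = 93.2:", round(r_to_mos(93.2), 2))
print("MOS at R = 70:", round(r_to_mos(70), 2))
```

The jitter estimate is deliberately slow-moving: a single 16 ms deviation only moves it by 1 ms, so one late packet does not cause a receiver to resize its buffer. In the MOS mapping, R = 93.2 (the E-Model default for a clean narrowband connection) yields about 4.41, and R = 70 yields about 3.6.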

Common Issues: NAT Traversal and One-Way Audio

The most common VoIP problem is "One-Way Audio" — User A can hear User B, but not vice versa. This symptom almost always points to a NAT/Firewall traversal failure.

The root cause: SIP's SDP body contains the sender's private IP address as the RTP destination. When the SIP packet crosses a NAT gateway, the outer IP header is translated, but the SDP body (application-layer payload) is not — it still contains the private RFC 1918 address. The remote party sends RTP to an unreachable private IP and the audio is lost in one direction.
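
When reading a packet capture, this condition is easy to check for. The Python sketch below scans an SDP body for IPv4 connection addresses in the RFC 1918 ranges. The function name `private_sdp_addresses` and the sample bodies are hypothetical, and the check only covers `c=IN IP4` lines, not the SIP Contact or Via headers, which can carry private addresses as well.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def private_sdp_addresses(sdp):
    """Return RFC 1918 addresses advertised in the SDP c= lines."""
    found = []
    for line in sdp.splitlines():
        line = line.strip()
        if not line.startswith("c=IN IP4 "):
            continue
        # Drop an optional "/ttl" suffix used with multicast addresses.
        addr = line.split()[2].split("/")[0]
        ip = ipaddress.ip_address(addr)
        if any(ip in net for net in RFC1918_NETWORKS):
            found.append(addr)
    return found


# Hypothetical SDP as seen on the public side of a NAT gateway.
leaked = "v=0\r\nc=IN IP4 192.168.1.50\r\nm=audio 16390 RTP/AVP 0\r\n"
fixed = "v=0\r\nc=IN IP4 203.0.113.5\r\nm=audio 16390 RTP/AVP 0\r\n"
print(private_sdp_addresses(leaked))
print(private_sdp_addresses(fixed))
```

If a capture taken on the public side of the NAT returns a non-empty list, the far end is being told to send RTP to an address it cannot reach.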

Solutions include STUN/TURN/ICE (WebRTC standard), SIP ALG on the firewall (often counterproductive and best disabled), or a Session Border Controller (SBC) that acts as a media relay with full NAT awareness.

Conclusion

VoIP engineering is the discipline of minimizing jitter and protecting real-time flows from the inherent unpredictability of packet-switched networks. By understanding the separation of SIP and RTP, selecting the appropriate codec for the link budget, tuning QoS markings, and deploying proper NAT traversal infrastructure, engineers can deliver consistent, toll-quality voice on the same infrastructure that carries best-effort internet traffic.
