LRG Networks.com
LWC Training Corp.

Voice over IP Online Course

Lesson 3 - How Does VoIP Work?

C. Transmitting Voice


Voice Transmission


Once a call has been set up between two or more VoIP devices, the caller starts speaking. At this point the voice signal has to be converted into a digital signal, formatted for TCP/IP transmission and sent along the network to the destination, where all of the preceding steps are reversed.

Digitizing voice

The steps involved are:

1. Converting the voice into a digital signal
   a) Sampling
   b) Quantizing
2. Making the voice data smaller
   a) Silence suppression
   b) Compression

Sampling and quantizing

The first step in converting analog voice signals into digital is called sampling. The voice signal is sampled 8,000 times per second and each sample is encoded in 8 bits. This produces a bit stream of 64,000 bits per second. Because telephone-quality speech occupies frequencies below about 4 kHz, 8,000 samples per second are sufficient to reproduce the original sound accurately. The process of converting one sample into 8 bits is called "quantizing" because the infinite possible values of a voice sample must fit into one of the 256 discrete values available in a digital byte (2^8 = 256). Sampling and quantizing together are called Pulse Code Modulation (PCM). The device that produces a digital signal from an analog one is called a codec, which is an abbreviation of coder/decoder. Normally a codec is embedded in a microchip called a digital signal processor (DSP).
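The sampling and quantizing steps can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names are hypothetical, the input is a synthetic sine tone rather than real speech, and the quantizer is uniform (linear), whereas real G.711 PCM uses logarithmic companding (mu-law or A-law).

```python
import math

SAMPLE_RATE_HZ = 8000            # samples per second, as stated in the text
BITS_PER_SAMPLE = 8              # one byte per sample
LEVELS = 2 ** BITS_PER_SAMPLE    # 256 discrete values


def quantize(sample: float) -> int:
    """Map an analog value in [-1.0, 1.0] to one of 256 levels (0..255).

    Assumption: a uniform quantizer, used here only for illustration.
    """
    clipped = max(-1.0, min(1.0, sample))
    # Scale [-1, 1] onto [0, 255] and round to the nearest level.
    return round((clipped + 1.0) / 2.0 * (LEVELS - 1))


def sample_tone(freq_hz: float, duration_s: float) -> list[int]:
    """Sample a sine tone 8,000 times per second and quantize each sample."""
    count = int(SAMPLE_RATE_HZ * duration_s)
    return [
        quantize(math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE_HZ))
        for n in range(count)
    ]


# 8,000 samples/s x 8 bits/sample = 64,000 bits/s
bit_rate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE
```

Calling sample_tone(1000, 0.02) yields 160 one-byte samples, which is 20 ms of voice at this rate.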

Figure 11: The codec

PCM produces a 64 kbps stream of data with excellent voice quality. This process allowed long distance calls to be placed on the telephone company's T1 lines for transmission. Each voice call occupies one full 64 kbps channel, which is not a very efficient scheme. With VoIP, we want to cram as much voice data into as little digital signal as possible. And instead of diverting our digital voice signal directly onto a T1 line, we need to packetize it and send it over an IP network.

Silence suppression and compression

It has been estimated that as much as 60% of a voice conversation is silence. Deleting these empty bits decreases the amount of data needed for the voice transmission. However, taking all of these empty bits out of the transmission gives the conversation an eerie, otherworldly quality. In practice, voice engineers compensate by putting some background "comfort noise" back into the conversation.
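A very rough sketch of silence suppression follows. It assumes samples are floating-point values in [-1.0, 1.0], splits them into 20 ms frames, and drops any frame whose average amplitude falls below a threshold. The threshold value and the function names are hypothetical; real equipment uses more sophisticated voice activity detection.

```python
FRAME_SAMPLES = 160        # 20 ms of voice at 8,000 samples per second
SILENCE_THRESHOLD = 0.02   # hypothetical average-amplitude cutoff


def is_silent(frame):
    """Treat a frame as silence if its mean absolute amplitude is tiny."""
    if not frame:
        return True
    level = sum(abs(s) for s in frame) / len(frame)
    return level < SILENCE_THRESHOLD


def suppress_silence(samples):
    """Split samples into frames; silent frames become None (not sent).

    The receiver would fill each None slot with comfort noise.
    """
    frames = [
        samples[i:i + FRAME_SAMPLES]
        for i in range(0, len(samples), FRAME_SAMPLES)
    ]
    return [None if is_silent(f) else f for f in frames]
```

For example, a stream made of one loud frame, one silent frame and another loud frame produces three slots, of which only the first and third carry data.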

In addition to silence suppression, the digital data that represents the voice can be compressed with modern compression techniques, similar to those used for computer data.

The net effect of these techniques is to reduce the bandwidth required for a voice conversation down from 64 kbps to 32 kbps, 16 kbps, 8 kbps or even less. Eight voice conversations at 8 kbps can take place over the same circuit as a single conversation at 64 kbps.
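The arithmetic behind that comparison is simple division, sketched below. The function name is hypothetical, and the calculation deliberately ignores packet header overhead, so it reflects only the raw voice bit rates quoted in the text.

```python
CIRCUIT_KBPS = 64  # capacity of one uncompressed PCM voice channel


def calls_per_circuit(codec_kbps, circuit_kbps=CIRCUIT_KBPS):
    """How many conversations at codec_kbps fit in one circuit."""
    return int(circuit_kbps // codec_kbps)


for rate in (64, 32, 16, 8):
    print(f"{rate:>2} kbps per call -> {calls_per_circuit(rate)} call(s) per circuit")
```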

Encoding and compression techniques are published as standards by the International Telecommunication Union (ITU). Expect to see these when looking at specifications for VoIP equipment. The original PCM at 64 kbps is G.711 and is always supported by VoIP equipment. Other important encoding and compression standards are as follows.

Codec      Algorithm       Bit rate (kbps)
G.711      PCM             64
G.726      ADPCM           32
G.728      LD-CELP         16
G.729A     CS-ACELP        8
G.723.1    ACELP/MP-MLQ    5.3/6.3

Jitter buffers are memory areas used to store voice packets arriving with variable delays so that it appears that each voice sample has arrived in the same amount of time. The steady output of the voice samples from the jitter buffer is called playout. The playout is steady and constant, and as long as the jitter buffer receives an ample supply of voice packets, the system appears to have a fixed delay.
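The idea can be illustrated with a minimal jitter buffer, sketched here under stated assumptions: the class and method names are hypothetical, packets carry a sequence number, and the buffer waits until a fixed number of packets has arrived before playout starts. It does not handle late packets or sequence number wraparound.

```python
import heapq


class JitterBuffer:
    """Reorders packets by sequence number and releases one per playout tick."""

    def __init__(self, depth=3):
        self.depth = depth      # packets to collect before playout begins
        self._heap = []         # min-heap of (sequence number, payload)
        self._started = False

    def push(self, seq, payload):
        """Store a packet as it arrives, whatever its delay or order."""
        heapq.heappush(self._heap, (seq, payload))

    def pop(self):
        """Called at a steady rate; returns the next payload or None."""
        if not self._started:
            if len(self._heap) < self.depth:
                return None     # still filling: this is the fixed delay
            self._started = True
        if not self._heap:
            return None         # buffer ran dry (underrun)
        return heapq.heappop(self._heap)[1]
```

Packets pushed in the order 2, 1, 3 come back out as 1, 2, 3 once the buffer has filled, which is what makes the delay appear fixed to the listener.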

Packetizing voice

Once the voice has been digitized, the silence suppressed and the data compressed, it has to be divided into segments for placing into IP packets.

VoIP is inefficient for small voice packets, while large voice packets lead to long delays. Every VoIP packet carries overhead in the form of headers. The headers for IP, UDP and RTP add up to 40 bytes. If the voice data were as small as 40 bytes, the packet would be only 50% efficient. The largest IP packet that standard Ethernet can carry is 1500 bytes. Take away the 40 bytes for the headers and you still have 1460 bytes available. That translates into 1460 samples of uncompressed voice, or about one fifth of a second (182 ms). If the voice is compressed at a ratio of 8 to 1, the same packet holds about 1.5 seconds. If a packet with this much voice is lost or arrives out of order, the conversation will be severely disrupted.
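The figures in this paragraph can be reproduced with two small helper functions. The names are hypothetical; the 40-byte header total and the 64 kbps and 8 kbps rates come from the text.

```python
HEADER_BYTES = 40  # combined IP + UDP + RTP headers


def efficiency(payload_bytes):
    """Fraction of the packet that is actual voice data."""
    return payload_bytes / (payload_bytes + HEADER_BYTES)


def voice_ms(payload_bytes, codec_kbps=64):
    """Milliseconds of voice held in a payload at a given codec bit rate.

    bits / (kilobits per second) gives milliseconds directly.
    """
    return payload_bytes * 8 / codec_kbps


print(efficiency(40))        # 40-byte payload: half the packet is header
print(voice_ms(1460))        # full Ethernet-sized payload, uncompressed
print(voice_ms(1460, 8))     # same payload at 8 kbps (8 to 1 compression)
```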

Typically, 10ms to 30ms (average 20ms) of voice is placed inside one packet. 20ms of uncompressed voice takes 160 bytes. Compressed at 4 to 1, 20ms would take 40 bytes. The amount of voice carried inside one packet is a trade-off between the need for efficiency and the need to smooth out a conversation if a packet is lost in transit.
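As a sketch of the packetizing step, the following splits a digitized voice stream into payloads that each hold a fixed duration of voice. The function names are hypothetical, and 4 to 1 compression is modeled simply as a 16 kbps codec rate.

```python
def payload_bytes(ms, codec_kbps):
    """Bytes needed to carry `ms` milliseconds of voice at `codec_kbps`.

    milliseconds x kilobits per second = bits; divide by 8 for bytes.
    """
    return ms * codec_kbps // 8


def packetize(stream: bytes, ms=20, codec_kbps=64):
    """Cut a voice byte stream into payloads of `ms` milliseconds each."""
    size = payload_bytes(ms, codec_kbps)
    return [stream[i:i + size] for i in range(0, len(stream), size)]


one_second = bytes(8000)          # 1 s of uncompressed 8-bit voice
packets = packetize(one_second)   # 20 ms per packet
```

One second of uncompressed voice becomes 50 payloads of 160 bytes each, so a single lost packet costs the listener only 20 ms of speech.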

Figure 12: The voice packet

Transmission of voice by IP

The three protocols of the TCP/IP protocol suite used by the voice data are the Real-time Transport Protocol (RTP), the User Datagram Protocol (UDP) and the Internet Protocol (IP).
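The 40 bytes of header overhead break down as 20 bytes for an IPv4 header without options, 8 bytes for UDP and 12 bytes for the fixed RTP header. The sketch below builds such an RTP header with Python's struct module. The function name and the sample field values are hypothetical; payload type 0 is the standard RTP identifier for G.711 mu-law audio.

```python
import struct

IP_HEADER_BYTES = 20    # IPv4 header without options
UDP_HEADER_BYTES = 8
RTP_HEADER_BYTES = 12   # fixed RTP header, no CSRC entries or extensions


def build_rtp_header(seq, timestamp, ssrc, payload_type=0):
    """Pack the 12-byte fixed RTP header in network byte order."""
    first_byte = 2 << 6                  # version 2, no padding/extension, CSRC count 0
    second_byte = payload_type & 0x7F    # marker bit cleared
    return struct.pack(
        "!BBHII",
        first_byte,
        second_byte,
        seq & 0xFFFF,              # 16-bit sequence number
        timestamp & 0xFFFFFFFF,    # 32-bit timestamp, counted in samples
        ssrc & 0xFFFFFFFF,         # 32-bit source identifier
    )


header = build_rtp_header(seq=1, timestamp=160, ssrc=0x12345678)
total_overhead = IP_HEADER_BYTES + UDP_HEADER_BYTES + len(header)
```

The sequence number lets the receiver detect lost or reordered packets, and the timestamp tells it when each payload should be played out.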

Why UDP and not TCP?

TCP is often used for call control and setup, but UDP is used for the voice transmission itself. TCP is the protocol used when guaranteed delivery of a packet is required. If a packet is lost, TCP provides a re-transmission mechanism that continues to transmit the same data until it is finally received. This same mechanism makes TCP unsuitable for voice transmission. Waiting for a lost packet to be re-transmitted holds up the packets behind it and introduces a gap in the conversation. It is better that one packet, which typically represents 20 ms of conversation, stays lost than that the conversation is interrupted. UDP does not provide a re-transmission facility and is therefore the protocol of choice for voice transmission.
