Why does Asterisk (and other systems) feel the need to regenerate in-call DTMF?

Tags: asterisk, telephony

I am using Asterisk to interact with analog telephony devices that can be programmed and tested with DTMF interaction.

Some of these guys speak rather quickly. Too quickly, you could convincingly argue; I'd be right with you there. And yet, Asterisk is perfectly capable of hearing the tones, and if I'm lucky enough to get a clean stream with in-band DTMF audio, I can recognize even really fast tones very successfully.

The problem arises when Asterisk (or another telephony system) decides it needs to recognize and regenerate the DTMF. I realize this is important when translating, e.g., to or from out-of-band DTMF, but I'm not sure why it seems to be the default behavior, and in particular why the tones are often regenerated with lengthy durations (e.g. 100 ms; thankfully this can be changed in Asterisk, although it can involve a recompile) that are almost guaranteed to mean lost digits. Others have reported cases where in-band to out-of-band conversion resulted in duplicated digits, even though the conversion was not necessary.

So my question is: why is this the M.O. for telephony systems? Why not leave in-call DTMF alone unless translation is explicitly required?

Best Answer

1. Take a high-fidelity CD recording of your favorite song.
2. Record it using the cheapest microphone you can find.
3. Encode the recording with a lousy 8-bit audio codec optimized for spoken words.
4. Play the recording back through a cheap speaker (and wiggle the wires).

If you listen to the CD and the chain above side-by-side you'll hear how badly mangled things get in telephony. Now imagine that instead of a song you recorded DTMF tones and were trying to play them back and get a computer to recognize them.
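For a sense of what the detector at the far end is doing with that mangled audio, here is a minimal sketch of DTMF frequency detection using the Goertzel algorithm. The function name, the 40 ms window, and the synthetic tone are illustrative choices of mine, not Asterisk's actual implementation:

```python
import math

def goertzel_power(samples, sample_rate, target_hz):
    """Relative power of target_hz in samples (Goertzel algorithm)."""
    n = len(samples)
    k = round(n * target_hz / sample_rate)      # nearest frequency bin
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2

# Synthesize 40 ms of DTMF "5" (770 Hz row tone + 1336 Hz column tone) at 8 kHz.
RATE = 8000
tone = [math.sin(2 * math.pi * 770 * t / RATE) +
        math.sin(2 * math.pi * 1336 * t / RATE)
        for t in range(int(0.040 * RATE))]

rows = [697, 770, 852, 941]
cols = [1209, 1336, 1477, 1633]
row = max(rows, key=lambda f: goertzel_power(tone, RATE, f))
col = max(cols, key=lambda f: goertzel_power(tone, RATE, f))
print(row, col)  # → 770 1336
```

On a clean synthetic tone this picks the right pair easily; run the same samples through 8-bit compression, jitter, and packet loss and the margins shrink fast, which is the point of the analogy above.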

This is why most VoIP systems re-encode DTMF tones using an out-of-band channel (like RFC 2833) -- the compression, network jitter, latency, and potential packet loss make audio-encoded DTMF prone to failure.
By sending the DTMF tones as out-of-band data they can be reinserted into the audio stream at the endpoint closest to the PSTN, minimizing the risk that the tones will be mangled.
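For illustration, the RFC 2833 wire format is tiny: each telephone-event is a 4-byte RTP payload carrying the digit, an end flag, a volume, and a duration. A sketch of packing one (the helper name is mine, not from any library):

```python
import struct

def rfc2833_payload(event, end=False, volume=10, duration=800):
    """Pack an RFC 2833 telephone-event payload (4 bytes).

    event:    0-9 for digits, 10 = '*', 11 = '#'
    end:      set on the final packet of the event
    volume:   tone power in -dBm0 (6 bits)
    duration: RTP timestamp units (800 = 100 ms at 8 kHz)
    """
    byte1 = (0x80 if end else 0x00) | (volume & 0x3F)
    return struct.pack('!BBH', event & 0xFF, byte1, duration)

p = rfc2833_payload(5, end=True, duration=800)
print(p.hex())  # → '058a0320'
```

Because the digit travels as this small structure rather than as audio, the codec can do whatever it likes to the voice path without touching the DTMF.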

Why 100ms? Because some telephone lines or remote ends have trouble with shorter tone durations (if you've ever called a touch-tone system over a noisy land line you've probably held a button for a few seconds in frustration to get the system to recognize the tone).
(100 ms is probably too long; 20-50 ms is more than adequate.)
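The arithmetic behind the questioner's lost-digits concern: if regenerated tones plus inter-digit gaps take longer than the source device's own digit cadence, the regenerator can't keep up. A back-of-the-envelope sketch, with illustrative gap figures:

```python
def max_digit_rate(tone_ms, gap_ms):
    """Digits per second if each digit needs tone_ms of tone
    plus gap_ms of silence before the next digit."""
    return 1000.0 / (tone_ms + gap_ms)

# Regenerated at 100 ms with an assumed 50 ms inter-digit gap:
print(round(max_digit_rate(100, 50), 1))  # → 6.7 digits/s
# A fast device sending 40 ms tones with 40 ms gaps:
print(round(max_digit_rate(40, 40), 1))   # → 12.5 digits/s
```

A device dialing at the faster cadence will outrun a regenerator stuck at 100 ms, and digits get dropped.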


You don't have to use out-of-band signaling -- most VoIP systems will deal with in-band signaling (you typically have to set a parameter on both your phone and your server), and you must use high-quality codecs (or disable compression entirely) if you want a real shot at reliability.
Most people deploying them elect to use RFC 2833 (and re-encode DTMF received in-band) instead because it is substantially more reliable.
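In Asterisk's chan_sip, for example, that choice is the `dtmfmode` setting in `sip.conf`; a sketch of the two options (the peer section name here is illustrative):

```ini
; sip.conf (chan_sip) -- illustrative fragment
[general]
dtmfmode = rfc2833    ; out-of-band DTMF, the reliable default

[analog-gateway]      ; hypothetical peer name
dtmfmode = inband     ; pass DTMF as audio instead
disallow = all
allow = ulaw          ; in-band DTMF needs an uncompressed codec
```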