To get the most from Call Journey’s VoiceAI services, it helps to understand the variables that affect transcription accuracy. Audio is the primary input to an ASR system; therefore, the quality of the audio files has a significant impact on transcription accuracy. In general, the best audio recording practices are to:
- Use unencrypted, open-source audio codecs.
- Use dual-channel, speaker separated (stereo) audio.
- Use high quality telephony and recording equipment.
- Minimize surrounding noise in environments where audio is recorded.
- Record high-quality audio with codecs that are optimized for speech.
- Avoid lossy transcoding and compression, such as audio in MP3 format.
1. Single-Channel (Mono) versus Channel-Separated Audio
- In mono audio, all speakers are recorded on a single channel.
- In channel-separated audio, each speaker is isolated to a distinct channel.
Channel-separated audio makes it possible to transcribe each channel independently and maintain a perfect correspondence between the person speaking and the words spoken. For analytic purposes, it is important to have each speaker on a separate channel.
For example, channel-separated audio not only decouples overtalk from overall accuracy, but it also allows for an objective measurement of the overtalk in calls. However, in single-channel (mono) audio, the greater the overtalk, the lower overall accuracy will be.
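With channel-separated audio, overtalk can be measured directly as the overlap between the agent's and caller's talk intervals. The sketch below assumes per-channel speech segments (start/end times in seconds) are already available, for example from a voice-activity detector; the function name and inputs are illustrative, not a Call Journey API:

```python
def overlap_seconds(agent_segments, caller_segments):
    """Total time (in seconds) during which both channels contain speech.

    Each argument is a list of (start, end) tuples in seconds, one list
    per channel, as produced by a voice-activity detector (assumed input).
    """
    total = 0.0
    for a_start, a_end in agent_segments:
        for c_start, c_end in caller_segments:
            # Overlap of two intervals, clamped at zero when they don't meet.
            total += max(0.0, min(a_end, c_end) - max(a_start, c_start))
    return total

# Example: agent talks 0-5 s and 8-12 s; caller talks 4-9 s.
agent = [(0.0, 5.0), (8.0, 12.0)]
caller = [(4.0, 9.0)]
print(overlap_seconds(agent, caller))  # 1 s (4-5) + 1 s (8-9) = 2.0
```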
Call Journey employs a process called diarization to separate the speakers in mono audio onto distinct channels. Diarization is less effective when the source audio includes hold music, voice recordings, or more than two speakers, and overtalk may also reduce its accuracy. However, for typical agent-and-caller calls with only two speakers, diarization is very effective at assigning each speaker to their own channel for enhanced analytics.
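When audio is already recorded in stereo with one speaker per channel, the channels can be separated with nothing more than the Python standard library. This is a minimal sketch for 16-bit stereo WAV files (file paths and function name are illustrative):

```python
import wave

def split_channels(stereo_path, left_path, right_path):
    """Split a 16-bit stereo WAV into two mono WAV files,
    one per speaker (e.g. agent and caller)."""
    with wave.open(stereo_path, "rb") as src:
        assert src.getnchannels() == 2 and src.getsampwidth() == 2
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    # Stereo frames are interleaved: L0 R0 L1 R1 ... (2 bytes per sample).
    left, right = bytearray(), bytearray()
    for i in range(0, len(frames), 4):
        left += frames[i:i + 2]
        right += frames[i + 2:i + 4]
    for path, data in ((left_path, left), (right_path, right)):
        with wave.open(path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(2)
            dst.setframerate(params.framerate)
            dst.writeframes(bytes(data))
```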
2. Audio Codecs
Audio codecs that are not optimized for speech degrade the speech signal more and can reduce transcription accuracy, so it is important to choose your codecs wisely. The following list describes a variety of audio codecs and their implications for transcription accuracy.
- G.711 µ-law / A-law (best option)
- 64 kbps per channel - low-compression, speech-optimized codec
- G.729A/B (average option)
- 8 kbps per channel - aggressive compression, but optimized for speech
- G.723.1 (poor option)
- 6 kbps per channel - highly compressed with poor quality
Alternative audio codec options:
- Newer, speech-optimized codec providing good results at 32 kbps per channel and higher
- MP3 is a lossy codec optimized for music, not speech. If MP3 must be used for transcription, VoiceAI recommends using at least 32 kbps per channel.
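The per-channel bitrates above translate directly into storage cost: size in bytes = bitrate × seconds × channels ÷ 8. A small sketch comparing a 10-minute stereo call under the codecs discussed above (the helper name is illustrative):

```python
def audio_size_mb(kbps_per_channel, channels, seconds):
    """Approximate encoded audio size in MB (1 MB = 10**6 bytes)."""
    bits = kbps_per_channel * 1000 * channels * seconds
    return bits / 8 / 1e6

# A 10-minute (600 s) stereo call under each codec listed above.
for codec, kbps in [("G.711", 64), ("G.729A/B", 8), ("G.723.1", 6)]:
    print(f"{codec}: {audio_size_mb(kbps, 2, 600):.1f} MB")
# G.711: 9.6 MB, G.729A/B: 1.2 MB, G.723.1: 0.9 MB
```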
3. Transcoding and Compressing Audio
Transcoding is the direct digital-to-digital conversion of data from one encoding to another. Transcoding audio files can conserve disk space and shorten file transfer times. The following list describes the four transcode types, each of which has different implications for transcription accuracy.
3.1. Lossless-to-lossless (recommended)
No audio information is lost during the lossless-to-lossless transcoding process. Converting from a PCM WAV file to a FLAC file is an example of lossless transcoding, commonly used for saving disk space without compromising on quality. A 10-minute, mono WAV file at 8-bit/16 kHz is 9.8 MB, whereas the same file after FLAC conversion is 5.6 MB.
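Uncompressed PCM size follows directly from the format parameters, which makes figures like the one above easy to sanity-check. A small helper (the function name is illustrative; the result excludes the WAV container's small header, so measured file sizes will be slightly larger):

```python
def pcm_size_bytes(sample_rate_hz, sample_width_bytes, channels, seconds):
    """Raw PCM payload size in bytes, excluding the small WAV header."""
    return sample_rate_hz * sample_width_bytes * channels * seconds

# e.g. 10 minutes (600 s) of mono audio at 16 kHz with 8-bit (1-byte) samples,
# which lines up approximately with the mono WAV size quoted above:
print(pcm_size_bytes(16_000, 1, 1, 600) / 1e6, "MB")
```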
3.2. Lossless-to-lossy (not recommended)
Lossless-to-lossy transcoding eliminates information from the audio signal that is less important for human speech comprehension. This loss of information nevertheless degrades ASR performance, so this form of transcoding is not recommended.
3.3. Lossy-to-lossy (strongly discouraged)
Any transcoding to a lossy format decreases quality, and lossy-to-lossy transcoding is the worst case: each successive transcoding pass causes a further, progressive loss of quality. This is known as "digital generation loss" or "destructive transcoding" and is irreversible.
3.4. Lossy-to-lossless (strongly discouraged)
Transcoding from lossy to lossless is also strongly discouraged: the audio quality does not improve, but the file size increases.
4. Miscellaneous Noises
Miscellaneous noises in recorded audio can come from various sources such as TVs, music, nearby people, and so on. VoiceAI employs noise detection techniques to eliminate noises from the signal before sending the audio for further processing.
The noise removal process is not perfect; audio sources that share the same frequency range as speech are especially challenging to remove. In a call center environment, it is not uncommon for nearby call center agents to be recorded. If non-speech elements remain in the audio at transcription time, overall transcription accuracy will be negatively affected.
The following best practices ensure minimal noise is recorded:
- Use headsets with high-quality near-field microphones
- Make the recording workspace as quiet as possible
- Use high quality telephony and recording equipment
5. Additional Accuracy Improvements
For additional accuracy improvements, Call Journey recommends tuning methods such as substitutions or custom language models. Substitution is an ASR feature that can automatically correct recurring transcription errors.
Custom language models add new words to the dictionary and capture statistical properties of speech that are specific to customer use cases. Custom language models can be combined with substitutions to deliver even higher levels of accuracy.
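Substitution rules can be thought of as a correction pass over transcript text. The sketch below is purely illustrative of the idea, not Call Journey's actual substitution feature; the rule table, patterns, and function name are all invented:

```python
import re

# Hypothetical rule table: recurring ASR errors -> corrections.
SUBSTITUTIONS = {
    r"\bvoice a i\b": "VoiceAI",
    r"\bcall journey\b": "Call Journey",
}

def apply_substitutions(transcript):
    """Apply each correction rule across the transcript, case-insensitively."""
    for pattern, replacement in SUBSTITUTIONS.items():
        transcript = re.sub(pattern, replacement, transcript,
                            flags=re.IGNORECASE)
    return transcript

print(apply_substitutions("thanks for calling call journey voice a i support"))
# -> "thanks for calling Call Journey VoiceAI support"
```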
“We at Maintrax have been involved in over 300 different speech analytics engagements over the last decade and have worked with many key speech technologies.
We are big fans of Call Journey’s platform and have been watching them with interest over the last couple of years.
From our perspective, in comparison to various platforms, Call Journey’s VoiceAI Ecosystem is very accurate and their architecture drives both speed and adaptability for integration.”