Capturing Audio Data FAQs
VoiceAI is a powerful engine that turns audio into a rich data stream for use in downstream applications such as analytics or CRM systems. The better the quality of the audio, the more information can be retrieved and the higher the accuracy of the resulting data output.
VoiceAI can process both kinds of phone calls – live and recorded.
In the case of live calls, each phone call is split at its origin into two separate channels (one for each speaker) and then processed with a delay of a fraction of a second. Processing is limited by the number of available ports. One VoiceAI server can simultaneously process thousands of conversations per cluster, each divided between two speakers. The solution can be scaled by increasing the number of active servers.
In the case of recorded files, the most important condition for successful processing is the correct format of the audio files. Recorded phone calls can be processed at a rate of approximately 120,000 minutes per day. Note that the processing time can differ depending on how many of the calls contain speech, the ratio of speech to silence, and how many utterances there are per minute.
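As a rough planning aid, the quoted rate can be turned into an estimate of how long a backlog of recordings would take. This is only a sketch: the function name and the example backlog are hypothetical, and it assumes the rate of 120,000 minutes per day holds steady, which the factors above (speech content, silence ratio, utterance density) may not allow.

```python
import math

# Approximate throughput quoted in the text (minutes of audio per day).
MINUTES_PER_DAY = 120_000


def estimate_processing_days(total_audio_minutes, minutes_per_day=MINUTES_PER_DAY):
    """Return the number of whole days needed to process a backlog of audio."""
    if total_audio_minutes < 0 or minutes_per_day <= 0:
        raise ValueError("backlog must be non-negative and the rate positive")
    return math.ceil(total_audio_minutes / minutes_per_day)


# Hypothetical backlog: 50,000 calls averaging 6 minutes each.
backlog_minutes = 50_000 * 6
print(estimate_processing_days(backlog_minutes))  # 300,000 / 120,000 = 2.5, rounded up to 3
```

Treat the result as a lower bound for planning rather than a commitment, since actual processing time varies with the content of the calls.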
VoiceAI will accept these and most likely produce satisfactory results. However, they will be ‘below par’ in comparison with the ‘ideal’ format. The reason is simple: lower audio quality affects the accuracy of the speech-to-text transcription, which is the basis of successful analysis.
Last (and least) are the file formats that require re-coding before the VoiceAI engine can process them. Because of the low audio quality, we cannot guarantee a successful outcome:
G.729 8 bit
MSADPCM 4 bit, 8 kHz, Mono
MSADPCM 4 bit, 8 kHz, Stereo
MSADPCM 4 bit, 16 kHz, Mono
MSADPCM 4 bit, 16 kHz, Stereo
These are the least recommended formats. During re-coding, valuable data is lost, which affects the credibility of the data output. In this case, accuracy falls below 60%.
Below, you will find a list of all formats that VoiceAI supports, along with the types that we are unable to process.
In short, ideal audio is supplied with:
Split-channel or stereo recording (one speaker in each channel)
No compression of the audio files
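Before submitting recordings, it can help to check these two properties on a sample file. The sketch below is one possible way to do so, not part of VoiceAI: it uses Python's standard `wave` module, which reads only uncompressed PCM WAV files, so a file it rejects is treated here as "not plain PCM". The helper names and the demo file are hypothetical, and a file that passes may still have been compressed earlier in its life.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return channel count, bit depth, sample rate and duration of a PCM WAV file.

    Returns None when the standard-library reader rejects the file, which
    happens for WAV files that are not plain uncompressed PCM.
    """
    try:
        with wave.open(path, "rb") as wav:
            frames = wav.getnframes()
            rate = wav.getframerate()
            return {
                "channels": wav.getnchannels(),
                "bits_per_sample": wav.getsampwidth() * 8,
                "sample_rate_hz": rate,
                "duration_s": frames / rate,
            }
    except wave.Error:
        return None


def looks_ideal(info):
    """True for a readable PCM file with two channels (one per speaker is assumed)."""
    return info is not None and info["channels"] == 2


# Demo: write one second of silent 8-bit, 8 kHz stereo audio and inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_stereo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(2)
    out.setsampwidth(1)                     # 1 byte = 8 bits per sample
    out.setframerate(8000)                  # 8 kHz
    out.writeframes(b"\x80" * (8000 * 2))   # 8-bit PCM silence is the byte 0x80

info = describe_wav(demo_path)
print(info, looks_ideal(info))
```

A two-channel file does not by itself prove that each speaker has a dedicated channel, so listening to a short sample is still worthwhile.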
There are two preferred file formats ideal for VoiceAI processing:
PCM8 8 bit, 8 kHz, Stereo
G.711 8 bit, 8 kHz, Stereo
These allow for the fastest, most accurate treatment of audio files. There are also file formats that are compatible with the VoiceAI engine but produce an output that can be affected by the lower audio quality resulting from compression:
PCM8 8 bit, 8 kHz, Mono
PCM16 16 bit, 8 kHz, Mono
PCM16 16 bit, 8 kHz, Stereo
GSM FR 8 kHz, Mono
G.711 8 bit, 8 kHz, Mono
G.723 VAD 8 bit, 8 kHz, Stereo
PCM8 8 bit, 16 kHz, Stereo
PCM16 16 bit, 16 kHz, Mono
PCM16 16 bit, 16 kHz, Stereo
G.711 8 bit, 16 kHz, Mono
G.711 8 bit, 16 kHz, Stereo
Audio compression codecs are “lossy”, meaning that they discard information. For example, GSM FR 16 (GSM6.10) encodes audio at a rate of 13 kbps (kilobits per second). By comparison, typical telephone audio is encoded with G.711 at a rate of 64 kbps. Relative to G.711, nearly 80% of the information has been eliminated from the speech signal in GSM FR 16 encoded audio. We can work with bit rates below 64 kbps, but transcription accuracy drops as the audio bit rate drops. At a rate as low as 13 kbps, metallic ringing compression artifacts are easy to hear in the audio, which makes it difficult even for humans to understand what is being said, particularly where noise or other distortions (muffling, low amplitude, etc.) exist.
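The "nearly 80%" figure follows directly from the two bit rates. As a small worked check (the function name is hypothetical, and bit rate is used here as a simple proxy for retained information):

```python
# G.711 carries 8,000 samples per second at 8 bits per sample.
G711_BPS = 8_000 * 8      # 64,000 bits per second
GSM_FR_BPS = 13_000       # rate quoted in the text


def fraction_discarded(codec_bps, reference_bps=G711_BPS):
    """Share of the reference bit rate that the lower-rate codec does not carry."""
    return 1 - codec_bps / reference_bps


print(f"{fraction_discarded(GSM_FR_BPS):.1%}")  # 79.7%
```

The same function can be applied to any other codec bit rate to compare it against the 64 kbps telephone baseline.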
Although stereo files are the better option, mono files can be adapted through a process called diarization, which splits one audio channel into two, one per speaker.
This enables independent analysis of the agent and caller sides of the conversation, as well as the ability to identify “speaker turns”. Both of these are useful in follow-on analytics. However, this process is not perfect. Compression, noise, and similar-sounding speakers compound errors in the diarization process.
The benefit of 2-channel recording is the ability to detect overtalk easily and, even more importantly, to identify when the agent initiates it, as this is one aspect of agent behaviour that is consistently correlated with degrading caller sentiment. It is valuable to identify agents who do this so that they can receive targeted training.
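To illustrate why separate channels make this easy, here is a minimal sketch of overtalk detection. It is not VoiceAI's implementation. It assumes that each channel has already been reduced to a list of speech segments given as (start, end) times in seconds, for example by a voice activity detector, and it treats whoever started speaking second as the initiator. All names are hypothetical.

```python
from typing import List, NamedTuple, Tuple

Segment = Tuple[float, float]  # (start_s, end_s) of one stretch of speech


class Overtalk(NamedTuple):
    start: float
    end: float
    initiator: str  # "agent" or "caller": whoever started talking second


def find_overtalk(agent: List[Segment], caller: List[Segment]) -> List[Overtalk]:
    """Return every interval in which both channels contain speech."""
    events = []
    for a_start, a_end in agent:
        for c_start, c_end in caller:
            start = max(a_start, c_start)
            end = min(a_end, c_end)
            if start < end:  # the two segments overlap in time
                # The initiator is the speaker whose segment began later,
                # i.e. who started talking while the other was mid-utterance.
                # An exact tie is attributed to the caller (arbitrary choice).
                initiator = "agent" if a_start > c_start else "caller"
                events.append(Overtalk(start, end, initiator))
    return sorted(events)


# Hypothetical call: the caller cuts in at 3.5 s, the agent cuts in at 9.0 s.
agent_speech = [(0.0, 4.0), (9.0, 12.0)]
caller_speech = [(3.5, 10.0)]
for event in find_overtalk(agent_speech, caller_speech):
    print(event)
```

With mono audio, the same analysis would first depend on diarization to recover the two sets of segments, so any diarization errors would carry over into the overtalk results.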
Call Journey CI employs noise detection and removal algorithms to eliminate non-speech elements from the signal before sending the audio down the pipeline for further processing. This process is not perfect, especially with highly compressed audio, where it is harder to distinguish between speech and noise. Audio should first be sampled to determine whether it contains any of these non-speech elements.
With some audio, it is clear that passwords, account numbers, etc. have been redacted from the audio. In other cases, the signal may be cutting in and out somewhat randomly. It’s not clear if these are related. Both of these impact accuracy as they sometimes cut words in half, and in general result in unnatural word adjacencies.
VoiceAI is a native solution and doesn’t require the use of a third party. The audio files are accessible solely by Call Journey. Because the VoiceAI Ecosystem is a stand-alone server (not connected to the internet), there is also no possibility of a data leak. Signing a Non-Disclosure Agreement before undertaking the processing of audio is part of our policy. At Call Journey we regularly deal with sensitive data, and adhering to the highest industry standards is our priority. Cleansed audio files can also be processed, as long as the audio quality hasn’t been affected.
Call Journey can help customers obtain the highest-quality uncompressed audio available from their systems, which often involves improving the configuration settings of the recording solution. We only need access to high-quality audio for a short period of time, after which it can be compressed as much as desired, so integration with our solution has little or no impact on audio storage requirements.