Capturing Audio Data FAQs

VoiceAI is a powerful engine that turns audio into a rich data stream for use in downstream applications such as analytics or CRM systems. The better the quality of the audio, the more information can be retrieved and the higher the accuracy of the resulting data output.

What phone calls can be processed - recorded or live?

VoiceAI can process both kinds of phone calls – live and recorded.

In the case of live calls, each phone call is split at its origin into two separate channels (one for each speaker) and then processed with a delay of a fraction of a second. Processing capacity is limited by the number of available ports. VoiceAI can simultaneously process thousands of conversations per cluster, with each conversation divided between two speakers. The solution can be scaled by adding more active servers.

In the case of recorded files, the most important condition for successful processing is the correct format of the audio files. Recorded phone calls can be processed at a rate of approximately 120,000 minutes per day. Note that the processing time can differ depending on how many of the calls contain speech, the ratio of speech to silence, and how many utterances there are per minute.
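As a rough planning aid, the quoted rate can be turned into a backlog estimate. The sketch below is only an illustration: the function name and the example backlog are assumptions, and, as noted above, the real rate varies with the speech content of the calls.

```python
# Approximate rate quoted above; actual throughput varies with speech content.
MINUTES_PER_DAY = 120_000


def estimated_processing_days(total_audio_minutes, minutes_per_day=MINUTES_PER_DAY):
    """Return a rough number of days needed to process a backlog of recordings."""
    if total_audio_minutes < 0:
        raise ValueError("total_audio_minutes must not be negative")
    return total_audio_minutes / minutes_per_day


# Hypothetical backlog: 50,000 recorded calls averaging 6 minutes each.
backlog_minutes = 50_000 * 6
print(estimated_processing_days(backlog_minutes))  # 2.5
```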

What format is suitable for VoiceAI processing?

Below is a list of the formats that VoiceAI supports, grouped from the most to the least suitable.

In short, ideal audio is supplied with:

Split channel or Stereo (one speaker in each channel)

No compression of the audio files

There are two preferred file formats ideal for VoiceAI processing:

PCM8 8 bit, 8 kHz, Stereo

G.711 8 bit, 8 kHz, Stereo

These allow for the fastest, most accurate treatment of audio files. There are also file formats that are compatible with the VoiceAI engine but produce an output that can be affected by the lower audio quality resulting from compression:

PCM8 8 bit, 8 kHz, Mono

PCM16 16 bit, 8 kHz, Mono

PCM16 16 bit, 8 kHz, Stereo

GSM FR 8 kHz, Mono

G.711 8 bit, 8 kHz, Mono

G.723 VAD 8 bit, 8 kHz, Stereo

PCM8 8 bit, 16 kHz, Stereo

PCM16 16 bit, 16 kHz, Mono

PCM16 16 bit, 16 kHz, Stereo

G.711 8 bit, 16 kHz, Mono

G.711 8 bit, 16 kHz, Stereo

VoiceAI will accept these and most likely produce satisfactory results. However, they fall short of the ‘ideal’ formats. The reason is simple: lower audio quality affects the accuracy of the speech-to-text transcription, which is the basis of successful analysis.

And last (as well as least) are the file formats that require re-coding in order to be digestible by the VoiceAI engine. Due to the low quality of the audio, we cannot guarantee a successful outcome:

G.729 8 bit

MSADPCM 4 bit, 8 kHz, Mono

MSADPCM 4 bit, 8 kHz, Stereo

MSADPCM 4 bit, 16 kHz, Mono

MSADPCM 4 bit, 16 kHz, Stereo

These are the least recommended formats. During re-coding, valuable data is lost, which affects the credibility of the data output. In this case, accuracy falls below 60%.
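Before submitting recordings, it can help to check what format they are actually in. The sketch below is not part of VoiceAI: it uses Python's standard `wave` module, and the function names are hypothetical. It reports the channel count, bit depth and sample rate of a WAV file. The `wave` module only reads uncompressed PCM WAV data, so G.711, GSM, G.729 or MSADPCM recordings would need a different tool.

```python
import wave


def describe_wav(path):
    """Return (channels, bits_per_sample, sample_rate_hz) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth() * 8, wav.getframerate()


def is_stereo_pcm_wav(path):
    """True if the file is readable as uncompressed PCM WAV and has two channels."""
    try:
        channels, _bits, _rate = describe_wav(path)
    except wave.Error:
        # Raised when the file is not a WAV file that the wave module can read,
        # for example because the audio is stored in a compressed encoding.
        return False
    return channels == 2
```

For example, `describe_wav("call.wav")` (a hypothetical file name) would return `(2, 8, 8000)` for a PCM8 8 bit, 8 kHz, Stereo recording.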

What is the best compression bit rate?

Audio compression codecs are “lossy”, meaning that they discard information. For example, GSM FR 16 (GSM 06.10) encodes audio at a rate of 13 kbps (kilobits per second). By comparison, typical telephone audio is encoded with G.711 at a rate of 64 kbps. Relative to G.711, nearly 80% of the information has been eliminated from the speech signal in GSM FR 16 encoded audio. We can work with bit rates below 64 kbps, but transcription accuracy drops as the audio bit rate drops. At a rate as low as 13 kbps, it is easy to hear metallic, ringing compression artifacts in the audio, which make it difficult even for humans to understand what is being said when noise or other distortions (muffling, low amplitude, etc.) are present.
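The "nearly 80%" figure follows directly from the two bit rates. A minimal sketch of the arithmetic, using a hypothetical helper name:

```python
def information_discarded(codec_kbps, reference_kbps=64.0):
    """Fraction of the reference bit rate that a lower-rate codec leaves out."""
    return 1.0 - codec_kbps / reference_kbps


# 13 kbps audio compared with 64 kbps G.711 telephone audio.
print(f"{information_discarded(13):.1%}")  # 79.7%
```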

Which file do you prefer to work with - Mono or Stereo?

Although stereo files are the better option, mono files can be adapted through a process called diarization, which splits one audio channel into two, one per speaker.

This enables independent analysis of agent and caller sides of the conversation, as well as the ability to identify “speaker turns”. Both of these are useful in follow-on analytics. However, this process is not perfect. Compression, noise, and similar sounding speakers compound errors in the diarization process.

The benefit of 2-channel recording is the ability to easily detect overtalk and, even more importantly, to identify when the agent initiates it, as this is one aspect of agent behaviour that is consistently correlated with degrading caller sentiment. It is valuable to identify agents who do this so that they can receive targeted training.
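To illustrate why two channels make overtalk easy to find, the sketch below assumes each channel has already been reduced to a list of (start, end) speech segments in seconds. That input format, the function name and the example timings are assumptions for illustration, not a description of how VoiceAI works internally. Overtalk is any stretch where both speakers talk at once, and the initiator is taken to be whoever started speaking later.

```python
def find_overtalk(agent_segments, caller_segments):
    """Return sorted (start, end, initiator) tuples for overlapping speech.

    Each argument is a list of (start_seconds, end_seconds) speech segments
    for one channel. The initiator is the speaker who began talking while
    the other speaker was already talking.
    """
    events = []
    for a_start, a_end in agent_segments:
        for c_start, c_end in caller_segments:
            start = max(a_start, c_start)
            end = min(a_end, c_end)
            if start < end:  # the two segments overlap in time
                if a_start > c_start:
                    initiator = "agent"
                elif c_start > a_start:
                    initiator = "caller"
                else:
                    initiator = "both"
                events.append((start, end, initiator))
    return sorted(events)


# Hypothetical segments for one call.
agent = [(0.0, 4.0), (6.5, 9.0)]
caller = [(3.5, 7.0), (10.0, 12.0)]
print(find_overtalk(agent, caller))
# [(3.5, 4.0, 'caller'), (6.5, 7.0, 'agent')]
```

With a mono recording, the same segments would first have to be recovered through diarization, so any diarization error carries over into the overtalk results.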

How do you deal with beeps & music?

Call Journey CI employs noise detection and removal algorithms to eliminate these from the signal before sending the audio down the pipeline for further processing. This process is not perfect, especially with highly compressed audio, where it is harder to distinguish between speech and noise. Audio should first be sampled to determine if it contains any of these non-speech elements.
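When sampling audio for beeps, one simple way to flag a steady tone is the Goertzel algorithm, which measures how much of a signal's energy sits at a single frequency. The sketch below is a generic illustration of that technique, not Call Journey's detection method. The function name and the 1000 Hz test tone are assumptions.

```python
import math


def tone_energy_ratio(samples, sample_rate_hz, tone_hz):
    """Fraction of the signal's energy at tone_hz (about 1.0 for a pure tone)."""
    n = len(samples)
    energy = sum(x * x for x in samples)
    if n == 0 or energy == 0:
        return 0.0
    # Goertzel recurrence for the frequency bin closest to tone_hz.
    k = round(n * tone_hz / sample_rate_hz)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    power = s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2
    # By Parseval's theorem, a real signal's energy in this bin is 2 * power / n.
    return 2.0 * power / (n * energy)


# Hypothetical 0.1 second, 1000 Hz beep sampled at 8 kHz.
beep = [math.sin(2 * math.pi * 1000 * i / 8000) for i in range(800)]
print(tone_energy_ratio(beep, 8000, 1000) > 0.9)  # True
```

A short window whose ratio stays close to 1.0 at one frequency is more likely a beep than speech, which spreads its energy across many frequencies.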

Manual Redaction / Signal Cutting Out

With some audio, it is clear that passwords, account numbers, etc. have been redacted from the audio. In other cases, the signal may be cutting in and out somewhat randomly. It’s not clear if these are related. Both of these impact accuracy as they sometimes cut words in half, and in general result in unnatural word adjacencies.

How do you handle sensitive data?

VoiceAI is a native solution and doesn’t require the use of a third party. The audio files are accessible solely by Call Journey. Because the VoiceAI Ecosystem is a stand-alone server (not connected to the internet), there is also no possibility of a data leak. Signing a Non-Disclosure Agreement before undertaking the processing of audio is part of our policy. At Call Journey we regularly deal with sensitive data, and adhering to the highest industry standards is our priority. Cleansed audio files can also be processed, as long as the audio quality hasn’t been affected.

Call Journey can help customers obtain the highest-quality uncompressed audio available from their systems, which often involves improving the configuration settings of the recording solution. We only need access to high-quality audio for a short period of time, after which it can be compressed as much as desired, so integration with our solution has little or no impact on audio storage space requirements.