CAPTURING AUDIO DATA FAQS

VoiceAI is a powerful engine that turns audio into a rich data stream for use in downstream applications such as analytics or CRM systems.

The better the quality of the audio, the more information can be retrieved and the higher the accuracy of the resulting data output.

What phone calls can be processed - recorded or live?

VoiceAI can process both kinds of phone calls – live and recorded.

  • In the case of live calls, each phone call is split at its origin into two separate channels (one for each speaker) and then processed with a delay of a fraction of a second. Throughput is limited by the number of available ports. VoiceAI can simultaneously process thousands of conversations per cluster, each divided between two speakers. The solution can be scaled by adding more active servers.
  • In the case of recorded files, the most important condition for successful processing is the correct format of the audio files. Recorded phone calls can be processed at a rate of approximately 120,000 minutes per day. Note that processing time varies with how many of the calls contain speech, the ratio of speech to silence, and the number of utterances per minute.
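
As a rough planning aid, the quoted rate for recorded calls can be turned into an estimate of how long a backlog would take to process. This is only a sketch: it assumes the approximate figure of 120,000 minutes per day holds, which the factors listed above may change, and the function name and the example backlog are hypothetical.

```python
import math

# Approximate rate quoted for recorded calls; real throughput varies with speech content.
MINUTES_PER_DAY = 120_000

def days_to_process(total_audio_minutes: float, rate_per_day: float = MINUTES_PER_DAY) -> int:
    """Return the whole number of days needed to clear a backlog of recordings."""
    if total_audio_minutes < 0 or rate_per_day <= 0:
        raise ValueError("minutes must be non-negative and rate must be positive")
    return math.ceil(total_audio_minutes / rate_per_day)

# Hypothetical backlog: 50,000 calls averaging 6 minutes each (300,000 minutes).
backlog_minutes = 50_000 * 6
print(days_to_process(backlog_minutes))  # 300,000 / 120,000 = 2.5, rounded up to 3
```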

What file format is suitable for VoiceAI Processing?

Below, you will find a list of all formats that VoiceAI supports, along with the types that we are unable to process.

In short, ideal audio is supplied with:

  • Split channel or Stereo (one speaker in each channel)
  • No compression of the audio files

There are two preferred file formats ideal for VoiceAI processing:

  • PCM8 8 bit, 8 kHz, Stereo
  • G.711 8 bit, 8 kHz, Stereo

These allow for the fastest, most accurate treatment of audio files. There are also file formats that are compatible with the VoiceAI engine but produce an output that can be affected by the lower audio quality resulting from compression:

  • PCM8 8 bit, 8 kHz, Mono
  • PCM16 16 bit, 8 kHz, Mono
  • PCM16 16 bit, 8 kHz, Stereo
  • GSM FR 8 kHz, Mono
  • G.711 8 bit, 8 kHz, Mono
  • G.723 VAD 8 bit, 8 kHz, Stereo
  • PCM8 8 bit, 16 kHz, Stereo
  • PCM16 16 bit, 16 kHz, Mono
  • PCM16 16 bit, 16 kHz, Stereo
  • G.711 8 bit, 16 kHz, Mono
  • G.711 8 bit, 16 kHz, Stereo

VoiceAI will accept these and most likely produce satisfactory results. However, they will be a notch below the ‘ideal’ format. The reason is simple: lower audio quality affects the accuracy of the speech-to-text transcription, which is the basis of successful analysis.

And last (as well as least) are the file formats that require re-coding in order to be digestible by the VoiceAI engine. Due to the low quality of the audio, we cannot guarantee a successful outcome:

  • G.729 8 bit
  • MSADPCM 4 bit, 8 kHz, Mono
  • MSADPCM 4 bit, 8 kHz, Stereo
  • MSADPCM 4 bit, 16 kHz, Mono
  • MSADPCM 4 bit, 16 kHz, Stereo

These are the least recommended formats. During re-coding, valuable data is lost, which affects the credibility of the data output. In this case, accuracy falls below 60%.
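
Before submitting recordings, it can help to check which of the formats above a file actually uses. The sketch below reads the channel count, sample rate and bit depth of a WAV file with Python's standard-library wave module. It is an illustration, not part of VoiceAI: the function names are hypothetical, and the wave module reads uncompressed PCM only, so G.711, GSM and other encoded files need a different tool.

```python
import io
import wave

def describe_pcm_wav(source):
    """Return (channels, sample_rate_hz, bits_per_sample) for a PCM WAV file.

    `source` may be a file path or a binary file object. Encodings other
    than uncompressed PCM make wave.open raise wave.Error.
    """
    with wave.open(source, "rb") as wav:
        return wav.getnchannels(), wav.getframerate(), wav.getsampwidth() * 8

def is_preferred_pcm(source):
    """True if the file matches the preferred PCM8 8 bit, 8 kHz, Stereo layout."""
    return describe_pcm_wav(source) == (2, 8000, 8)

# Build one second of silent 8 bit, 8 kHz stereo audio in memory to try it out.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(2)       # one channel per speaker
    out.setsampwidth(1)       # 1 byte per sample = 8 bit
    out.setframerate(8000)    # 8 kHz
    out.writeframes(bytes([128]) * 2 * 8000)  # 128 is silence in unsigned 8 bit PCM
buffer.seek(0)
print(describe_pcm_wav(buffer))  # (2, 8000, 8)
```

In practice `source` would be the path of a recording exported from the telephony system.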

What is the best compression bit rate?

Audio compression codecs are "lossy", meaning that they discard information. For example, GSM FR 16 (GSM 06.10) encodes audio at a rate of 13 kbps (bps = bits per second). By comparison, typical telephone audio is encoded with G.711 at a rate of 64 kbps. Relative to G.711, nearly 80% of the information has been eliminated from the speech signal in GSM FR 16 encoded audio. We can work with bit rates below 64 kbps, but transcription accuracy drops as the audio bit rate drops. At a rate as low as 13 kbps, it is easy to hear metallic, ringing compression artifacts in the audio, which make it difficult even for humans to understand what is being said, especially where noise or other distortions (muffling, low amplitude, etc.) exist.
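
The "nearly 80%" figure follows directly from the two bit rates. A minimal check, using only the 13 and 64 kbps values quoted above (the function name is illustrative):

```python
def information_discarded(codec_kbps: float, reference_kbps: float = 64.0) -> float:
    """Fraction of the reference bit rate that a lower-rate codec removes."""
    return 1.0 - codec_kbps / reference_kbps

reduction = information_discarded(13.0)
print(f"{reduction:.1%}")  # 1 - 13/64 = 0.796875, printed as 79.7%
```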

Which file do you prefer to work with - Mono or Stereo?

Although stereo files are the better option, mono files can be adapted through a process called diarization, which separates a single audio channel into two, one for each speaker.

This enables independent analysis of the agent and caller sides of the conversation, as well as the ability to identify "speaker turns". Both of these are useful in follow-on analytics. However, this process is not perfect: compression, noise, and similar-sounding speakers compound errors in the diarization process.

The benefit of 2-channel recording is the ability to detect overtalk easily and, even more importantly, to identify when the agent initiates overtalk, as this is one aspect of agent behaviour that is consistently correlated with degrading caller sentiment. It is valuable to identify agents who are doing this so that they can be given targeted training.
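
To illustrate why separate channels make overtalk easy to find, the sketch below assumes that each channel has already been reduced to a list of utterance intervals in seconds (for example by a voice activity detector). It does not show how VoiceAI detects overtalk; the names `Overtalk` and `find_overtalk` and the sample timings are hypothetical. Any interval where both speakers are talking is reported, and the speaker who started second is treated as the initiator.

```python
from typing import List, NamedTuple, Tuple

Segment = Tuple[float, float]  # (start, end) of one utterance, in seconds

class Overtalk(NamedTuple):
    start: float     # moment at which both speakers are talking
    end: float       # moment at which one of them stops
    initiator: str   # "agent", "caller", or "both" for simultaneous starts

def find_overtalk(agent: List[Segment], caller: List[Segment]) -> List[Overtalk]:
    """Return every interval in which agent and caller speak at the same time."""
    events = []
    for a_start, a_end in agent:
        for c_start, c_end in caller:
            start, end = max(a_start, c_start), min(a_end, c_end)
            if start < end:  # the two utterances overlap
                if a_start > c_start:
                    initiator = "agent"    # agent began while caller was speaking
                elif c_start > a_start:
                    initiator = "caller"
                else:
                    initiator = "both"
                events.append(Overtalk(start, end, initiator))
    return sorted(events)

agent_turns = [(0.0, 4.0), (9.5, 12.0)]
caller_turns = [(4.5, 10.0), (13.0, 15.0)]
print(find_overtalk(agent_turns, caller_turns))
# [Overtalk(start=9.5, end=10.0, initiator='agent')]
```

With mono audio the same analysis depends on diarization first assigning each utterance to a speaker, so any diarization error carries over into the overtalk results.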

How do you deal with beeps & music?

Call Journey employs noise detection and removal algorithms to eliminate these from the signal before sending the audio down the pipeline for further processing. This process is not perfect, especially with highly compressed audio, where it is harder to distinguish between speech and noise. Audio should first be sampled to determine whether it contains any of these non-speech elements.

Manual Redaction / Signal Cutting Out

With some audio, it is clear that passwords, account numbers, etc. have been redacted from the audio. In other cases, the signal may be cutting in and out somewhat randomly. It is not clear whether the two are related. Both affect accuracy, as they sometimes cut words in half and, in general, result in unnatural word adjacencies.

How do you handle sensitive data?

VoiceAI is a native solution and doesn’t require the use of a third party. The audio files are accessible solely by Call Journey. Because the VoiceAI Ecosystem is a stand-alone server (not connected to the internet), there is also no possibility of a data leak. Signing a Non-Disclosure Agreement before undertaking the processing of audio is part of our policy. At Call Journey we regularly deal with sensitive data, and adhering to the highest industry standards is our priority. Cleansed audio files can also be processed, as long as the audio quality hasn’t been affected.

Call Journey can help customers obtain the highest-quality uncompressed audio available from their systems, which often involves improving the configuration settings of the recording solution. We only need access to high-quality audio for a short period of time, after which it can be compressed as much as desired, so integration with our solution has little or no impact on audio storage space requirements.

Do you have samples for output file size?

The table below sets out the average file size based on the VoiceAI output type.

Duration   XML       JSON      CSV
1 MIN      0.04 MB   0.03 MB   0.01 MB
1 HR       2.22 MB   2 MB      1.09 MB
24 HRS     53.5 MB   48.3 MB   26.2 MB
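
For capacity planning, the per-hour averages can be scaled to a larger volume of audio. The sketch below assumes output size grows linearly with duration, which the 1 HR and 24 HRS rows roughly support; the function name is hypothetical and the results are estimates only.

```python
# Average output size per hour of audio, taken from the 1 HR row of the table (MB).
MB_PER_HOUR = {"XML": 2.22, "JSON": 2.0, "CSV": 1.09}

def estimate_output_mb(audio_hours: float, output_type: str) -> float:
    """Linearly scale the per-hour average to a given amount of audio."""
    return MB_PER_HOUR[output_type.upper()] * audio_hours

# 120,000 minutes of audio is 2,000 hours.
for output_type in MB_PER_HOUR:
    print(output_type, round(estimate_output_mb(120_000 / 60, output_type)), "MB")
```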

Audio Recording Format Information

The table below sets out the most common codecs used in the telephony industry and their compatibility with the VoiceAI Service.

Compatibility Legend:

1. YES: Fully supported and VoiceAI preferred format.

2. BE (Best efforts): Codec can be re-encoded, but accuracy is likely to be reduced.

3. NO: Not compatible.

4. Untested

Compatibility  Codec information                 Ext   Kbyte/s  30-sec Kbyte  1-min Kbyte  30-min Mbyte  1-hr Mbyte
YES            PCM8 8 bit, 8 kHz, Stereo         wav   15.6     468           936          27.42         54.84
YES            G.711 8 bit, 8 kHz, Stereo        wav   15.6     468           936          27.42         54.84
BE             PCM8 8 bit, 8 kHz, Mono           wav   7.833    235           470          13.77         27.54
BE             PCM16 16 bit, 8 kHz, Mono         wav   15.6     468           936          27.42         54.84
BE             PCM16 16 bit, 8 kHz, Stereo       wav   31.25    937.5         1875         54.93         109.86
BE             GSM FR 8 kHz, Mono                wav   1.667    50            100          2.93          5.86
BE             G.711 8 bit, 8 kHz, Mono          wav   7.817    234.5         469          13.74         27.48
BE             PCM8 8 bit, 16 kHz, Mono          wav   15.6     468           936          27.42         54.84
BE             PCM8 8 bit, 16 kHz, Stereo        wav   31.25    937.5         1875         54.93         109.86
BE             PCM16 16 bit, 16 kHz, Mono        wav   31.25    937.5         1875         54.93         109.86
BE             PCM16 16 bit, 16 kHz, Stereo      wav   62.48    1874.5        3749         109.83        219.67
BE             GSM FR 8 kHz, Mono                wav   1.667    50            100          2.93          5.86
BE             GSM FR 16 kHz, Mono               ogg   1.5      45            90           2.64          5.27
BE             G.711 8 bit, 16 kHz, Mono         wav   15.6     468           936          27.42         54.84
BE             G.711 8 bit, 16 kHz, Stereo       wav   31.25    937.5         1875         54.93         109.86
NO             Speex 8 kHz, Mono                 ogg   0.75     22.5          45           1.32          2.64
NO             Speex 8 kHz, Stereo               ogg   0.975    29.25         58.5         1.71          3.43
NO             Speex VAD 8 kHz, Mono             ogg   0.567    17            34           1             1.99
NO             Speex VAD 8 kHz, Stereo           ogg   0.733    22            44           1.29          2.58
NO             G.723.1 8 bit, 8 kHz, Mono        vf *  0.667    20            40           1.17          2.34
NO             G.723 8 bit, 8 kHz, Stereo        vf *  1.333    40            80           2.34          4.69
NO             G.723 VAD 8 bit, 8 kHz, Mono      vf *  0.567    17            34           1             1.99
NO             G.723 VAD 8 bit, 8 kHz, Stereo    vf *  0.8      24            48           1.41          2.81
NO             Speex 16 kHz, Stereo              ogg   1.95     58.5          117          3.43          6.86
NO             Speex 16 kHz, Mono                ogg   1.133    34            68           1.99          3.98
NO             Speex VAD 16 kHz, Stereo          ogg   1.467    44            88           2.58          5.16
NO             G.723.1 8 bit, 8 kHz, Mono        vf *  0.667    20            40           1.17          2.34
NO             G.723.1 8 bit, 8 kHz, Stereo      vf *  1.333    40            80           2.34          4.69
NO             G.723.1 VAD 8 bit, 8 kHz, Mono    vf *  0.567    17            34           1             1.99
NO             G.723.1 VAD 8 bit, 8 kHz, Stereo  vf *  0.8      24            48           1.41          2.81
Untested       G.729 8 bit                       wav   8.2      246           492          14.41         28.83
Untested       MSADPCM 4 bit, 8 kHz, Mono        wav   3.883    116.5         233          6.83          13.65
Untested       MSADPCM 4 bit, 8 kHz, Stereo      wav   7.817    234.5         469          13.74         27.48
Untested       MSADPCM 4 bit, 16 kHz, Mono       wav   7.817    234.5         469          13.74         27.48
Untested       MSADPCM 4 bit, 16 kHz, Stereo     wav   15.6     468           936          27.42         54.84
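
For uncompressed PCM, the sizes in the table follow from bit depth, sample rate and channel count. The sketch below reproduces the PCM16 16 bit, 8 kHz, Stereo row, assuming 1 Kbyte = 1,024 bytes and ignoring the small file header. The function name is illustrative, and compressed codecs such as GSM or Speex do not follow this formula.

```python
KBYTE = 1024
MBYTE = 1024 * 1024

def pcm_size_bytes(bits_per_sample: int, sample_rate_hz: int,
                   channels: int, seconds: float) -> float:
    """Size of the raw PCM audio payload in bytes, excluding the file header."""
    return bits_per_sample / 8 * sample_rate_hz * channels * seconds

# PCM16 16 bit, 8 kHz, Stereo
per_second_kb = pcm_size_bytes(16, 8000, 2, 1) / KBYTE
thirty_min_mb = pcm_size_bytes(16, 8000, 2, 30 * 60) / MBYTE
one_hour_mb = pcm_size_bytes(16, 8000, 2, 60 * 60) / MBYTE

print(per_second_kb)            # 31.25
print(round(thirty_min_mb, 2))  # 54.93
print(round(one_hour_mb, 2))    # 109.86
```

Some rows appear to have been derived from a rounded per-second value (for example 15.6 rather than 15.625), so computed figures can differ slightly from the table.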