Voice activity detection

Source: Wikipedia, the free encyclopedia.

Voice activity detection (VAD), also known as speech activity detection or speech detection, is the detection of the presence or absence of human speech, used in

Voice over Internet Protocol (VoIP) applications, saving on computation and on network bandwidth
.

VAD is an important enabling technology for a variety of speech-based applications. Therefore, various VAD algorithms have been developed that provide varying features and compromises between

sustained
. Voice activity detection is usually independent of language.

It was first investigated for use on time-assignment speech interpolation (TASI) systems.[3]

Algorithm overview

The typical design of a VAD algorithm is as follows:[citation needed]

  1. There may first be a noise reduction stage, e.g. via spectral subtraction.
  2. Then some features or quantities are calculated from a section of the input signal.
  3. A
    classification rule
    is applied to classify the section as speech or non-speech – often this classification rule finds when a value exceeds a certain threshold.

There may be some feedback in this sequence, in which the VAD decision is used to improve the noise estimate in the noise reduction stage, or to adaptively vary the threshold(s). These feedback operations improve the VAD performance in non-stationary noise (i.e. when the noise varies a lot).[citation needed]

A representative set of recently published VAD methods formulates the decision rule on a frame by frame basis using instantaneous measures of the divergence distance between speech and noise.[citation needed] The different measures which are used in VAD methods include spectral slope, correlation coefficients, log likelihood ratio, cepstral, weighted cepstral, and modified distance measures.[citation needed]

Independently from the choice of VAD algorithm, a compromise must be made between having voice detected as noise, or noise detected as voice (between false positive and false negative). A VAD operating in a mobile phone must be able to detect speech in the presence of a range of very diverse types of acoustic background noise. In these difficult detection conditions it is often preferable that a VAD should fail-safe, indicating speech detected when the decision is in doubt, to lower the chance of losing speech segments. The biggest difficulty in the detection of speech in this environment is the very low signal-to-noise ratios (SNRs) that are encountered. It may be impossible to distinguish between speech and noise using simple level detection techniques when parts of the speech utterance are buried below the noise.

Applications

For a wide range of applications such as digital mobile radio,

storage chips
. However, the improvement depends mainly on the percentage of pauses during speech and the reliability of the VAD used to detect these intervals. On the one hand, it is advantageous to have a low percentage of speech activity. On the other hand, clipping, that is the loss of milliseconds of active speech, should be minimized to preserve quality. This is the crucial problem for a VAD algorithm under heavy noise conditions.

Use in telemarketing

One controversial application of VAD is in conjunction with

predictive dialers used by telemarketing firms. In order to maximize agent productivity, telemarketing firms set up predictive dialers to call more numbers than they have agents available, knowing most calls will end up in either "Ring – No Answer" or answering machines. When a person answers, they typically speak briefly ("Hello", "Good evening", etc.) and then there is a brief period of silence. Answering machine messages are usually 3–15 seconds of continuous speech. By setting VAD parameters correctly, dialers can determine whether a person or a machine answered the call and, if it's a person, transfer the call to an available agent. If it detects an answering machine message, the dialer hangs up. Often, even when the system correctly detects a person answering the call, no agent may be available, resulting in a "silent call". Call screening with a multi-second message like "please say who you are, and I may pick up the phone" will frustrate such automated calls.[citation needed
]

Performance evaluation

To evaluate a VAD, its output using test recordings is compared with those of an "ideal" VAD – created by hand-annotating the presence or absence of voice in the recordings. The performance of a VAD is commonly evaluated on the basis of the following four parameters:[4]

  • FEC (Front End Clipping): clipping introduced in passing from noise to speech activity;
  • MSC (Mid Speech Clipping): clipping due to speech misclassified as noise;
  • OVER: noise interpreted as speech due to the VAD flag remaining active in passing from speech activity to noise;
  • NDS (Noise Detected as Speech): noise interpreted as speech within a silence period.

Although the method described above provides useful objective information concerning the performance of a VAD, it is only an approximate measure of the subjective effect. For example, the effects of speech signal clipping can at times be hidden by the presence of background noise, depending on the model chosen for the comfort noise synthesis, so some of the clipping measured with objective tests is in reality not audible. It is therefore important to carry out subjective tests on VADs, the main aim of which is to ensure that the clipping perceived is acceptable. In VoIP applications, front-end clipping can be reduced by rewinding to shortly before the detection and sending very slightly delayed data.

This kind of test requires a certain number of listeners to judge recordings containing the processing results of the VADs being tested, giving marks to several speech sequences on the following features:

  • Quality;
  • Comprehension difficulty;
  • Audibility of clipping.

These marks are then used to calculate average results for each of the features listed above, thus providing a global estimate of the behavior of the VAD being tested.

To conclude, whereas objective methods are very useful in an initial stage to evaluate the quality of a VAD, subjective methods are more significant. As they require the participation of several people for a few days, increasing cost, they are generally only used when a proposal is about to be standardized.

Implementations

  • One early standard VAD is that developed by
    inverse filtering trained on non-speech segments to filter out background noise, so that it can then more reliably use a simple power-threshold to decide if a voice is present.[5]
  • The G.729 standard calculates the following features for its VAD: line spectral frequencies, full-band energy, low-band energy (<1 kHz), and zero-crossing rate. It applies a simple classification using a fixed decision boundary in the space defined by these features, and then applies smoothing and adaptive correction to improve the estimate.[6]
  • The GSM standard includes two VAD options developed by ETSI.[7] Option 1 computes the SNR in nine bands and applies a threshold to these values. Option 2 calculates different parameters: channel power, voice metrics, and noise power. It then thresholds the voice metrics using a threshold that varies according to the estimated SNR.
  • The Speex audio compression library uses a procedure named Improved Minima Controlled Recursive Averaging, which uses a smoothed representation of spectral power and then looks at the minima of a smoothed periodogram.[8] From version 1.2 it was replaced by what the author called a kludge.[9]
  • Wikimedia tool and project of language documentation
    , using VAD to allow recording many pronunciations in a short amount of time.
  • The VAD Android library[10] utilizes a combination of GMM and DNN models, such as WebRTC GMM, Silero DNN, and Yamnet DNN. The library surpasses many production-grade models in both quality and performance.

See also

References

  1. ^ Manoj Bhatia; Jonathan Davidson; Satish Kalidindi; Sudipto Mukherjee; James Peters (20 October 2006). "VoIP: An In-Depth Analysis - Voice Activity Detection". Cisco.
  2. ].
  3. .
  4. .
  5. .
  6. .
  7. ^ ETSI (1999). "GSM 06.42, Digital cellular telecommunications system (Phase 2+); Half rate speech; Voice Activity Detector (VAD) for half rate speech traffic channels" (Document). ETSI.
  8. .
  9. ^ "Speex VAD algorithm". 30 September 2004.
  10. ^ "Android Voice Activity Detection (VAD) library. Supports WebRTC VAD GMM, Silero VAD DNN, Yamnet VAD DNN models". Github. Retrieved 27 November 2019.