Speech processing
Speech processing is the study of
History
Early attempts at speech processing and recognition focused primarily on understanding a handful of simple phonetic elements such as vowels. In 1952, three researchers at Bell Labs, Stephen Balashek, R. Biddulph, and K. H. Davis, developed a system that could recognize digits spoken by a single speaker.[2] Pioneering work in the field of speech recognition based on analysis of the speech spectrum was reported in the 1940s.[3]
One of the first commercially available speech recognition products was Dragon Dictate, released in 1990. In 1992, technology developed by Lawrence Rabiner and others at Bell Labs was used by AT&T in their Voice Recognition Call Processing service to route calls without a human operator. By this point, the vocabulary of these systems was larger than the average human vocabulary.[6]
By the early 2000s, the dominant speech processing strategy started to shift away from
Techniques
Dynamic time warping
Dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences that may vary in speed. In general, DTW calculates an optimal match between two given sequences (e.g. time series) subject to certain restrictions and rules. The optimal match is the one that satisfies all the restrictions and rules and has the minimal cost, where the cost is computed as the sum, over all matched pairs of indices, of the absolute differences between the matched values.[citation needed]
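The cost minimization described above can be sketched with a small dynamic program. This is a minimal illustration, not a production implementation: the function name `dtw_distance` is hypothetical, and the restrictions are assumed to be the common ones (the first and last elements must be matched to each other, and the alignment may only move forward in both sequences).

```python
def dtw_distance(a, b):
    """Return the minimal cumulative cost of aligning sequences a and b.

    Assumes the usual DTW restrictions: the alignment starts at the first
    elements, ends at the last elements, and never moves backwards.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: absolute difference of the matched values.
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The second sequence is a slowed-down version of the first, so the cost is 0.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

The table has one entry per pair of indices, so the sketch takes time and memory proportional to the product of the two sequence lengths.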
Hidden Markov models
A hidden Markov model can be represented as the simplest dynamic Bayesian network. The goal of the algorithm is to estimate a hidden variable x(t) given a list of observations y(t). By applying the Markov property, the conditional probability distribution of the hidden variable x(t) at time t, given the values of the hidden variable x at all times, depends only on the value of the hidden variable x(t − 1). Similarly, the value of the observed variable y(t) only depends on the value of the hidden variable x(t) (both at time t).[citation needed]
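One common way to estimate the hidden sequence from the observations is the Viterbi algorithm, which relies on exactly the two dependencies described above. The sketch below is illustrative only: the function name `viterbi`, the two states ("silence" and "speech"), the two observation symbols ("low" and "high" energy), and all probabilities are made-up assumptions, not values from any real system.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state path and its probability."""
    # best[t][s]: probability of the most likely path that ends in state s at time t.
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that most likely path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Markov property: only the state at time t - 1 matters.
            prev, p = max(
                ((r, best[t - 1][r] * trans_p[r][s]) for r in states),
                key=lambda pair: pair[1],
            )
            # The observation at time t depends only on the state at time t.
            best[t][s] = p * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace the best path backwards from the most likely final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical toy model (all numbers are assumptions for illustration).
states = ["silence", "speech"]
start_p = {"silence": 0.8, "speech": 0.2}
trans_p = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.2, "speech": 0.8},
}
emit_p = {
    "silence": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.2, "high": 0.8},
}

path, prob = viterbi(["low", "high", "high"], states, start_p, trans_p, emit_p)
print(path)  # ['silence', 'speech', 'speech']
```

Practical implementations usually work with log-probabilities to avoid numerical underflow on long observation sequences; plain products are used here only to keep the sketch short.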
Artificial neural networks
Phase-aware processing
Phase is usually assumed to be a uniformly distributed random variable and thus useless. This is due to the wrapping of phase:
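Phase wrapping can be illustrated with a short sketch. The function names `wrapped_phase` and `unwrap` are hypothetical, and the example assumes that consecutive true phase values differ by less than π, which is the condition under which simple unwrapping recovers them.

```python
import cmath
import math


def wrapped_phase(true_phase):
    """Phase recovered from a complex value; always lies within [-pi, pi]."""
    return cmath.phase(cmath.exp(1j * true_phase))


def unwrap(phases):
    """Remove 2*pi jumps so that consecutive samples differ by at most pi."""
    out = [phases[0]]
    for p in phases[1:]:
        delta = p - out[-1]
        # Subtract the multiple of 2*pi that brings the step closest to zero.
        delta -= 2 * math.pi * round(delta / (2 * math.pi))
        out.append(out[-1] + delta)
    return out


# A steadily increasing phase looks discontinuous once it is wrapped.
true = [0.5 * k for k in range(20)]
wrapped = [wrapped_phase(p) for p in true]
recovered = unwrap(wrapped)
```

Here `wrapped` repeatedly jumps back by 2π even though `true` grows linearly, while `recovered` follows `true` again up to rounding error.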
Applications
- Interactive voice response
- Virtual assistants
- Voice identification
- Emotion recognition
- Call center automation
- Robotics
See also
- Computational audiology
- Neurocomputational speech processing
- Speech coding
- Speech technology
- Natural language processing
References
- arXiv:1911.02388 [eess.AS].
- ISBN 9780080448541
- Myasnikov, L. L.; Myasnikova, Ye. N. (1970). Automatic recognition of sound pattern (in Russian). Leningrad: Energiya.
- ISSN 1932-8346.
- "VC&G - VC&G Interview: 30 Years Later, Richard Wiggins Talks Speak & Spell Development".
- S2CID 6175701.
- S2CID 13058142. Retrieved 2017-12-03.
- ISBN 978-1-119-23882-9.
- Kulmer, Josef; Mowlaee, Pejman (April 2015). "Harmonic phase estimation in single-channel speech enhancement using von Mises distribution and prior SNR". Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE. pp. 5063–5067.
- S2CID 15503015. Retrieved 2017-12-03.
- S2CID 17409161. Retrieved 2017-12-03.