Baum–Welch algorithm
In electrical engineering, statistical computing and bioinformatics, the Baum–Welch algorithm is a special case of the expectation–maximization (EM) algorithm used to find the unknown parameters of a hidden Markov model (HMM). It makes use of the forward–backward algorithm to compute the statistics for the expectation step.
History
The Baum–Welch algorithm was named after its inventors Leonard E. Baum and Lloyd R. Welch. The algorithm and hidden Markov models were first described in a series of articles by Baum and his peers at the IDA Center for Communications Research, Princeton, in the late 1960s and early 1970s.[1] One of the first major applications of HMMs was to the field of speech processing.[2] In the 1980s, HMMs were emerging as a useful tool in the analysis of biological systems and information, and in particular genetic information.[3] They have since become an important tool in the probabilistic modeling of genomic sequences.[4]
Description
A hidden Markov model describes the joint probability of a collection of "hidden" and observed discrete random variables. It relies on the assumption that the current hidden variable, given the previous hidden variable, is independent of all earlier hidden variables, and that the current observation variables depend only on the current hidden state.
The Baum–Welch algorithm uses the well-known EM algorithm to find the maximum likelihood estimate of the parameters of a hidden Markov model given a set of observed sequences.
Let $X_t$ be a discrete hidden random variable with $N$ possible values (i.e. we assume there are $N$ states in total). We assume that $P(X_t \mid X_{t-1})$ is independent of time $t$, which leads to the definition of the time-independent stochastic transition matrix

$$A = \{a_{ij}\} = P(X_t = j \mid X_{t-1} = i).$$

The initial state distribution (i.e. when $t = 1$) is given by

$$\pi_i = P(X_1 = i).$$

The observation variables $Y_t$ can take one of $K$ possible values. We also assume the observation given the "hidden" state is time independent. The probability of a certain observation $y_i$ at time $t$ for state $X_t = j$ is given by

$$b_j(y_i) = P(Y_t = y_i \mid X_t = j).$$

Taking into account all the possible values of $Y_t$ and $X_t$, we obtain the $N \times K$ matrix $B = \{b_j(y_i)\}$, where $b_j$ belongs to all the possible states and $y_i$ belongs to all the observations.

An observation sequence is given by $Y = (Y_1 = y_1, Y_2 = y_2, \ldots, Y_T = y_T)$.

Thus we can describe a hidden Markov chain by $\theta = (A, B, \pi)$. The Baum–Welch algorithm finds a local maximum for $\theta^{*} = \operatorname{arg\,max}_{\theta} P(Y \mid \theta)$ (i.e. the HMM parameters $\theta$ that maximize the probability of the observation).[5]
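To make the notation concrete, the parameter set $\theta = (A, B, \pi)$ of a small discrete HMM can be written down directly as arrays. The sketch below is illustrative only: the two-state, two-symbol sizes and the specific numbers (the initial guesses from the worked example later in this article) are assumptions made here, and NumPy is used purely for convenience.

```python
import numpy as np

# Hypothetical two-state (N = 2), two-symbol (K = 2) HMM, theta = (A, B, pi).
# A[i, j] = P(X_{t+1} = j | X_t = i): row-stochastic transition matrix.
A = np.array([[0.5, 0.5],
              [0.3, 0.7]])

# B[j, k] = P(Y_t = v_k | X_t = j): emission matrix.
B = np.array([[0.3, 0.7],
              [0.8, 0.2]])

# pi[i] = P(X_1 = i): initial state distribution.
pi = np.array([0.2, 0.8])

# An observation sequence Y = (y_1, ..., y_T), encoded as symbol indices 0..K-1.
Y = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])
```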
Algorithm
Set $\theta = (A, B, \pi)$ with random initial conditions. They can also be set using prior information about the parameters if it is available; this can speed up the algorithm and also steer it toward the desired local maximum.
Forward procedure
Let $\alpha_i(t) = P(Y_1 = y_1, \ldots, Y_t = y_t, X_t = i \mid \theta)$, the probability of seeing the observations $y_1, y_2, \ldots, y_t$ and being in state $i$ at time $t$. This is found recursively:

$$\alpha_i(1) = \pi_i \, b_i(y_1),$$

$$\alpha_i(t+1) = b_i(y_{t+1}) \sum_{j=1}^{N} \alpha_j(t) \, a_{ji}.$$

Since this series converges exponentially to zero, the algorithm will numerically underflow for longer sequences.[6] However, this can be avoided in a slightly modified algorithm by scaling $\alpha$ in the forward procedure and $\beta$ in the backward procedure below.
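A minimal sketch of such a scaled forward pass is shown below, assuming the NumPy arrays `A`, `B`, `pi` and the integer-encoded sequence `Y` from the earlier sketch. The helper name `forward` and the convention of returning the scaling factors `c` alongside the scaled $\alpha$ values are choices made here, not part of the algorithm's definition.

```python
import numpy as np

def forward(Y, A, B, pi):
    """Scaled forward procedure: returns the scaled alphas and the scaling factors c."""
    T, N = len(Y), A.shape[0]
    alpha = np.zeros((T, N))
    c = np.zeros(T)                       # c[t] = 1 / (sum over i of the unscaled row t)

    alpha[0] = pi * B[:, Y[0]]            # alpha_i(1) = pi_i * b_i(y_1)
    c[0] = 1.0 / alpha[0].sum()
    alpha[0] *= c[0]

    for t in range(1, T):
        # alpha_i(t+1) = b_i(y_{t+1}) * sum_j alpha_j(t) * a_{ji}
        alpha[t] = B[:, Y[t]] * (alpha[t - 1] @ A)
        c[t] = 1.0 / alpha[t].sum()
        alpha[t] *= c[t]                  # rescale each step to avoid underflow

    return alpha, c
```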
Backward procedure
Let $\beta_i(t) = P(Y_{t+1} = y_{t+1}, \ldots, Y_T = y_T \mid X_t = i, \theta)$, that is the probability of the ending partial sequence $y_{t+1}, \ldots, y_T$ given starting state $i$ at time $t$. We calculate $\beta_i(t)$ recursively as

$$\beta_i(T) = 1,$$

$$\beta_i(t) = \sum_{j=1}^{N} \beta_j(t+1) \, a_{ij} \, b_j(y_{t+1}).$$
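A matching sketch of the scaled backward pass, reusing the scaling factors from the hypothetical `forward` helper above so that the scaled $\alpha$ and $\beta$ values can later be combined without underflow:

```python
import numpy as np

def backward(Y, A, B, c):
    """Scaled backward procedure, reusing the scaling factors c from the forward pass."""
    T, N = len(Y), A.shape[0]
    beta = np.zeros((T, N))

    beta[T - 1] = c[T - 1]                # beta_i(T) = 1, scaled like the alphas
    for t in range(T - 2, -1, -1):
        # beta_i(t) = sum_j a_{ij} * b_j(y_{t+1}) * beta_j(t+1)
        beta[t] = A @ (B[:, Y[t + 1]] * beta[t + 1])
        beta[t] *= c[t]                   # apply the same scaling as the forward pass

    return beta
```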
Update
We can now calculate the temporary variables, according to Bayes' theorem:

$$\gamma_i(t) = P(X_t = i \mid Y, \theta) = \frac{P(X_t = i, Y \mid \theta)}{P(Y \mid \theta)} = \frac{\alpha_i(t)\,\beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t)\,\beta_j(t)},$$

which is the probability of being in state $i$ at time $t$ given the observed sequence $Y$ and the parameters $\theta$, and

$$\xi_{ij}(t) = P(X_t = i, X_{t+1} = j \mid Y, \theta) = \frac{P(X_t = i, X_{t+1} = j, Y \mid \theta)}{P(Y \mid \theta)} = \frac{\alpha_i(t)\, a_{ij}\, b_j(y_{t+1})\, \beta_j(t+1)}{\sum_{k=1}^{N} \sum_{w=1}^{N} \alpha_k(t)\, a_{kw}\, b_w(y_{t+1})\, \beta_w(t+1)},$$

which is the probability of being in state $i$ and $j$ at times $t$ and $t+1$ respectively given the observed sequence $Y$ and parameters $\theta$.

The denominators of $\gamma_i(t)$ and $\xi_{ij}(t)$ are the same; they represent the probability of making the observation $Y$ given the parameters $\theta$.
The parameters of the hidden Markov model can now be updated:

$$\pi_i^{*} = \gamma_i(1),$$

which is the expected frequency spent in state $i$ at time $1$.

$$a_{ij}^{*} = \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)},$$
which is the expected number of transitions from state i to state j compared to the expected total number of transitions away from state i. To clarify, the number of transitions away from state i does not mean transitions to a different state j, but to any state including itself. This is equivalent to the number of times state i is observed in the sequence from t = 1 to t = T − 1.
$$b_i^{*}(v_k) = \frac{\sum_{t=1}^{T} 1_{y_t = v_k}\,\gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)},$$

where

$$1_{y_t = v_k} = \begin{cases} 1 & \text{if } y_t = v_k, \\ 0 & \text{otherwise} \end{cases}$$

is an indicator function, and $b_i^{*}(v_k)$ is the expected number of times the output observations have been equal to $v_k$ while in state $i$ over the expected total number of times in state $i$.
These steps are now repeated iteratively until a desired level of convergence.
Note: It is possible to over-fit a particular data set. That is, $P(Y \mid \theta_{\text{final}}) > P(Y \mid \theta_{\text{true}})$. The algorithm also does not guarantee a global maximum.
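Putting the pieces together, one possible single-sequence implementation of the expectation and update steps, together with the outer iteration loop, is sketched below. It relies on the hypothetical `forward` and `backward` helpers above, normalizes $\gamma$ and $\xi$ at each time step (which cancels the scaling factors), and simply runs for a fixed number of iterations; a practical implementation would instead monitor the log-likelihood for convergence.

```python
import numpy as np

def baum_welch(Y, A, B, pi, n_iter=100):
    """Re-estimate (A, B, pi) from a single observation sequence Y of symbol indices."""
    Y = np.asarray(Y)
    K = B.shape[1]

    for _ in range(n_iter):
        alpha, c = forward(Y, A, B, pi)   # scaled forward pass (sketched above)
        beta = backward(Y, A, B, c)       # scaled backward pass (sketched above)

        # gamma[t, i] = P(X_t = i | Y, theta); per-step normalisation cancels the scaling.
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)

        # xi[t, i, j] = P(X_t = i, X_{t+1} = j | Y, theta), for t = 1 .. T-1.
        xi = (alpha[:-1, :, None] * A[None, :, :]
              * B[:, Y[1:]].T[:, None, :] * beta[1:, None, :])
        xi /= xi.sum(axis=(1, 2), keepdims=True)

        # Update step, mirroring the re-estimation formulas above.
        pi = gamma[0].copy()
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.stack([gamma[Y == k].sum(axis=0) for k in range(K)], axis=1)
        B /= gamma.sum(axis=0)[:, None]

    return A, B, pi
```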
Multiple sequences
The algorithm described thus far assumes a single observed sequence $Y = (y_1, \ldots, y_T)$. However, in many situations, there are several sequences observed: $Y_1, \ldots, Y_R$. In this case, the information from all of the observed sequences must be used in the update of the parameters $A$, $B$, and $\pi$. Assuming that you have computed $\gamma_{ir}(t)$ and $\xi_{ijr}(t)$ for each sequence $y_{1,r}, \ldots, y_{T_r,r}$, the parameters can now be updated:

$$\pi_i^{*} = \frac{\sum_{r=1}^{R} \gamma_{ir}(1)}{R},$$

$$a_{ij}^{*} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r - 1} \xi_{ijr}(t)}{\sum_{r=1}^{R} \sum_{t=1}^{T_r - 1} \gamma_{ir}(t)},$$

$$b_i^{*}(v_k) = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} 1_{y_{t,r} = v_k}\,\gamma_{ir}(t)}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \gamma_{ir}(t)},$$

where

$$1_{y_{t,r} = v_k} = \begin{cases} 1 & \text{if } y_{t,r} = v_k, \\ 0 & \text{otherwise} \end{cases}$$

is an indicator function.
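In code, the pooled update might look like the following sketch, assuming the per-sequence posteriors $\gamma^r$ and $\xi^r$ have already been computed (for example inside the single-sequence sketch above); the function name and argument layout are illustrative, not a library API.

```python
import numpy as np

def multi_sequence_update(gammas, xis, Ys, K):
    """Pool the per-sequence statistics gamma^r and xi^r to update (A, B, pi).

    gammas[r] and xis[r] are the posteriors for the r-th sequence Ys[r], e.g. as
    computed inside the single-sequence sketch above (hypothetical helpers).
    """
    R = len(Ys)
    N = gammas[0].shape[1]

    # pi*_i: average posterior probability of starting in state i over all sequences.
    pi = sum(g[0] for g in gammas) / R

    # a*_ij: pooled expected transitions i -> j over pooled expected visits to i.
    A = sum(x.sum(axis=0) for x in xis)
    A = A / sum(g[:-1].sum(axis=0) for g in gammas)[:, None]

    # b*_i(v_k): pooled expected emissions of symbol k from state i.
    B = np.zeros((N, K))
    for g, Y in zip(gammas, Ys):
        Y = np.asarray(Y)
        for k in range(K):
            B[:, k] += g[Y == k].sum(axis=0)
    B = B / sum(g.sum(axis=0) for g in gammas)[:, None]

    return A, B, pi
```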
Example
Suppose we have a chicken from which we collect eggs at noon every day. Now whether or not the chicken has laid eggs for collection depends on some unknown factors that are hidden. We can however (for simplicity) assume that the chicken is always in one of two states, S1 and S2, that influence whether the chicken lays eggs, and that this state only depends on the state on the previous day. We do not know the state at the initial starting point, we do not know the transition probabilities between the two states, and we do not know the probability that the chicken lays an egg given a particular state.[7][8] To start we first guess the transition and emission matrices.
Initial transition matrix (row: state on day t, column: state on day t + 1):

| | S1 | S2 |
|---|---|---|
| S1 | 0.5 | 0.5 |
| S2 | 0.3 | 0.7 |

Initial emission matrix:

| | No eggs (N) | Eggs (E) |
|---|---|---|
| S1 | 0.3 | 0.7 |
| S2 | 0.8 | 0.2 |

Initial state distribution:

| S1 | S2 |
|---|---|
| 0.2 | 0.8 |
We then take a set of observations (E = eggs, N = no eggs): N, N, N, N, N, E, E, N, N, N
This gives us a set of observed transitions between days: NN, NN, NN, NN, NE, EE, EN, NN, NN
The next step is to estimate a new transition matrix. For example, the probability of the sequence NN with the state being S1 then S2 is the initial probability of S1, times the probability of observing N from S1, times the S1 to S2 transition probability, times the probability of observing N from S2, i.e. $0.2 \times 0.3 \times 0.5 \times 0.8 = 0.024$. This is computed for every observed pair, as shown in the following table.
| Observed sequence | Probability of observing that sequence if the state is S1 then S2 | Highest probability of observing that sequence | Most likely state sequence |
|---|---|---|---|
| NN | 0.024 = 0.2 × 0.3 × 0.5 × 0.8 | 0.3584 | S2, S2 |
| NN | 0.024 = 0.2 × 0.3 × 0.5 × 0.8 | 0.3584 | S2, S2 |
| NN | 0.024 = 0.2 × 0.3 × 0.5 × 0.8 | 0.3584 | S2, S2 |
| NN | 0.024 = 0.2 × 0.3 × 0.5 × 0.8 | 0.3584 | S2, S2 |
| NE | 0.006 = 0.2 × 0.3 × 0.5 × 0.2 | 0.1344 | S2, S1 |
| EE | 0.014 = 0.2 × 0.7 × 0.5 × 0.2 | 0.0490 | S1, S1 |
| EN | 0.056 = 0.2 × 0.7 × 0.5 × 0.8 | 0.0896 | S2, S2 |
| NN | 0.024 = 0.2 × 0.3 × 0.5 × 0.8 | 0.3584 | S2, S2 |
| NN | 0.024 = 0.2 × 0.3 × 0.5 × 0.8 | 0.3584 | S2, S2 |
| Total | 0.22 | 2.4234 | |
Thus the new estimate for the S1 to S2 transition is now $\frac{0.22}{2.4234} \approx 0.0908$ (referred to as "pseudo probabilities" in the following tables). We then calculate the S2 to S1, S2 to S2 and S1 to S1 transition probabilities in the same way and normalize so that each row adds to 1. This gives us the updated transition matrix:
Pseudo probabilities:

| | S1 | S2 |
|---|---|---|
| S1 | 0.0598 | 0.0908 |
| S2 | 0.2179 | 0.9705 |

After normalizing each row:

| | S1 | S2 |
|---|---|---|
| S1 | 0.3973 | 0.6027 |
| S2 | 0.1833 | 0.8167 |
Next, we want to estimate a new emission matrix,
| Observed sequence | Highest probability of observing that sequence if E is assumed to come from S1 | Most likely state sequence | Highest probability of observing that sequence | Most likely state sequence |
|---|---|---|---|---|
| NE | 0.1344 | S2, S1 | 0.1344 | S2, S1 |
| EE | 0.0490 | S1, S1 | 0.0490 | S1, S1 |
| EN | 0.0560 | S1, S2 | 0.0896 | S2, S2 |
| Total | 0.2394 | | 0.2730 | |
The new estimate for the E coming from S1 emission is now $\frac{0.2394}{0.2730} \approx 0.8769$.
This allows us to calculate the emission matrix as described above in the algorithm, by adding up the probabilities for the respective observed sequences. We then repeat for if N came from S1 and for if N and E came from S2, and normalize, giving the updated emission matrix.
To estimate the initial probabilities we assume all sequences start with the hidden state S1 and calculate the highest probability, and then repeat for S2. Again we then normalize to give an updated initial vector.
Finally we repeat these steps until the resulting probabilities converge satisfactorily.
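For comparison, the sketch below feeds the same observations and initial guesses into the hypothetical `baum_welch` function sketched earlier. Note that the hand calculation above uses a simplified "pseudo probability" shortcut based on the most likely state pairs, whereas the code performs the full expectation step, so the resulting numbers will differ even after a single iteration; the encoding N = 0, E = 1 is an assumption made here.

```python
import numpy as np

# Observations from the example (N = no eggs = 0, E = eggs = 1).
Y = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# Initial guesses from the example tables (states S1, S2).
A0 = np.array([[0.5, 0.5],
               [0.3, 0.7]])        # transition matrix
B0 = np.array([[0.3, 0.7],
               [0.8, 0.2]])        # emission matrix, columns (N, E)
pi0 = np.array([0.2, 0.8])         # initial state distribution

# One full Baum-Welch iteration using the single-sequence sketch defined earlier.
A1, B1, pi1 = baum_welch(Y, A0, B0, pi0, n_iter=1)
print(np.round(A1, 4))
print(np.round(B1, 4))
print(np.round(pi1, 4))
```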
Applications
Speech recognition
Hidden Markov Models were first applied to speech recognition by James K. Baker in 1975.[9]
Cryptanalysis
The Baum–Welch algorithm is often used to estimate the parameters of HMMs when deciphering hidden or noisy information, and is consequently often used in cryptanalysis. In data security, an observer would like to extract information from a data stream without knowing all the parameters of the transmission. This can involve reverse engineering a channel encoder.[12] HMMs, and as a consequence the Baum–Welch algorithm, have also been used to identify spoken phrases in encrypted VoIP calls.[13] In addition, HMM cryptanalysis is an important tool for automated investigations of cache-timing data. It allows for the automatic discovery of critical algorithm state, for example key values.[14]
Applications in bioinformatics
Finding genes
Prokaryotic
The GLIMMER (Gene Locator and Interpolated Markov ModelER) software is a gene-finding program used to identify coding regions in prokaryotic DNA.
Eukaryotic
The GENSCAN program is a gene locator that predicts the locations and exon–intron structures of genes in eukaryotic genomic sequences using a generalized hidden Markov model.
Copy-number variation detection
Implementations
- Accord.NET in C#
- ghmm C library with Python bindings that supports both discrete and continuous emissions.
- Jajapy Python library that implements Baum–Welch on various kinds of Markov models (HMM, MC, MDP, CTMC).
- HiddenMarkovModels.jl package for Julia.
- HMMFit function in the RHmm package for R.
- hmmtrain in MATLAB
- rustbio in Rust
See also
- Viterbi algorithm
- Hidden Markov model
- EM algorithm
- Maximum likelihood
- Speech recognition
- Bioinformatics
- Cryptanalysis
References
- ^ Rabiner, Lawrence. "First Hand: The Hidden Markov Model". IEEE Global History Network. Retrieved 2 October 2013.
- .
- PMID 3641921.
- ISBN 978-0-521-62041-3.
- ^ Bilmes, Jeff A. (1998). A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Berkeley, CA: International Computer Science Institute. pp. 7–13.
- ^ Rabiner, Lawrence (February 1989). "A Tutorial on Hidden Markov Models and Selected Applications in Speech recognition" (PDF). Proceedings of the IEEE. Retrieved 29 November 2019.
- ^ "Baum-Welch and HMM applications" (PDF). Johns Hopkins Bloomberg School of Public Health. Retrieved 11 October 2019.
- ^ Frazzoli, Emilio. "Intro to Hidden Markov Models: the Baum-Welch Algorithm" (PDF). Aeronautics and Astronautics, Massachusetts Institute of Technology. Retrieved 2 October 2013.
- .
- S2CID 13618539.
- ^ Tokuda, Keiichi; Yoshimura, Takayoshi; Masuko, Takashi; Kobayashi, Takao; Kitamura, Tadashi (2000). "Speech Parameter Generation Algorithms for HMM-Based Speech Synthesis". IEEE International Conference on Acoustics, Speech, and Signal Processing. 3.
- ^ Dingel, Janis; Hagenauer, Joachim (24 June 2007). "Parameter Estimation of a Convolutional Encoder from Noisy Observations". IEEE International Symposium on Information Theory.
- ^ Wright, Charles; Ballard, Lucas; Coull, Scott; Monrose, Fabian; Masson, Gerald (2008). "Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations". IEEE International Symposium on Security and Privacy.
- ISBN 978-3-642-10365-0.
- PMID 9421513.
- ^ "Glimmer: Microbial Gene-Finding System". Johns Hopkins University - Center for Computational Biology.
- PMID 17237039.
- ^ Burge, Christopher. "The GENSCAN Web Server at MIT". Archived from the original on 6 September 2013. Retrieved 2 October 2013.
- PMID 9149143.
- PMID 9666331.
- PMID 17551006.
External links
- A comprehensive review of HMM methods and software in bioinformatics – Profile Hidden Markov Models
- Early HMM publications by Baum:
- A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains
- An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology
- Statistical Inference for Probabilistic Functions of Finite State Markov Chains
- The Shannon Lecture by Welch, which speaks to how the algorithm can be implemented efficiently:
- Hidden Markov Models and the Baum–Welch Algorithm, IEEE Information Theory Society Newsletter, Dec. 2003.
- An alternative to the Baum–Welch algorithm, the Viterbi Path Counting algorithm:
- Davis, Richard I. A.; Lovell, Brian C.; "Comparing and evaluating HMM ensemble training algorithms using train and test and condition number criteria", Pattern Analysis and Applications, vol. 6, no. 4, pp. 327–336, 2003.
- An Interactive Spreadsheet for Teaching the Forward-Backward Algorithm (spreadsheet and article with step-by-step walkthrough)
- Formal derivation of the Baum–Welch algorithm Archived 2012-02-28 at the Wayback Machine
- Implementation of the Baum–Welch algorithm