Mixture model
In
Mixture models should not be confused with models for compositional data, i.e., data whose components are constrained to sum to a constant value (1, 100%, etc.). However, compositional models can be thought of as mixture models, where members of the population are sampled at random. Conversely, mixture models can be thought of as compositional models, where the total size reading population has been normalized to 1.
Structure
General mixture model
A typical finite-dimensional mixture model is a
- N random variables that are observed, each distributed according to a mixture of K components, with the components belonging to the same parametric family of distributions (e.g., all normal, all Zipfian, etc.) but with different parameters
- N random latent variables specifying the identity of the mixture component of each observation, each distributed according to a K-dimensional categorical distribution
- A set of K mixture weights, which are probabilities that sum to 1.
- A set of K parameters, each specifying the parameter of the corresponding mixture component. In many cases, each "parameter" is actually a set of parameters. For example, if the mixture components are Gaussian distributions, there will be a mean and variance for each component. If the mixture components are categorical distributions(e.g., when each observation is a token from a finite alphabet of size V), there will be a vector of V probabilities summing to 1.
In addition, in a
Mathematically, a basic parametric mixture model can be described as follows:
In a Bayesian setting, all parameters are associated with random variables, as follows:
This characterization uses F and H to describe arbitrary distributions over observations and parameters, respectively. Typically H will be the
- Binomial distribution, for the number of "positive occurrences" (e.g., successes, yes votes, etc.) given a fixed number of total occurrences
- Multinomial distribution, similar to the binomial distribution, but for counts of multi-way occurrences (e.g., yes/no/maybe in a survey)
- Negative binomial distribution, for binomial-type observations but where the quantity of interest is the number of failures before a given number of successes occurs
- Poisson distribution, for the number of occurrences of an event in a given period of time, for an event that is characterized by a fixed rate of occurrence
- Exponential distribution, for the time before the next event occurs, for an event that is characterized by a fixed rate of occurrence
- Log-normal distribution, for positive real numbers that are assumed to grow exponentially, such as incomes or prices
- Multivariate normal distribution (aka multivariate Gaussian distribution), for vectors of correlated outcomes that are individually Gaussian-distributed
- Multivariate Student's t-distribution, for vectors of heavy-tailed correlated outcomes[1]
- A vector of Bernoulli-distributed values, corresponding, e.g., to a black-and-white image, with each value representing a pixel; see the handwriting-recognition example below
Specific examples
Gaussian mixture model
A typical non-Bayesian
A Bayesian version of a
Multivariate Gaussian mixture model
A Bayesian Gaussian mixture model is commonly extended to fit a vector of unknown parameters (denoted in bold), or multivariate normal distributions. In a multivariate distribution (i.e. one modelling a vector with N random variables) one may model a vector of parameters (such as several observations of a signal or patches within an image) using a Gaussian mixture model prior distribution on the vector of estimates given by
where the ith vector component is characterized by normal distributions with weights , means and covariance matrices . To incorporate this prior into a Bayesian estimation, the prior is multiplied with the known distribution of the data conditioned on the parameters to be estimated. With this formulation, the posterior distribution is also a Gaussian mixture model of the form
with new parameters and that are updated using the
Such distributions are useful for assuming patch-wise shapes of images and clusters, for example. In the case of image representation, each Gaussian may be tilted, expanded, and warped according to the covariance matrices . One Gaussian distribution of the set is fit to each patch (usually of size 8x8 pixels) in the image. Notably, any distribution of points around a cluster (see k-means) may be accurately given enough Gaussian components, but scarcely over K=20 components are needed to accurately model a given image distribution or cluster of data.
Categorical mixture model
A typical non-Bayesian mixture model with categorical observations looks like this:
- as above
- as above
- as above
- dimension of categorical observations, e.g., size of word vocabulary
- probability for component of observing item
- vector of dimension composed of must sum to 1
The random variables:
A typical Bayesian mixture model with categorical observations looks like this:
- as above
- as above
- as above
- dimension of categorical observations, e.g., size of word vocabulary
- probability for component of observing item
- vector of dimension composed of must sum to 1
- shared concentration hyperparameter of for each component
- concentration hyperparameter of
The random variables:
Examples
A financial model
Financial returns often behave differently in normal situations and during crisis times. A mixture model
House prices
Assume that we observe the prices of N different houses. Different types of houses in different neighborhoods will have vastly different prices, but the price of a particular type of house in a particular neighborhood (e.g., three-bedroom house in moderately upscale neighborhood) will tend to cluster fairly closely around the mean. One possible model of such prices would be to assume that the prices are accurately described by a mixture model with K different components, each distributed as a
Topics in a document
Assume that a document is composed of N different words from a total vocabulary of size V, where each word corresponds to one of K possible topics. The distribution of such words could be modelled as a mixture of K different V-dimensional
- A prior distribution is placed over the parameters describing the topic distributions, using a Dirichlet distribution with a concentration parameterthat is set significantly below 1, so as to encourage sparse distributions (where only a small number of words have significantly non-zero probabilities).
- Some sort of additional constraint is placed over the topic identities of words, to take advantage of natural clustering.
- For example, a prior distributionis placed over state transitions that favors transitions that stay in the same state.)
- Another possibility is the latent Dirichlet allocation model, which divides up the words into D different documents and assumes that in each document only a small number of topics occur with any frequency.
- For example, a
Handwriting recognition
The following example is based on an example in
Imagine that we are given an N×N black-and-white image that is known to be a scan of a hand-written digit between 0 and 9, but we don't know which digit is written. We can create a mixture model with different components, where each component is a vector of size of
Assessing projectile accuracy (a.k.a. circular error probable, CEP)
Mixture models apply in the problem of directing multiple projectiles at a target (as in air, land, or sea defense applications), where the physical and/or statistical characteristics of the projectiles differ within the multiple projectiles. An example might be shots from multiple munitions types or shots from multiple locations directed at one target. The combination of projectile types may be characterized as a Gaussian mixture model.[5] Further, a well-known measure of accuracy for a group of projectiles is the circular error probable (CEP), which is the number R such that, on average, half of the group of projectiles falls within the circle of radius R about the target point. The mixture model can be used to determine (or estimate) the value R. The mixture model properly captures the different types of projectiles.
Direct and indirect applications
The financial example above is one direct application of the mixture model, a situation in which we assume an underlying mechanism so that each observation belongs to one of some number of different sources or categories. This underlying mechanism may or may not, however, be observable. In this form of mixture, each of the sources is described by a component probability density function, and its mixture weight is the probability that an observation comes from this component.
In an indirect application of the mixture model we do not assume such a mechanism. The mixture model is simply used for its mathematical flexibilities. For example, a mixture of two
Predictive Maintenance
The mixture model-based clustering is also predominantly used in identifying the state of the machine in predictive maintenance. Density plots are used to analyze the density of high dimensional features. If multi-model densities are observed, then it is assumed that a finite set of densities are formed by a finite set of normal mixtures. A multivariate Gaussian mixture model is used to cluster the feature data into k number of groups where k represents each state of the machine. The machine state can be a normal state, power off state, or faulty state.[6] Each formed cluster can be diagnosed using techniques such as spectral analysis. In the recent years, this has also been widely used in other areas such as early fault detection.[7]
Fuzzy image segmentation
In image processing and computer vision, traditional image segmentation models often assign to one pixel only one exclusive pattern. In fuzzy or soft segmentation, any pattern can have certain "ownership" over any single pixel. If the patterns are Gaussian, fuzzy segmentation naturally results in Gaussian mixtures. Combined with other analytic or geometric tools (e.g., phase transitions over diffusive boundaries), such spatially regularized mixture models could lead to more realistic and computationally efficient segmentation methods.[8]
Point set registration
Probabilistic mixture models such as
Identifiability
Identifiability refers to the existence of a unique characterization for any one of the models in the class (family) being considered. Estimation procedures may not be well-defined and asymptotic theory may not hold if a model is not identifiable.
Example
Let J be the class of all binomial distributions with n = 2. Then a mixture of two members of J would have
and p2 = 1 − p0 − p1. Clearly, given p0 and p1, it is not possible to determine the above mixture model uniquely, as there are three parameters (π, θ1, θ2) to be determined.
Definition
Consider a mixture of parametric distributions of the same class. Let
be the class of all component distributions. Then the convex hull K of J defines the class of all finite mixture of distributions in J:
K is said to be identifiable if all its members are unique, that is, given two members p and p′ in K, being mixtures of k distributions and k′ distributions respectively in J, we have p = p′ if and only if, first of all, k = k′ and secondly we can reorder the summations such that ai = ai′ and ƒi = ƒi′ for all i.
Parameter estimation and system identification
Parametric mixture models are often used when we know the distribution Y and we can sample from X, but we would like to determine the ai and θi values. Such situations can arise in studies in which we sample from a population that is composed of several distinct subpopulations.
It is common to think of probability mixture modeling as a missing data problem. One way to understand this is to assume that the data points under consideration have "membership" in one of the distributions we are using to model the data. When we start, this membership is unknown, or missing. The job of estimation is to devise appropriate parameters for the model functions we choose, with the connection to the data points being represented as their membership in the individual model distributions.
A variety of approaches to the problem of mixture decomposition have been proposed, many of which focus on maximum likelihood methods such as
Expectation maximization (EM)
with the posterior probabilities
Thus on the basis of the current estimate for the parameters, the conditional probability for a given observation x(t) being generated from state s is determined for each t = 1, …, N ; N being the sample size. The parameters are then updated such that the new component weights correspond to the average conditional probability and each component mean and covariance is the component specific weighted average of the mean and covariance of the entire sample.
Dempster[15] also showed that each successive EM iteration will not decrease the likelihood, a property not shared by other gradient based maximization techniques. Moreover, EM naturally embeds within it constraints on the probability vector, and for sufficiently large sample sizes positive definiteness of the covariance iterates. This is a key advantage since explicitly constrained methods incur extra computational costs to check and maintain appropriate values. Theoretically EM is a first-order algorithm and as such converges slowly to a fixed-point solution. Redner and Walker (1984)[full citation needed] make this point arguing in favour of superlinear and second order Newton and quasi-Newton methods and reporting slow convergence in EM on the basis of their empirical tests. They do concede that convergence in likelihood was rapid even if convergence in the parameter values themselves was not. The relative merits of EM and other algorithms vis-à-vis convergence have been discussed in other literature.[16]
Other common objections to the use of EM are that it has a propensity to spuriously identify local maxima, as well as displaying sensitivity to initial values.[17][18] One may address these problems by evaluating EM at several initial points in the parameter space but this is computationally costly and other approaches, such as the annealing EM method of Udea and Nakano (1998) (in which the initial components are essentially forced to overlap, providing a less heterogeneous basis for initial guesses), may be preferable.
Figueiredo and Jain[13] note that convergence to 'meaningless' parameter values obtained at the boundary (where regularity conditions breakdown, e.g., Ghosh and Sen (1985)) is frequently observed when the number of model components exceeds the optimal/true one. On this basis they suggest a unified approach to estimation and identification in which the initial n is chosen to greatly exceed the expected optimal value. Their optimization routine is constructed via a minimum message length (MML) criterion that effectively eliminates a candidate component if there is insufficient information to support it. In this way it is possible to systematize reductions in n and consider estimation and identification jointly.
The expectation step
With initial guesses for the parameters of our mixture model, "partial membership" of each data point in each constituent distribution is computed by calculating
The maximization step
With expectation values in hand for group membership, plug-in estimates are recomputed for the distribution parameters.
The mixing coefficients ai are the means of the membership values over the N data points.
The component model parameters θi are also calculated by expectation maximization using data points xj that have been weighted using the membership values. For example, if θ is a mean μ
With new estimates for ai and the θi's, the expectation step is repeated to recompute new membership values. The entire procedure is repeated until model parameters converge.
Markov chain Monte Carlo
As an alternative to the EM algorithm, the mixture model parameters can be deduced using posterior sampling as indicated by Bayes' theorem. This is still regarded as an incomplete data problem whereby membership of data points is the missing data. A two-step iterative procedure known as Gibbs sampling can be used.
The previous example of a mixture of two
Moment matching
The method of moment matching is one of the oldest techniques for determining the mixture parameters dating back to Karl Pearson's seminal work of 1894. In this approach the parameters of the mixture are determined such that the composite distribution has moments matching some given value. In many instances extraction of solutions to the moment equations may present non-trivial algebraic or computational problems. Moreover, numerical analysis by Day[19] has indicated that such methods may be inefficient compared to EM. Nonetheless, there has been renewed interest in this method, e.g., Craigmile and Titterington (1998) and Wang.[20]
McWilliam and Loh (2009) consider the characterisation of a hyper-cuboid normal mixture
Spectral method
Some problems in mixture model estimation can be solved using spectral methods. In particular it becomes useful if data points xi are points in high-dimensional
Spectral methods of learning mixture models are based on the use of
One distinctive feature of the spectral method is that it allows us to prove that if distributions satisfy certain separation condition (e.g., not too close), then the estimated mixture will be very close to the true one with high probability.
Graphical Methods
Tarter and Lock[12] describe a graphical approach to mixture identification in which a kernel function is applied to an empirical frequency plot so to reduce intra-component variance. In this way one may more readily identify components having differing means. While this λ-method does not require prior knowledge of the number or functional form of the components its success does rely on the choice of the kernel parameters which to some extent implicitly embeds assumptions about the component structure.
Other methods
Some of them can even probably learn mixtures of heavy-tailed distributions including those with infinite variance (see links to papers below). In this setting, EM based methods would not work, since the Expectation step would diverge due to presence of outliers.
A simulation
To simulate a sample of size N that is from a mixture of distributions Fi, i=1 to n, with probabilities pi (sum= pi = 1):
- Generate N random numbers from a categorical distribution of size n and probabilities pi for i= 1= to n. These tell you which of the Fi each of the N values will come from. Denote by mi the quantity of random numbers assigned to the ith category.
- For each i, generate mi random numbers from the Fi distribution.
Extensions
In a Bayesian setting, additional levels can be added to the graphical model defining the mixture model. For example, in the common latent Dirichlet allocation topic model, the observations are sets of words drawn from D different documents and the K mixture components represent topics that are shared across documents. Each document has a different set of mixture weights, which specify the topics prevalent in that document. All sets of mixture weights share common hyperparameters.
A very common extension is to connect the
History
Mixture distributions and the problem of mixture decomposition, that is the identification of its constituent components and the parameters thereof, has been cited in the literature as far back as 1846 (Quetelet in McLachlan,
While his work was successful in identifying two potentially distinct sub-populations and in demonstrating the flexibility of mixtures as a moment matching tool, the formulation required the solution of a 9th degree (nonic) polynomial which at the time posed a significant computational challenge.
Subsequent works focused on addressing these problems, but it was not until the advent of the modern computer and the popularisation of
See also
Mixture
Hierarchical models
- Graphical model
- Hierarchical Bayes model
Outlier detection
- RANSAC
References
- S2CID 15583243.
- ^
Yu, Guoshen (2012). "Solving Inverse Problems with Piecewise Linear Estimators: From Gaussian Mixture Models to Structured Sparsity". IEEE Transactions on Image Processing. 21 (5): 2481–2499. S2CID 479845.
- ^ Dinov, ID. "Expectation Maximization and Mixture Modeling Tutorial". California Digital Library, Statistics Online Computational Resource, Paper EM_MM, http://repositories.cdlib.org/socr/EM_MM, December 9, 2008
- ISBN 978-0-387-31073-2.
- JSTOR 2290205
- .
- .
- ^
Shen, Jianhong (Jackie) (2006). "A stochastic-variational model for soft Mumford-Shah segmentation". International Journal of Biomedical Imaging. 2006: 2–16. PMID 23165059.
- ^
Myronenko, Andriy; Song, Xubo (2010). "Point set registration: Coherent point drift". IEEE Trans. Pattern Anal. Mach. Intell. 32 (12): 2262–2275. S2CID 10809031.
- ^
Ravikumar, Nishant; Gooya, Ali; Cimen, Serkan; Frangi, Alexjandro; Taylor, Zeike (2018). "Group-wise similarity registration of point sets using Student's t-mixture model for statistical shape models". Med. Image Anal. 44: 156–176. PMID 29248842.
- ^ Bayer, Siming; Ravikumar, Nishant; Strumia, Maddalena; Tong, Xiaoguang; Gao, Ying; Ostermeier, Martin; Fahrig, Rebecca; Maier, Andreas (2018). "Intraoperative brain shift compensation using a hybrid mixture model". Medical Image Computing and Computer Assisted Intervention – MICCAI 2018. Granada, Spain: Springer, Cham. pp. 116–124. .
- ^ a b c Tarter, Michael E. (1993), Model Free Curve Estimation, Chapman and Hall
- ^ .
- ^ McWilliam, N.; Loh, K. (2008), Incorporating Multidimensional Tail-Dependencies in the Valuation of Credit Derivatives (Working Paper) [1]
- ^ JSTOR 2984875.
- S2CID 207714252.
- ^ a b McLachlan, G.J. (2000), Finite Mixture Models, Wiley
- S2CID 6880171.
- JSTOR 2334652.
- ^ Wang, J. (2001), "Generating daily changes in market variables using a multivariate mixture of normal distributions", Proceedings of the 33rd Winter Conference on Simulation: 283–289
- S2CID 88515304.
- Bibcode:1988mmia.book.....M
- ^ Titterington, Smith & Makov 1985
Further reading
Books on mixture models
- Everitt, B.S.; Hand, D.J. (1981). Finite mixture distributions. Chapman & Hall. ISBN 978-0-412-22420-1.
- Lindsay, B. G. (1995). Mixture Models: Theory, Geometry, and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics. Vol. 5. Hayward: Institute of Mathematical Statistics.
- Marin, J.M.; ISBN 9780444537324.
- McLachlan, G.J.; Peel, D. (2000). Finite Mixture Models. Wiley. ISBN 978-0-471-00626-8.
- Press, WH; Teukolsky, SA; Vetterling, WT; Flannery, BP (2007). "Section 16.1. Gaussian Mixture Models and k-Means Clustering". Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge University Press. ISBN 978-0-521-88068-8.
- Titterington, D.; Smith, A.; Makov, U. (1985). Statistical Analysis of Finite Mixture Distributions. Wiley. ISBN 978-0-471-90763-3.
Application of Gaussian mixture models
- Reynolds, D.A.; Rose, R.C. (January 1995). "Robust text-independent speaker identification using Gaussian mixture speaker models". IEEE Transactions on Speech and Audio Processing. 3 (1): 72–83. S2CID 7319345.
- Permuter, H.; Francos, J.; Jermyn, I.H. (2003). Gaussian mixture models of texture and colour for image database retrieval. IEEE .
- Permuter, Haim; Francos, Joseph; Jermyn, Ian (2006). "A study of Gaussian mixture models of color and texture features for image classification and segmentation" (PDF). Pattern Recognition. 39 (4): 695–706. S2CID 8530776.
- Permuter, Haim; Francos, Joseph; Jermyn, Ian (2006). "A study of Gaussian mixture models of color and texture features for image classification and segmentation" (PDF). Pattern Recognition. 39 (4): 695–706.
- Lemke, Wolfgang (2005). Term Structure Modeling and Estimation in a State Space Framework. Springer Verlag. ISBN 978-3-540-28342-3.
- Brigo, Damiano; Mercurio, Fabio (2001). Displaced and Mixture Diffusions for Analytically-Tractable Smile Models. Mathematical Finance – Bachelier Congress 2000. Proceedings. Springer Verlag.
- Brigo, Damiano; Mercurio, Fabio (June 2002). "Lognormal-mixture dynamics and calibration to market volatility smiles". International Journal of Theoretical and Applied Finance. 5 (4): 427. .
- Spall, J. C.; Maryak, J. L. (1992). "A feasible Bayesian estimator of quantiles for projectile accuracy from non-i.i.d. data". Journal of the American Statistical Association. 87 (419): 676–681. JSTOR 2290205.
- Alexander, Carol (December 2004). "Normal mixture diffusion with uncertain volatility: Modelling short- and long-term smile effects" (PDF). Journal of Banking & Finance. 28 (12): 2957–80. .
- Stylianou, Yannis; Pantazis, Yannis; Calderero, Felipe; Larroy, Pedro; Severin, Francois; Schimke, Sascha; Bonal, Rolando; Matta, Federico; Valsamakis, Athanasios (2005). GMM-Based Multimodal Biometric Verification (PDF).
- Chen, J.; Adebomi, 0.E.; Olusayo, O.S.; Kulesza, W. (2010). The Evaluation of the Gaussian Mixture Probability Hypothesis Density approach for multi-target tracking. IEEE doi:10.1109/IST.2010.5548541.)
{{cite conference}}
: CS1 maint: numeric names: authors list (link
External links
- Nielsen, Frank (23 March 2012). "K-MLE: A fast algorithm for learning statistical mixture models". 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 869–872. S2CID 935615.
- The SOCR demonstrations of EM and Mixture Modeling
- Mixture modelling page (and the Snob program for MML) applied to finite mixture models), maintained by D.L. Dowe.
- PyMix – Python Mixture Package, algorithms and data structures for a broad variety of mixture model based data mining applications in Python
- sklearn.mixture – A module from the scikit-learn Python library for learning Gaussian Mixture Models (and sampling from them), previously packaged with SciPy and now packaged as a SciKit
- GMM.m Matlab code for GMM Implementation
- GPUmix C++ implementation of Bayesian Mixture Models using EM and MCMC with 100x speed acceleration using GPGPU.
- [2] Matlab code for GMM Implementation using EM algorithm
- [3] jMEF: A Java open source library for learning and processing mixtures of exponential families (using duality with Bregman divergences). Includes a Matlab wrapper.
- Very Fast and clean C implementation of the Expectation Maximization (EM) algorithm for estimating Gaussian Mixture Models (GMMs).
- mclust is an R package for mixture modeling.
- dpgmm Pure Python Dirichlet process Gaussian mixture model implementation (variational).
- Gaussian Mixture Models Blog post on Gaussian Mixture Models trained via Expectation Maximization, with an implementation in Python.