An Introduction to Mixture Models

Keywords: mixture models

From Supervised to Unsupervised Learning[1]

We have discussed many machine learning algorithms so far, including linear regression, linear classification, and neural network models. However, most of them are supervised learning methods, which means a teacher is guiding the model toward a certain task. For those problems our main attention was on the probability distribution of the parameters given the inputs, outputs, and model:

$$p(\boldsymbol{\theta}|\boldsymbol{x},\boldsymbol{y},M)\tag{1}$$
where $\boldsymbol{\theta}$ is the vector of parameters of the model $M$, and $\boldsymbol{x}$, $\boldsymbol{y}$ are the input and output vectors respectively. As Bayes' theorem tells us, equation (1) can be maximized indirectly, and maximum likelihood is the method usually employed for this. The probability

$$p(\boldsymbol{y}|\boldsymbol{x},\boldsymbol{\theta},M)\tag{2}$$
is the key component of that method. More details about the maximum likelihood method can be found in Maximum Likelihood Estimation.

In today’s discussion, another probability comes to mind. If we have no information about the classes of the data points in the training set, that is to say, there is no teacher in the training stage, then all we have to work with is:

$$p(\boldsymbol{x})\tag{3}$$
Our task can no longer be called classification or regression. It is referred to as ‘clustering’, the process of identifying which group a data point belongs to. What we have here is just a set of training points $\boldsymbol{x}$ and the probability $p(\boldsymbol{x})$. Although this probability is over a single random variable, it can be arbitrarily complex. Sometimes, bringing in another random variable as an assistant and combining the two yields a more tractable distribution. That is to say, the joint distribution of the observed variable $\boldsymbol{x}$ and a newly created random variable $\boldsymbol{z}$ can be clearer than the original distribution of $\boldsymbol{x}$, and under this construction the conditional distribution of $\boldsymbol{x}$ given $\boldsymbol{z}$ is often very clear, too.

Let’s have a look at a very simple example. $x$ has the distribution:

$$p(x)=a\cdot\exp(-\frac{(x-\mu_1)^2}{\delta_1})+b\cdot\exp(-\frac{(x-\mu_2)^2}{\delta_2})\tag{4}$$
where $a$ and $b$ are coefficients, $\mu_1$ and $\mu_2$ are the means of the two Gaussian components, and $\delta_1$ and $\delta_2$ are their variance parameters.
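As a quick numerical sketch, equation (4) can be evaluated directly. The parameter values for $a$, $b$, $\mu_1$, $\mu_2$, $\delta_1$, $\delta_2$ below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical parameters for the two-component mixture in equation (4).
a, b = 0.4, 0.6            # mixing coefficients (assumed values)
mu1, mu2 = -2.0, 3.0       # component means (assumed values)
delta1, delta2 = 1.0, 2.0  # component scale parameters (assumed values)

def mixture_density(x):
    """Two-component density of equation (4)."""
    comp1 = a * np.exp(-(x - mu1) ** 2 / delta1)
    comp2 = b * np.exp(-(x - mu2) ** 2 / delta2)
    return comp1 + comp2

x = np.linspace(-6.0, 8.0, 200)
p = mixture_density(x)
# The density peaks near the two component means mu1 and mu2.
```

Near $\mu_1$ the second component is negligible, so the density there is approximately $a$; symmetrically it is approximately $b$ near $\mu_2$.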
(Figure: the two-component mixture density $p(x)$ of equation (4).)

The only random variable in equation (4) is $x$. Now we introduce another random variable $z\in \{0,1\}$ as a helper, where $z$ has a uniform distribution. Then distribution (4) can be rewritten in joint form:

$$p(x,z)=z\cdot a\cdot\exp(-\frac{(x-\mu_1)^2}{\delta_1})+(1-z)\cdot b\cdot\exp(-\frac{(x-\mu_2)^2}{\delta_2})\tag{5}$$

Since $z$ is a discrete random variable with a uniform distribution, $p(z=0)=p(z=1)=0.5$, and the conditional distribution $p(x|z=1)$ is

$$p(x|z=1)=\frac{p(x,z=1)}{p(z=1)}=2a\cdot\exp(-\frac{(x-\mu_1)^2}{\delta_1})\tag{6}$$
(Figure: the conditional density $p(x|z=1)$, a single Gaussian centered at $\mu_1$.)

And the conditional distribution $p(x|z=0)$ is

$$p(x|z=0)=\frac{p(x,z=0)}{p(z=0)}=2b\cdot\exp(-\frac{(x-\mu_2)^2}{\delta_2})\tag{7}$$
(Figure: the conditional density $p(x|z=0)$, a single Gaussian centered at $\mu_2$.)

And the marginal distribution of $x$, obtained by summing equation (5) over $z$, is still equation (4).

(Figure: a 3D view of the joint density $p(x,z)$.)
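That summing the joint in equation (5) over both values of $z$ recovers the marginal in equation (4) can be checked numerically. The sketch below reuses the same hypothetical parameter values as before:

```python
import numpy as np

# Hypothetical parameters, as in the earlier sketch.
a, b = 0.4, 0.6
mu1, mu2 = -2.0, 3.0
delta1, delta2 = 1.0, 2.0

def joint(x, z):
    """Joint density p(x, z) of equation (5); z is 0 or 1."""
    return (z * a * np.exp(-(x - mu1) ** 2 / delta1)
            + (1 - z) * b * np.exp(-(x - mu2) ** 2 / delta2))

def marginal(x):
    """Marginal of x: sum the joint over both values of z."""
    return joint(x, 0) + joint(x, 1)

x = np.linspace(-6.0, 8.0, 200)
# Equation (4), evaluated directly for comparison.
direct = (a * np.exp(-(x - mu1) ** 2 / delta1)
          + b * np.exp(-(x - mu2) ** 2 / delta2))
# marginal(x) agrees with the direct evaluation of equation (4) pointwise.
```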

So the conditional distribution of $x$ given $z$ is just a single-variable Gaussian model, which is much simpler to deal with. And so is the marginal $p(x)$, obtained by summing over all values of $z$ (the sum rule for computing a marginal distribution). Here the created random variable $z$ is called a latent variable. It can be considered an auxiliary input, but it can also be regarded as a kind of parameter of the model (we will discuss the details later). The example above is the simplest Gaussian mixture, which is widely used in machine learning, statistics, and other fields. Here the latent variable $z$ is discrete; continuous latent variables will be introduced later as well.
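One practical payoff of the latent-variable view is ancestral sampling: first draw $z$ from its uniform distribution, then draw $x$ from the selected Gaussian component. The sketch below assumes each component is a proper Gaussian whose density is proportional to $\exp(-(x-\mu_k)^2/\delta_k)$, i.e. with variance $\delta_k/2$; for convenience $z$ directly indexes the component arrays (a relabeling of the $z$ in equation (5)), and all parameter values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical component parameters; delta_k is the scale in
# exp(-(x - mu_k)^2 / delta_k), so the component variance is delta_k / 2.
mu = np.array([-2.0, 3.0])
delta = np.array([1.0, 2.0])

def sample_mixture(n):
    """Ancestral sampling: draw z uniformly, then x | z from a Gaussian."""
    z = rng.integers(0, 2, size=n)        # p(z=0) = p(z=1) = 0.5
    std = np.sqrt(delta[z] / 2.0)         # per-sample standard deviations
    x = rng.normal(loc=mu[z], scale=std)  # x | z ~ N(mu_z, delta_z / 2)
    return x, z

x, z = sample_mixture(10_000)
# Samples with z = 0 cluster around mu[0]; samples with z = 1 around mu[1].
```

A histogram of `x` would show the same two-bump shape as the density $p(x)$, which is exactly why mixture models are a natural tool for clustering.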

Mixture distributions can be used to cluster data, and what we are going to study is:

  1. A nonprobabilistic version: the K-means algorithm
  2. Discrete latent variables and a related algorithm known as the EM algorithm


  1. Bishop, Christopher M. Pattern recognition and machine learning. springer, 2006. ↩︎