Keywords: mixture models
From Supervised to Unsupervised Learning
We have discussed many machine learning algorithms so far, including linear regression, linear classification, neural network models, and so on. However, most of them are supervised learning methods, which means a teacher leads the model toward a certain task. In those problems our main attention was on the probability distribution of the parameters given the inputs, outputs, and model:

$$P(\theta|X,Y)\tag{1}$$
where $\theta$ is the vector of parameters in the model and $X$, $Y$ are the input vector and output vector respectively. As Bayes' theorem tells us, equation (1) can be maximized indirectly, and the maximum likelihood method is usually employed for this. The probability

$$P(Y|X,\theta)\tag{2}$$
is the key component of the method. More details about the maximum likelihood method can be found in Maximum Likelihood Estimation.
In today’s discussion, another probability comes to mind. Suppose we have no information about the classes of the data points in the training set; that is to say, we have no teacher in the training stage. All we have to concern ourselves with now is:

$$P(X)\tag{3}$$
Our task can no longer be called classification or regression. It is referred to as ‘clustering’, which is the process of identifying which group a data point belongs to. What we have here is just a set of training points and the probability $P(X)$. Although this probability is over one random variable, it can be arbitrarily complex. Sometimes, bringing in another random variable as an assistant and combining the two yields a more tractable distribution. That is to say, the joint distribution of the observed variable $X$ and a newly created random variable $Z$ can be clearer than the original distribution of $X$. And sometimes, under this combination, the conditional distribution of $X$ given $Z$ is very clear, too.
Let’s have a look at a very simple example. $x$ has the distribution:

$$P(x)=\pi_1\mathcal{N}(x|\mu_1,\sigma_1^2)+\pi_2\mathcal{N}(x|\mu_2,\sigma_2^2)\tag{4}$$

where $\pi_1$ and $\pi_2$ are mixing coefficients (with $\pi_1+\pi_2=1$), $\mu_1$ and $\mu_2$ are the means of the Gaussian distributions, and $\sigma_1^2$ and $\sigma_2^2$ are the variances.
It looks like:
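To make the picture concrete, equation (4) can be evaluated numerically. The following is a minimal sketch; the parameter values ($\pi_1=\pi_2=0.5$, $\mu_1=-2$, $\mu_2=2$, $\sigma_1=\sigma_2=1$) are assumed here for illustration only:

```python
import numpy as np

# Illustrative parameters for the two-component mixture in equation (4);
# pi_1 = pi_2 = 1/2 matches the uniform latent variable introduced below.
pi1, pi2 = 0.5, 0.5
mu1, mu2 = -2.0, 2.0
sigma1, sigma2 = 1.0, 1.0

def gaussian_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(x):
    """Two-component Gaussian mixture density, equation (4)."""
    return pi1 * gaussian_pdf(x, mu1, sigma1) + pi2 * gaussian_pdf(x, mu2, sigma2)

# Sanity check: a valid density integrates to 1 over the real line.
xs = np.linspace(-10.0, 10.0, 10001)
area = np.trapz(mixture_pdf(xs), xs)
print(area)
```

Plotting `mixture_pdf` over `xs` reproduces the bimodal shape: two bumps, one around each mean.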
The only random variable in equation (4) is $x$. However, we now introduce another random variable $z$ as a helper, where $z$ has a uniform distribution. Then distribution (4) can be rewritten in a joint form:

$$P(x,z)=P(z)P(x|z)\tag{5}$$
for $z$ is a discrete random variable with a uniform distribution, so $P(z=k)=\frac{1}{2}$ for $k=1,2$ (which corresponds to $\pi_1=\pi_2=\frac{1}{2}$ in equation (4)), and the conditional distribution of $x$ given $z$ is

$$P(x|z=k)=\mathcal{N}(x|\mu_k,\sigma_k^2)\tag{6}$$
And the marginal distribution of $x$,

$$P(x)=\sum_{z}P(z)P(x|z),$$

is still equation (4),
and its 3D visualization is:
So the conditional distribution of $x$ given $z$ is just a single-variable Gaussian model, which is much simpler to deal with. And so is the marginal, which is obtained by summing over all values of $z$ (the rule for computing a marginal distribution). The created random variable $z$ here is called a latent variable. It can be considered an assistant input; however, it can also be considered a kind of parameter of the model (we will discuss the details later). The example above is the simplest Gaussian mixture, which is widely used in machine learning, statistics, and other fields. Here the latent variable $z$ is a discrete random variable; continuous latent variables will be introduced later as well.
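The decomposition also gives a simple way to draw samples from the mixture, often called ancestral sampling: first draw $z$ from its uniform distribution, then draw $x$ from the single Gaussian selected by $z$. A minimal sketch, using the same illustrative means and variances as before:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters: uniform latent z over two components,
# component means -2 and 2, both standard deviations 1.
mus = np.array([-2.0, 2.0])
sigmas = np.array([1.0, 1.0])

n = 100_000
z = rng.integers(0, 2, size=n)      # sample the latent variable: P(z=k) = 1/2
x = rng.normal(mus[z], sigmas[z])   # sample x | z from the selected Gaussian

# The marginal mean of x is 0.5 * (-2) + 0.5 * 2 = 0,
# so the sample mean should be close to zero.
print(x.mean())
```

Marginalizing out $z$ is automatic here: keeping only the $x$ values and discarding $z$ leaves draws from the marginal, equation (4).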
Mixture distributions can be used to cluster data, and what we are going to study includes:
- Nonprobabilistic version: K-means algorithm
- Discrete latent variables and a related algorithm known as the EM algorithm
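As a preview of the first item, here is a minimal sketch of the K-means algorithm (Lloyd's iterations) on toy data; the function name, data, and parameter values are all illustrative:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """A minimal K-means sketch: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # Initialize centers at k distinct random data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Toy data: two well-separated 2D blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3.0, 0.5, (50, 2)),
               rng.normal(3.0, 0.5, (50, 2))])
centers, labels = kmeans(X, 2)
```

Each iteration alternates a hard assignment step with a mean-update step; the EM algorithm, discussed later, replaces the hard assignments with probabilistic (soft) ones.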
Bishop, Christopher M. *Pattern Recognition and Machine Learning*. Springer, 2006. ↩︎