Maximum Likelihood of Gaussian Mixtures

Keywords: maximum likelihood, Gaussian mixtures

Maximum Likelihood[1]

Gaussian mixtures were discussed in ‘Mixtures of Gaussians’. Once we have training data and a hypothesis, the next step is to estimate the parameters of the model. Both the parameters of the mixture of Gaussians

$$p(\boldsymbol{x})= \sum_{k=1}^{K}\pi_k\mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu}_k,\Sigma_k)\tag{1}$$

and the latent variables

$$\boldsymbol{z}\tag{2}$$

that we defined need to be estimated. When we investigated the linear model in ‘Maximum Likelihood Estimation’, we assumed the prediction $y$ of the linear model had a Gaussian distribution, and maximum likelihood was then used to derive the parameters of the model. So we can try maximum likelihood again for the Gaussian mixture task and see whether it works well.

First, we prepare the notation that will be used in the following analysis:

  • input data $\{\boldsymbol{x}_1,\cdots,\boldsymbol{x}_N\}$ with $\boldsymbol{x}_i\in \mathbb{R}^D$ for $i\in\{1,2,\cdots,N\}$, assumed to be i.i.d. Arranging them in a matrix:

$$X = \begin{bmatrix} -&\boldsymbol{x}_1^T&-\\ -&\boldsymbol{x}_2^T&-\\ &\vdots&\\ -&\boldsymbol{x}_N^T&- \end{bmatrix}\tag{3}$$

  • latent variables $\boldsymbol{z}_i$, the auxiliary random variable associated with $\boldsymbol{x}_i$, for $i\in\{1,\cdots,N\}$. Analogously to matrix (3), the matrix of latent variables is

$$Z = \begin{bmatrix} -&\boldsymbol{z}_1^T&-\\ -&\boldsymbol{z}_2^T&-\\ &\vdots&\\ -&\boldsymbol{z}_N^T&- \end{bmatrix}\tag{4}$$

With these two matrices in hand, and based on equation (1) of ‘Mixtures of Gaussians’, restated here:

$$p(\boldsymbol{x})= \sum_{k=1}^{K}\pi_k\mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu}_k,\Sigma_k)\tag{5}$$

the log-likelihood function is given by:

$$\begin{aligned} \ln p(X|\boldsymbol{\pi},\boldsymbol{\mu},\Sigma)&=\ln \left(\prod_{n=1}^N\sum_{k=1}^{K}\pi_k\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)\right)\\ &=\sum_{n=1}^{N}\ln \sum_{k=1}^{K}\pi_k\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k) \end{aligned}\tag{6}$$

This looks different from the single Gaussian model, where the logarithm acts directly on $\mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu}_k,\Sigma_k)$, which is an exponential function and therefore simplifies. Here the summation inside the logarithm makes the maximization hard to solve in closed form.
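To make equation (6) concrete, here is a minimal sketch of evaluating this log-likelihood numerically, assuming NumPy and SciPy are available. The function name `gmm_log_likelihood` and the toy data are my own illustration, not part of the original derivation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mu, Sigma):
    """Evaluate equation (6): sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k).

    X     : (N, D) data matrix, one sample per row as in equation (3)
    pi    : (K,)   mixing coefficients, non-negative and summing to one
    mu    : (K, D) component means
    Sigma : (K, D, D) component covariance matrices
    """
    K = len(pi)
    # densities[n, k] = pi_k * N(x_n | mu_k, Sigma_k)
    densities = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
        for k in range(K)
    ])
    # the sum over k sits inside the logarithm, which is exactly what
    # prevents a closed-form maximization
    return np.sum(np.log(densities.sum(axis=1)))

# toy example: N = 100 points in D = 2, K = 2 spherical components
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                      # rows are x_n^T as in (3)
pi = np.array([0.4, 0.6])
mu = np.array([[0.0, 0.0], [1.0, 1.0]])
Sigma = np.array([np.eye(2), np.eye(2)])
print(gmm_log_likelihood(X, pi, mu, Sigma))
```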

Moreover, the ordering of the components in a Gaussian mixture is arbitrary: permuting the $K$ components leaves the distribution unchanged, so there are $K!$ equivalent solutions. Which one of them we obtain does not affect the model.
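This symmetry can be checked numerically. Continuing the sketch above (the permutation below is just one of the $K!=2$ possibilities for the toy example):

```python
# Swapping the two components (one of the K! = 2 permutations for K = 2)
# leaves the log-likelihood (6) unchanged: label switching.
perm = [1, 0]
print(np.isclose(gmm_log_likelihood(X, pi, mu, Sigma),
                 gmm_log_likelihood(X, pi[perm], mu[perm], Sigma[perm])))  # True
```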

Maximum Likelihood Could Fail

The other differences from the single Gaussian model concern the possible singularity of the covariance matrix and the situation $\boldsymbol{x}_n=\boldsymbol{\mu}_j$.

Covariance Matrix $\Sigma$ is not Singular

In a Gaussian distribution the covariance matrix must be invertible, so in the following discussion we assume all covariance matrices are non-singular. For simplicity we take $\Sigma_k=\delta_k^2 I$, where $I$ is the identity matrix.

When $\boldsymbol{x}_n=\boldsymbol{\mu}_j$

When a sample point happens to equal the mean $\boldsymbol{\mu}_j$ of the $j$-th component, the Gaussian density of the random variable $\boldsymbol{x}_n$ under that component is:

$$\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_j,\delta_j^2I)=\frac{1}{(2\pi)^{\frac{D}{2}}}\frac{1}{\delta_j^D}\tag{7}$$

When the standard deviation $\delta_j\to 0$, this term goes to infinity, the log-likelihood becomes unbounded, and the whole maximization fails.

This problem does not arise in a single Gaussian model: if the variance shrinks around one data point, the factors contributed by the other data points go to zero even faster, so the overall likelihood goes to zero rather than to infinity. In a mixture, the other components keep those factors away from zero while the collapsing component drives the likelihood to infinity.
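A small numerical check makes the contrast visible. Continuing the earlier sketch (pinning component 0 on the first data point is just an arbitrary illustration), the mixture log-likelihood grows without bound as $\delta$ shrinks, while a single Gaussian fitted the same way collapses:

```python
# Pin the mean of component 0 exactly on the first data point and shrink its
# standard deviation delta: the mixture log-likelihood diverges to +infinity,
# while the log-likelihood of a single Gaussian centred on x_0 goes to -infinity.
for delta in [1.0, 0.1, 0.01, 0.001]:
    mu_sing = np.array([X[0], [1.0, 1.0]])                 # mu_0 = x_0 exactly
    Sigma_sing = np.array([delta**2 * np.eye(2), np.eye(2)])
    ll_mix = gmm_log_likelihood(X, pi, mu_sing, Sigma_sing)
    ll_single = np.sum(multivariate_normal.logpdf(X, mean=X[0],
                                                  cov=delta**2 * np.eye(2)))
    print(f"delta={delta:6.3f}  mixture: {ll_mix:10.1f}  single Gaussian: {ll_single:12.1f}")
```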

Conclusion

Plain maximum likelihood is therefore not well suited to a Gaussian mixture model. We introduce the EM algorithm in the next post.

References


  1. Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006. ↩︎