Keywords: maximum likelihood, Gaussian mixtures

## Maximum Likelihood[1]

Gaussian mixtures had been discussed in ‘Mixtures of Gaussians’. And once we have training data and a certain hypothesis, what we should do next is estimating the parameters of the model. Both kinds of parameters from a mixture of Gaussians

$p(\boldsymbol{x})= \sum_{k=1}^{K}\pi_k\mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu}_k,\Sigma_k)\tag{1}$

and latent variables:

$\boldsymbol{z}\tag{2}$

we defined, need to estimate. However, when we investigated the linear model ‘Maximum Likelihood Estimation’, we considered the prediction $y$ of the linear model has a Gaussian distribution, and then maximum likelihood was used to derive parameters in the model. So, we could employ the maximum likelihood again in the Gaussian mixture task, and find whether it could work well.

Firstly, we should prepare the notations that will be used in following analysis:

• input data $\{\boldsymbol{x}_1,\cdots,\boldsymbol{x}_N\}$ for $\boldsymbol{x}_i\in \mathbb{R}^D$ and $i=\{1,2,\cdots,N\}$ and assuming they are i.i.d. Rearranging them in a matrix:

$X = \begin{bmatrix} -&\boldsymbol{x}_1^T&-\\ -&\boldsymbol{x}_2^T&-\\ &\vdots&\\ -&\boldsymbol{x}_N^T&-\\ \end{bmatrix}\tag{3}$

• Latent varibales $\boldsymbol{z}_i$, the assistant random varible to $\boldsymbol{x}_i$ for $i\in\{1,\cdots,N\}$. And according to matrix (3), the matrix of latent varibales is

$Z = \begin{bmatrix} -&\boldsymbol{z}_1^T&-\\ -&\boldsymbol{z}_2^T&-\\ &\vdots&\\ -&\boldsymbol{z}_N^T&-\\ \end{bmatrix}\tag{4}$

Once we got these two matrices, basing on the equation (1) in ‘Mixtures of Gaussians’ as following:

$p(\boldsymbol{x})= \sum_{k=1}^{K}\pi_k\mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu}_k,\Sigma_k)\tag{5}$

the log-likelihood function is given by:

\begin{aligned} \ln p(\boldsymbol{x}|\boldsymbol{\pi},\boldsymbol{\mu},\Sigma)&=\ln (\Pi_{n=1}^N\sum_{k=1}^{K}\pi_k\mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu}_k,\Sigma_k))\\ &=\sum_{n=1}^{N}\ln \sum_{k=1}^{K}\pi_k\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)\\ \end{aligned}\tag{6}

This looks different from the single Gaussian model where the logarithm operates directly on $\mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu}_k,\Sigma_k)$ who is an exponential function. However, the existence of summation in the logarithm makes the problem hard to solve.

And for the combination of the Gaussian mixture is arbitrary. We could have $K!$ equivalent solutions. So which one we get did not effect on our model.

## Maximum Likelihood Could Fail

The other differences from the single Gaussian model are the singularity of the covariance matrix and the condition $\boldsymbol{x}_n=\boldsymbol{\mu}_j$.

### Covariance Matrix $\Sigma$ is not Singular

For in the Gaussian distribution, the covariance matrix must be able to be inverted. So in our following discussion, we assume all the covariance matrice are invisible. For simplicity we take $\Sigma_k=\delta_k^2 I$ where $I$ is identity matrix.

### When $\boldsymbol{x}_n=\boldsymbol{\mu}_j$

When a point in sample accidentally equals to the mean $\mu_j$, the Gaussin distribution of the random variable $\boldsymbol{x}_n$ is:

$\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\delta_j^2I)=\frac{1}{2\pi^{\frac{1}{2}}}\frac{1}{\delta_j}\tag{7}$

When the variance $\delta_j\to 0$, this part goes to infinity and the whole algorithm failed.

This problem does not exist in a single Gaussian model, for the $\boldsymbol{x}_n-\boldsymbol{\mu}_j=0$ is not an exponent in its log-likelihood.

## Conclusion

The maximum likelihood method is not suited for a Gaussian mixture model. Then we introduce the EM algorithm in the next post.

## References

1. Bishop, Christopher M. Pattern recognition and machine learning. springer, 2006. ↩︎