EM Algorithm

Keywords: EM Algorithm

EM Algorithm for Gaussian Mixture[1]

Analysis

Maximum likelihood cannot be applied to the Gaussian mixture model directly because of the severe defects we came across in ‘Maximum Likelihood of Gaussian Mixtures’. Inspired by K-means, a two-step algorithm was developed.

The objective function is the log-likelihood function:

$$
\begin{aligned}
\ln p(\boldsymbol{X}|\boldsymbol{\pi},\boldsymbol{\mu},\Sigma)&=\ln \left(\prod_{n=1}^N\sum_{k=1}^{K}\pi_k\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)\right)\\
&=\sum_{n=1}^{N}\ln \sum_{k=1}^{K}\pi_k\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)
\end{aligned}\tag{1}
$$
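To make equation (1) concrete, here is a minimal sketch of how the log-likelihood of a Gaussian mixture could be evaluated with NumPy and SciPy. The function name and the toy parameters `X`, `pi`, `mu`, `Sigma` are illustrative assumptions, not part of the original post.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mu, Sigma):
    """Equation (1): sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    # densities[n, k] = N(x_n | mu_k, Sigma_k)
    densities = np.stack(
        [multivariate_normal(mean=mu[k], cov=Sigma[k]).pdf(X) for k in range(len(pi))],
        axis=1)
    return np.sum(np.log(densities @ pi))

# hypothetical 2-component, 2-D example
X = np.random.randn(100, 2)
pi = np.array([0.4, 0.6])
mu = np.array([[0.0, 0.0], [2.0, 2.0]])
Sigma = np.array([np.eye(2), np.eye(2)])
print(gmm_log_likelihood(X, pi, mu, Sigma))
```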

At a maximum of the log-likelihood, the partial derivatives with respect to the parameters must be 0. So we first calculate the partial derivative with respect to $\boldsymbol{\mu}_k$:

$$
\begin{aligned}
\frac{\partial \ln p(\boldsymbol{X}|\boldsymbol{\pi},\boldsymbol{\mu},\Sigma)}{\partial \boldsymbol{\mu}_k}
&=\sum_{n=1}^N\frac{\pi_k \mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)\Sigma_k^{-1}(\boldsymbol{x}_n-\boldsymbol{\mu}_k)}{\sum_{j=1}^{K}\pi_j\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_j,\Sigma_j)}\\
&=\sum_{n=1}^N\frac{\pi_k \mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_j,\Sigma_j)}\Sigma_k^{-1}(\boldsymbol{x}_n-\boldsymbol{\mu}_k)
\end{aligned}\tag{2}
$$
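As a sanity check, the gradient in equation (2) can be compared against central finite differences of the log-likelihood. The following is a quick sketch on made-up toy parameters; every name here is an illustrative assumption.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))                      # toy data
pi = np.array([0.3, 0.7])
mu = np.array([[-1.0, 0.0], [1.0, 0.5]])
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])

def log_likelihood(mu_0):
    """Log-likelihood of the mixture with the first mean replaced by mu_0."""
    m = np.array([mu_0, mu[1]])
    dens = np.stack([multivariate_normal(m[j], Sigma[j]).pdf(X) for j in range(2)], axis=1)
    return np.sum(np.log(dens @ pi))

# analytic gradient with respect to the first mean, following equation (2)
dens = np.stack([multivariate_normal(mu[j], Sigma[j]).pdf(X) for j in range(2)], axis=1)
gamma = dens * pi / (dens @ pi)[:, None]          # posterior weight of each point under each component
analytic = ((gamma[:, 0, None] * (X - mu[0])) @ np.linalg.inv(Sigma[0])).sum(axis=0)

# numerical gradient by central differences
eps = 1e-5
numeric = np.array([(log_likelihood(mu[0] + eps * e) - log_likelihood(mu[0] - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
print(analytic, numeric)                          # the two vectors should agree closely
```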

Setting equation (2) equal to 0 and rearranging gives:

$$
\sum_{n=1}^N\frac{\pi_k \mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_j,\Sigma_j)}\boldsymbol{x}_n=\sum_{n=1}^N\frac{\pi_k \mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_j,\Sigma_j)}\boldsymbol{\mu}_k\tag{3}
$$

In the post ‘Mixtures of Gaussians’, we defined the responsibility:

$$
\gamma_{nk}=p(z_{nk}=1|\boldsymbol{x}_n)=\frac{\pi_k \mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_j,\Sigma_j)}\tag{4}
$$
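Equation (4) is exactly what the E step computes: a weighted density for each component, normalized over the components. A minimal vectorized sketch (the function name and layout are my own choices for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mu, Sigma):
    """Equation (4): gamma[n, k] = pi_k N(x_n|mu_k, Sigma_k) / sum_j pi_j N(x_n|mu_j, Sigma_j)."""
    weighted = np.stack(
        [pi[k] * multivariate_normal(mean=mu[k], cov=Sigma[k]).pdf(X) for k in range(len(pi))],
        axis=1)                                   # shape (N, K)
    return weighted / weighted.sum(axis=1, keepdims=True)
```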

Substituting equation (4) into equation (3):

$$
\begin{aligned}
\sum_{n=1}^N\gamma_{nk}\boldsymbol{x}_n&=\sum_{n=1}^N\gamma_{nk}\boldsymbol{\mu}_k\\
\sum_{n=1}^N\gamma_{nk}\boldsymbol{x}_n&=\boldsymbol{\mu}_k\sum_{n=1}^N\gamma_{nk}\\
\boldsymbol{\mu}_k&=\frac{\sum_{n=1}^N\gamma_{nk}\boldsymbol{x}_n}{\sum_{n=1}^N\gamma_{nk}}
\end{aligned}\tag{5}
$$

To simplify equation (5), we define:

$$
N_k = \sum_{n=1}^N\gamma_{nk}\tag{6}
$$

Then equation (5) can be written as:

$$
\boldsymbol{\mu}_k=\frac{1}{N_k}\sum_{n=1}^N\gamma_{nk}\boldsymbol{x}_n\tag{7}
$$

The same calculation applied to $\frac{\partial \ln p(\boldsymbol{X}|\boldsymbol{\pi},\boldsymbol{\mu},\Sigma)}{\partial \Sigma_k}=0$ gives:

$$
\Sigma_k = \frac{1}{N_k}\sum_{n=1}^N\gamma_{nk}(\boldsymbol{x}_n - \boldsymbol{\mu}_k)(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^T\tag{8}
$$

However, the situation for $\pi_k$ is a little more complex, because it is subject to the constraint:

$$
\sum_{k=1}^K \pi_k = 1 \tag{9}
$$

so a Lagrange multiplier is employed and the objective function becomes:

$$
\ln p(\boldsymbol{X}|\boldsymbol{\pi},\boldsymbol{\mu},\Sigma)+\lambda \left(\sum_{k=1}^K \pi_k-1\right)\tag{10}
$$

Setting the partial derivative of equation (10) with respect to $\pi_k$ to 0:

$$
0 = \sum_{n=1}^N\frac{\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_j,\Sigma_j)}+\lambda\tag{11}
$$

Multiplying both sides by $\pi_k$ and summing over $k$:

$$
\begin{aligned}
0 &= \sum_{k=1}^K\left(\sum_{n=1}^N\frac{\pi_k\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_j,\Sigma_j)}+\lambda\pi_k\right)\\
0&=\sum_{k=1}^K\sum_{n=1}^N\frac{\pi_k\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_j,\Sigma_j)}+\sum_{k=1}^K\lambda\pi_k\\
0&=\sum_{n=1}^N\sum_{k=1}^K\gamma_{nk}+\lambda\sum_{k=1}^K\pi_k\\
\lambda &= -N
\end{aligned}\tag{12}
$$

The last step of equation (12) follows from $\sum_{k=1}^K\pi_k=1$ and $\sum_{k=1}^K\gamma_{nk}=1$.

Then we substitute equation (12) into equation (11):

$$
\begin{aligned}
0 &= \sum_{n=1}^N\frac{\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_j,\Sigma_j)}-N\\
N &= \frac{1}{\pi_k}\sum_{n=1}^N\gamma_{nk}\\
\pi_k&=\frac{N_k}{N}
\end{aligned}\tag{13}
$$

The last step of equation (13) follows from the definition in equation (6).
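Collecting equations (6), (7), (8), and (13): given a responsibility matrix, all three parameter updates can be written compactly in NumPy. This is a hedged sketch (the names and vectorized layout are assumptions), not the implementation used later in this post:

```python
import numpy as np

def m_step(X, gamma):
    """Update pi, mu, Sigma from responsibilities gamma with shape (N, K)."""
    N, D = X.shape
    K = gamma.shape[1]
    N_k = gamma.sum(axis=0)                                         # equation (6)
    mu = (gamma.T @ X) / N_k[:, None]                               # equation (7)
    Sigma = np.zeros((K, D, D))
    for k in range(K):
        diff = X - mu[k]                                            # (N, D)
        Sigma[k] = ((gamma[:, k, None] * diff).T @ diff) / N_k[k]   # equation (8)
    pi = N_k / N                                                    # equation (13)
    return pi, mu, Sigma
```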

Algorithm

Equations (7), (8), and (13) do not constitute a closed-form solution, because the responsibilities $\gamma_{nk}$ depend on all of the parameters; for example, in equation (7) the parameter $\boldsymbol{\mu}_k$ appears on both sides of the equation.

However, the equations suggest an iterative scheme for finding a solution, consisting of two steps, expectation and maximization:

  • E step: calculate the posterior probabilities (responsibilities) of equation (4) with the current parameters
  • M step: update the parameters by equations (7), (8), and (13)

The initial values of the parameters could be selected randomly, but other tricks are often used, such as initialization with K-means. The stopping condition can be either of:

  1. increase of log-likelihood falls below some threshold
  2. change of parameters falls below some threshold.

Python Code for EM

The input data should be normalized as in the ‘K-means algorithm’ post.
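I am not reproducing the normalization code here; a common choice, and the one assumed in this sketch, is min-max scaling of each feature to [0, 1] (the original K-means post may use a different scheme):

```python
import numpy as np

def normalize(x):
    # assumed min-max scaling per feature; constant features would need special handling
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    return (x - x_min) / (x_max - x_min)
```

With the data normalized, the EM implementation from the project is: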

```python
import numpy as np
import k_means  # K-means implementation from the same project, used for initialization


def Gaussian(x, u, variance):
    """Density of a multivariate Gaussian N(x | u, variance)."""
    k = len(x)
    return np.power(2 * np.pi, -k / 2.) * np.power(np.linalg.det(variance), -1 / 2) \
        * np.exp(-0.5 * (x - u).dot(np.linalg.inv(variance)).dot((x - u).transpose()))


class EM():
    def mixed_Gaussian(self, x, pi, u, covariance):
        """Mixture density: sum_k pi_k N(x | u_k, Sigma_k)."""
        res = 0
        for i in range(len(pi)):
            res += pi[i] * Gaussian(x, u[i], covariance[i])
        return res

    def clusturing(self, x, d, initial_method='K_Means'):
        data_dimension = x.shape[1]
        data_size = x.shape[0]
        if initial_method == 'K_Means':
            km = k_means.K_Means()
            # K-means gives the initial mean vectors; each row is a mean vector's transpose
            centers, cluster_for_each_point = km.clusturing(x, d)
        # initial mixing coefficients pi
        pi = np.ones(d) / d
        # initial covariances
        covariance = np.zeros((d, data_dimension, data_dimension))
        for i in range(d):
            covariance[i] = np.identity(data_dimension) / 10.0
        responsibility = np.zeros((data_size, d))
        log_likelihood = 0
        log_likelihood_last_time = 0
        for dummy in range(1, 1000):
            log_likelihood_last_time = log_likelihood
            # E step: responsibilities, equation (4)
            for i in range(data_size):
                responsibility_numerator = np.zeros(d)
                responsibility_denominator = 0
                for j in range(d):
                    responsibility_numerator[j] = pi[j] * Gaussian(x[i], centers[j], covariance[j])
                    responsibility_denominator += responsibility_numerator[j]
                for j in range(d):
                    responsibility[i][j] = responsibility_numerator[j] / responsibility_denominator

            # M step: N_k, equation (6)
            N_k = np.zeros(d)
            for j in range(d):
                for i in range(data_size):
                    N_k[j] += responsibility[i][j]
            for i in range(d):
                # mean, equation (7)
                sum_r_x = 0
                for j in range(data_size):
                    sum_r_x += responsibility[j][i] * x[j]
                if N_k[i] != 0:
                    centers[i] = 1 / N_k[i] * sum_r_x
                # covariance, equation (8)
                sum_r_v = np.zeros((data_dimension, data_dimension))
                for j in range(data_size):
                    temp = (x[j] - centers[i]).reshape(1, -1)
                    temp_T = (x[j] - centers[i]).reshape(-1, 1)
                    sum_r_v += responsibility[j][i] * (temp_T.dot(temp))
                if N_k[i] != 0:
                    covariance[i] = 1 / N_k[i] * sum_r_v
                # mixing coefficient, equation (13)
                pi[i] = N_k[i] / data_size
            # log-likelihood, equation (1), used as the stopping condition
            log_likelihood = 0
            for i in range(data_size):
                log_likelihood += np.log(self.mixed_Gaussian(x[i], pi, centers, covariance))
            if np.abs(log_likelihood - log_likelihood_last_time) < 0.001:
                break
            print(log_likelihood_last_time)
        return pi, centers, covariance
```
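A hypothetical usage of the class above (the toy data and the number of components are made up, and the `k_means` module from the same project must be importable for the initialization):

```python
import numpy as np

# two well-separated 2-D blobs inside [0, 1], illustrative only
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.2, 0.05, (100, 2)),
               rng.normal(0.8, 0.05, (100, 2))])

em = EM()
pi, centers, covariance = em.clusturing(x, 2)
print(pi)
print(centers)
```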

The entire project can be found at https://github.com/Tony-Tan/ML; please star it.

The progress of EM (initialized with K-means):

and the final result is:

where each ellipse represents a covariance matrix: the axes of the ellipse point in the directions of the eigenvectors of the covariance matrix, and their lengths correspond to the eigenvalues.
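The post does not include the plotting code, so here is a minimal sketch of how such an ellipse could be drawn with matplotlib; the names are assumptions, and the axis lengths are taken proportional to the eigenvalues to match the description above (scaling by the square roots of the eigenvalues is another common convention):

```python
import numpy as np
from matplotlib.patches import Ellipse

def covariance_ellipse(ax, mean, cov, scale=1.0, **kwargs):
    """Draw an ellipse whose axes follow the eigenvectors of cov."""
    eigenvalues, eigenvectors = np.linalg.eigh(cov)     # eigenvalues in ascending order
    # orientation of the principal axis (largest eigenvalue)
    angle = np.degrees(np.arctan2(eigenvectors[1, -1], eigenvectors[0, -1]))
    ellipse = Ellipse(xy=mean,
                      width=2 * scale * eigenvalues[-1],
                      height=2 * scale * eigenvalues[0],
                      angle=angle, fill=False, **kwargs)
    ax.add_patch(ellipse)
```

Calling `covariance_ellipse(ax, centers[k], covariance[k])` for each component overlays the fitted Gaussians on a scatter plot of the data.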

References


  1. Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006. ↩︎