Maximum Likelihood Estimation

Keywords: maximum likelihood estimation

Square Loss Function for Regression[1]

For any input $\boldsymbol{x}$, our goal in a regression task is to give a prediction $\hat{y}=y(\boldsymbol{x})$ that approximates the target $t$, where the function $y$ is the chosen hypothesis. The difference between $t$ and $\hat{y}$ can be called 'error' or, more precisely, 'loss': in an approximation task the 'error' occurs by chance and always exists, so 'loss' is the better word for this difference. The loss can be written generally as a function $\ell(y(\boldsymbol{x}),t)$. Intuitively, the smaller the loss, the better the approximation. So the expectation of the loss:

$$\mathbb E[\ell]=\iint \ell(y(\boldsymbol{x}),t)\,p(\boldsymbol{x},t)\,d\boldsymbol{x}\,dt\tag{1}$$

should be small.

From a probabilistic viewpoint, the input vector $\boldsymbol{x}$, the target $t$, and the parameters of the function $y$ are all random variables, so the expectation of the loss function exists.

Consider the square error loss $\ell=(y(\boldsymbol{x})-t)^2$, which is one way to measure the difference between our prediction and the target. Substituting this loss function into equation (1), we have:

$$\mathbb E[\ell]=\iint (y(\boldsymbol{x})-t)^2p(\boldsymbol{x},t)\,d\boldsymbol{x}\,dt\tag{2}$$
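
As a quick numerical illustration (not from the reference), the expectation in equation (2) can be estimated by Monte Carlo sampling from a hypothetical joint distribution $p(\boldsymbol{x},t)$; all the distributions and the predictor below are made-up choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint p(x, t): x ~ U(0, 1), t = sin(2*pi*x) + Gaussian noise
n = 100_000
x = rng.uniform(0.0, 1.0, size=n)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=n)

def y(x):
    """Candidate predictor; here we happen to pick the true mean function."""
    return np.sin(2 * np.pi * x)

# Monte Carlo estimate of E[l] in equation (2)
expected_loss = np.mean((y(x) - t) ** 2)
print(expected_loss)  # close to the noise variance 0.1**2 = 0.01
```

With the true mean function as predictor, the expected square loss reduces to the irreducible noise variance.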

To minimize this functional, we can use the Euler-Lagrange equation together with Fubini's theorem.

Fubini's theorem tells us that we can change the order of integration:

$$\begin{aligned} \mathbb E[\ell]&=\iint (y(\boldsymbol{x})-t)^2p(\boldsymbol{x},t)\,d\boldsymbol{x}\,dt\\ &=\iint (y(\boldsymbol{x})-t)^2p(\boldsymbol{x},t)\,dt\,d\boldsymbol{x} \end{aligned}\tag{3}$$

Following the Euler-Lagrange approach, we first define the inner integral as a new function $G(x,y,y')$:

$$G(x,y,y')= \int (y(\boldsymbol{x})-t)^2p(\boldsymbol{x},t)\,dt\tag{4}$$

The Euler-Lagrange equation gives the stationarity condition for minimizing equation (2):

$$\frac{\partial G}{\partial y}-\frac{d}{dx}\frac{\partial G}{\partial y'}=0\tag{5}$$

Because there is no $y'$ term in the function $G$, the equation:

$$\frac{\partial G}{\partial y}=0\tag{6}$$

becomes the necessary condition for minimizing equation (2):

$$2\int (y(\boldsymbol{x})-t)p(\boldsymbol{x},t)\,dt=0 \tag{7}$$

What we want in order to minimize the loss is the predictor $y$, so rearranging equation (7), we get the predictor that minimizes the square loss function:

$$\begin{aligned} \int (y(\boldsymbol{x})-t)p(\boldsymbol{x},t)\,dt&=0\\ \int y(\boldsymbol{x})p(\boldsymbol{x},t)\,dt-\int tp(\boldsymbol{x},t)\,dt&=0\\ y(\boldsymbol{x})\int p(\boldsymbol{x},t)\,dt&=\int tp(\boldsymbol{x},t)\,dt\\ y(\boldsymbol{x})&=\frac{\int tp(\boldsymbol{x},t)\,dt}{\int p(\boldsymbol{x},t)\,dt}\\ y(\boldsymbol{x})&=\frac{\int tp(\boldsymbol{x},t)\,dt}{p(\boldsymbol{x})}\\ y(\boldsymbol{x})&=\int tp(t|\boldsymbol{x})\,dt\\ y(\boldsymbol{x})&= \mathbb{E}_t[t|\boldsymbol{x}] \end{aligned}\tag{8}$$

To minimize the expectation of the square loss function, we find that the expectation of $t$ given $\boldsymbol{x}$ is the optimum solution: $y(\boldsymbol{x})=\mathbb{E}[t|\boldsymbol{x}]$. The expectation of $t$ given $\boldsymbol{x}$ is also called the regression function.

A small summary: $\mathbb{E}[t|\boldsymbol{x}]$ is the optimal prediction $y(\boldsymbol{x})$ under the square loss.
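
A small numerical check of this result (the conditional distribution below is a made-up example): sampling $t$ from a hypothetical $p(t|\boldsymbol{x})$ for one fixed $\boldsymbol{x}$ and scanning candidate predictions shows the squared loss is minimized at the sample mean, i.e. at $\mathbb{E}[t|\boldsymbol{x}]$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical conditional p(t|x) for one fixed x: Gaussian with mean 2.0
t = rng.normal(2.0, 0.5, size=50_000)

candidates = np.linspace(0.0, 4.0, 401)            # candidate predictions y(x)
losses = [np.mean((c - t) ** 2) for c in candidates]
best = candidates[np.argmin(losses)]

print(best)  # within grid resolution of np.mean(t), the conditional mean
```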

Maximum Likelihood Estimation

Generally, we assume that there is a generator behind the data:

$$t=g(\boldsymbol{x},\boldsymbol{w})+\epsilon\tag{9}$$

where the function $g(\boldsymbol{x},\boldsymbol{w})$ is deterministic, $t$ is the target variable, and $\epsilon$ is a zero-mean Gaussian random variable with precision $\beta$ (the inverse variance). Because of this additive Gaussian noise, $t$ itself has a Gaussian distribution with mean (expectation) $g(\boldsymbol{x},\boldsymbol{w})$ and precision $\beta$. Recalling the standard form of the Gaussian distribution:

$$\begin{aligned} \mathbb{P}(t|\boldsymbol{x},\boldsymbol{w},\beta)&=\mathcal{N}(t|g(\boldsymbol{x},\boldsymbol{w}),\beta^{-1})\\ &=\sqrt{\frac{\beta}{2\pi}}\,\mathrm{e}^{-\frac{\beta}{2}(t-g(\boldsymbol{x},\boldsymbol{w}))^2} \end{aligned}\tag{10}$$
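
A sketch of this generator, with an arbitrary linear choice for $g$ and made-up parameter values, confirms that samples of $t$ have mean $g(\boldsymbol{x},\boldsymbol{w})$ and variance $\beta^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(2)

beta = 25.0                      # precision, so the variance is 1/beta = 0.04
w = np.array([0.5, -1.0])        # hypothetical parameters of g

def g(x, w):
    # deterministic part of the generator (a linear choice, for illustration)
    return w[0] + w[1] * x

x0 = 0.3
t = g(x0, w) + rng.normal(0.0, np.sqrt(1.0 / beta), size=200_000)

print(t.mean())   # close to g(x0, w) = 0.2
print(t.var())    # close to 1/beta = 0.04
```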

Our task here is to approximate the generator in equation (9) with a linear function. When we use the square loss function, the optimum solution is $\mathbb{E}[t|\boldsymbol{x}]$ by equation (8), and for the distribution in equation (10) this is:

$$y(\boldsymbol{x})=\mathbb{E}[t|\boldsymbol{x}]=g(\boldsymbol{x},\boldsymbol{w})\tag{11}$$

Then we set our linear model as:

$$y(\boldsymbol{x})=\boldsymbol{w}^T\boldsymbol{x}+b\tag{12}$$

and this can be transformed as:

$$y(\boldsymbol{x})= \begin{bmatrix} b&\boldsymbol{w}^T \end{bmatrix} \begin{bmatrix} 1\\ \boldsymbol{x} \end{bmatrix}=\boldsymbol{w}_a^T\boldsymbol{x}_a \tag{13}$$
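
The bias-absorption trick of equation (13) can be checked numerically; the weights and inputs below are arbitrary values chosen for the sketch:

```python
import numpy as np

w = np.array([2.0, -3.0])        # weights (arbitrary values)
b = 0.5                          # bias
x = np.array([1.5, 0.25])        # one input vector

y_plain = w @ x + b              # y(x) = w^T x + b

w_a = np.concatenate(([b], w))   # augmented weights [b, w]
x_a = np.concatenate(([1.0], x)) # augmented input  [1, x]
y_aug = w_a @ x_a                # y(x) = w_a^T x_a

print(y_plain, y_aug)  # identical: 2*1.5 - 3*0.25 + 0.5 = 2.75
```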

For short, we just write $\boldsymbol{w}_a$ and $\boldsymbol{x}_a$ as $\boldsymbol{w}$ and $\boldsymbol{x}$. Then the linear model becomes:

$$y(\boldsymbol{x})=\boldsymbol{w}^T\boldsymbol{x}\tag{14}$$

To get the prediction we need to find out what $\boldsymbol{w}$ is. As mentioned above, we treat all parameters as random variables, so the conditional distribution of $\boldsymbol{w}$ is $\mathbb{P}(\boldsymbol{w}|\boldsymbol{t},\beta)$. ($X$ or $\boldsymbol{x}$ can be omitted from the conditioning, since its distribution is not of concern here.) Bayes' theorem tells us:

$$\mathbb{P}(\boldsymbol{w}|\boldsymbol{t},\beta)=\frac{\mathbb{P}(\boldsymbol{t}|\boldsymbol{w},\beta)\,\mathbb{P}(\boldsymbol{w})}{\mathbb{P}(\boldsymbol{t})}=\frac{\text{Likelihood}\times \text{Prior}}{\text{Evidence}}\tag{15}$$

We want to find the $\boldsymbol{w}^{\star}$ that maximizes the posterior probability $\mathbb{P}(\boldsymbol{w}|\boldsymbol{t},\beta)$. If we treat the prior $\mathbb{P}(\boldsymbol{w})$ as constant (uniform) and note that the evidence $\mathbb{P}(\boldsymbol{t})$ does not depend on $\boldsymbol{w}$, then maximizing the likelihood $\mathbb{P}(\boldsymbol{t}|\boldsymbol{w},\beta)$ maximizes the posterior probability.

$$\begin{aligned} \mathbb{P}(\boldsymbol{t}|\boldsymbol{w},\beta)&=\prod_{i=0}^{N}\mathcal{N}(t_i|\boldsymbol{w}^T\boldsymbol{x}_i,\beta^{-1})\\ \ln \mathbb{P}(\boldsymbol{t}|\boldsymbol{w},\beta)&=\sum_{i=0}^{N}\ln \mathcal{N}(t_i|\boldsymbol{w}^T\boldsymbol{x}_i,\beta^{-1})\\ &=\sum_{i=0}^{N}\ln \sqrt{\frac{\beta}{2\pi}}\,\mathrm{e}^{-\frac{\beta}{2}(t_i-\boldsymbol{w}^T\boldsymbol{x}_i)^2}\\ &=\frac{1}{2}\sum_{i=0}^{N} \ln \beta - \frac{1}{2}\sum_{i=0}^{N} \ln 2\pi - \frac{\beta}{2}\sum_{i=0}^{N}(t_i-\boldsymbol{w}^T\boldsymbol{x}_i)^2 \end{aligned}\tag{16}$$

This gives us a wonderful result. Only the last term of equation (16), $\frac{\beta}{2}\sum_{i=0}^{N}(t_i-\boldsymbol{w}^T\boldsymbol{x}_i)^2$, depends on $\boldsymbol{w}$; the terms involving only $\ln \beta$ and $\ln 2\pi$ are fixed by the assumptions. In other words, to maximize the likelihood, we just need to minimize:

$$\sum_{i=0}^{N}(t_i-\boldsymbol{w}^T\boldsymbol{x}_i)^2\tag{17}$$

This is just minimizing the sum of squares, so the optimization problem reduces to the least squares problem.
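
A minimal sketch of this reduction, on synthetic data from a linear generator with made-up parameters: maximizing the Gaussian likelihood is the same as solving the least squares problem, which `np.linalg.lstsq` does directly.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data from the generator t = w^T x + eps, eps ~ N(0, 1/beta)
true_w = np.array([1.0, 2.0, -0.5])   # includes the bias as w[0]
beta = 100.0
N = 5_000

X = np.hstack([np.ones((N, 1)), rng.uniform(-1, 1, size=(N, 2))])
t = X @ true_w + rng.normal(0.0, np.sqrt(1.0 / beta), size=N)

# Maximizing the Gaussian likelihood = minimizing the sum of squares (eq. 17)
w_ml, *_ = np.linalg.lstsq(X, t, rcond=None)

print(w_ml)  # close to [1.0, 2.0, -0.5]
```

The recovered `w_ml` approaches the generator's parameters as $N$ grows, which is the maximum likelihood estimate in action.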

Least Square Estimation and Maximum Likelihood Estimation

When we assume there is a generator:

$$t=g(\boldsymbol{x},\boldsymbol{w})+\epsilon\tag{18}$$

behind the data, and $\epsilon$ has a zero-mean Gaussian distribution with any precision $\beta$, the maximum likelihood estimation converts to the least squares estimation. This holds not only for linear regression, because we made no assumption about what $g(\boldsymbol{x},\boldsymbol{w})$ is.

However, when $\epsilon$ has a distribution other than Gaussian, the least squares estimation will no longer be the optimum solution.
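
For instance (a standard result, not derived in the text above), if $\epsilon$ instead follows a zero-mean Laplace distribution with scale $b$, the same argument as in equation (16) gives:

$$\begin{aligned} \mathbb{P}(t_i|\boldsymbol{x}_i,\boldsymbol{w},b)&=\frac{1}{2b}\,\mathrm{e}^{-\frac{|t_i-\boldsymbol{w}^T\boldsymbol{x}_i|}{b}}\\ \ln \mathbb{P}(\boldsymbol{t}|\boldsymbol{w},b)&=-\sum_{i=0}^{N}\ln 2b-\frac{1}{b}\sum_{i=0}^{N}|t_i-\boldsymbol{w}^T\boldsymbol{x}_i| \end{aligned}$$

so maximizing the likelihood now minimizes the sum of absolute errors rather than the sum of squares.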


  1. Bishop, Christopher M. *Pattern Recognition and Machine Learning*. Springer, 2006. ↩︎