Keywords: maximum likelihood estimation

## Square Loss Function for Regression

For any input $\boldsymbol{x}$, the goal of a regression task is to give a prediction $\hat{y}=y(\boldsymbol{x})$ that approximates the target $t$, where the function $y$ is the chosen hypothesis. The difference between $t$ and $\hat{y}$ can be called the ‘error’ or, more precisely, the ‘loss’: in an approximation task some error always occurs by chance, so ‘loss’ is the better word for the quantity we try to reduce. The loss can be written generally as a function $\ell(y(\boldsymbol{x}),t)$. Intuitively, the smaller the loss, the better the approximation. So the expectation of the loss:

$\mathbb E[\ell]=\int\int \ell(y(\boldsymbol{x}),t)p(\boldsymbol{x},t)d \boldsymbol{x}dt\tag{1}$

should be small.

From the probabilistic viewpoint, the input vector $\boldsymbol{x}$, the target $t$ and the parameters of the function $y$ are all treated as random variables, so the expectation of the loss function is well defined.

Consider the square error loss function $e=(y(\boldsymbol{x})-t)^2$, which is one way to measure the difference between our prediction and the target. Substituting this loss function into equation (1), we have:

$\mathbb E[\ell]=\int\int (y(\boldsymbol{x})-t)^2p(\boldsymbol{x},t)d \boldsymbol{x}dt\tag{2}$

To minimize this expectation, we can use the Euler-Lagrange equation, the fundamental theorem of calculus and Fubini’s theorem.

Fubini’s theorem tells us that we can change the order of integration:

\begin{aligned} \mathbb E[\ell]&=\int\int (y(\boldsymbol{x})-t)^2p(\boldsymbol{x},t)d \boldsymbol{x}dt\\ &=\int\int (y(\boldsymbol{x})-t)^2p(\boldsymbol{x},t)dtd \boldsymbol{x} \end{aligned}\tag{3}

To apply the Euler-Lagrange equation, we first define a new function $G(x,y,y')$:

$G(x,y,y')= \int (y(\boldsymbol{x})-t)^2p(\boldsymbol{x},t)dt\tag{4}$

The Euler-Lagrange equation gives a necessary condition for minimizing equation (2):

$\frac{\partial G}{\partial y}-\frac{d}{dx}\frac{\partial G}{\partial y'}=0\tag{5}$

Because there is no $y'$ component in the function $G$, the second term vanishes, and the equation:

$\frac{\partial G}{\partial y}=0\tag{6}$

becomes the necessary condition to minimize the equation (2):

$2\int (y(\boldsymbol{x})-t)p(\boldsymbol{x},t)dt=0 \tag{7}$

What we want to find is the predictor $y$ that minimizes the loss, so we rearrange equation (7) and obtain the predictor that minimizes the expected square loss:

\begin{aligned} \int (y(\boldsymbol{x})-t)p(\boldsymbol{x},t)dt&=0\\ \int y(\boldsymbol{x})p(\boldsymbol{x},t)dt-\int tp(\boldsymbol{x},t)dt&=0\\ y(\boldsymbol{x})\int p(\boldsymbol{x},t)dt&=\int tp(\boldsymbol{x},t)dt\\ y(\boldsymbol{x})&=\frac{\int tp(\boldsymbol{x},t)dt}{\int p(\boldsymbol{x},t)dt}\\ y(\boldsymbol{x})&=\frac{\int tp(\boldsymbol{x},t)dt}{p(\boldsymbol{x})}\\ y(\boldsymbol{x})&=\int tp(t|\boldsymbol{x})dt\\ y(\boldsymbol{x})&= \mathbb{E}_t[t|\boldsymbol{x}] \end{aligned}\tag{8}

Minimizing the expectation of the square loss function, we find that the expectation of $t$ given $\boldsymbol{x}$ is the optimal solution, that is, $y(\boldsymbol{x})=\mathbb{E}[t|\boldsymbol{x}]$. The expectation of $t$ given $\boldsymbol{x}$ is also called the regression function.

A small summary: under the square loss, $\mathbb{E}[t|\boldsymbol{x}]$ is the optimal choice of $y(\boldsymbol{x})$.
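The claim that the conditional mean minimizes the expected square loss can be checked numerically. The following is a minimal sketch, assuming a toy joint distribution that is not part of the original text ($x$ uniform on $[0,1]$ and $t=\sin(2\pi x)$ plus Gaussian noise); the helper name `expected_square_loss` is hypothetical. It estimates the expectation in equation (2) by Monte Carlo for the conditional mean and for two competing predictors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution p(x, t), assumed only for illustration:
# x ~ Uniform(0, 1), t = sin(2*pi*x) + Gaussian noise with std 0.3.
n = 200_000
x = rng.uniform(0.0, 1.0, size=n)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=n)

def expected_square_loss(predictor):
    """Monte Carlo estimate of E[(y(x) - t)^2], as in equation (2)."""
    return np.mean((predictor(x) - t) ** 2)

# For this toy distribution the conditional mean E[t | x] is sin(2*pi*x).
loss_cond_mean = expected_square_loss(lambda x: np.sin(2 * np.pi * x))
# Two competing predictors: a shifted conditional mean and a constant.
loss_shifted = expected_square_loss(lambda x: np.sin(2 * np.pi * x) + 0.1)
loss_constant = expected_square_loss(lambda x: np.zeros_like(x))

print(loss_cond_mean, loss_shifted, loss_constant)
```

In this setup the loss of the conditional mean should be close to the noise variance ($0.3^2=0.09$), which is the part of the loss that no predictor can remove.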

## Maximum Likelihood Estimation

Generally, we assume that there is a generator behind the data:

$t=g(\boldsymbol{x},\boldsymbol{w})+\epsilon\tag{9}$

where $g(\boldsymbol{x},\boldsymbol{w})$ is a deterministic function, $t$ is the target variable and $\epsilon$ is a zero-mean Gaussian random variable with precision $\beta$, the inverse of the variance. Because adding a deterministic value to a Gaussian variable only shifts its mean, $t$ also has a Gaussian distribution, with mean (expectation) $g(\boldsymbol{x},\boldsymbol{w})$ and precision $\beta$. Recalling the standard form of the Gaussian distribution:

\begin{aligned} \mathbb{P}(t|\boldsymbol{x},\boldsymbol{w},\beta)&=\mathcal{N}(t|g(\boldsymbol{x},\boldsymbol{w}),\beta^{-1})\\ &=\sqrt{\frac{\beta}{2\pi}}\,\mathrm{e}^{-\frac{1}{2}\beta(t-g(\boldsymbol{x},\boldsymbol{w}))^2} \end{aligned}\tag{10}
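As a sanity check on the precision parameterization, here is a small sketch (the function name `gaussian_pdf` and the numeric values are assumptions chosen for illustration). It evaluates the standard Gaussian density $\sqrt{\beta/(2\pi)}\,\mathrm{e}^{-\frac{1}{2}\beta(t-\mu)^2}$ on a fine grid and confirms numerically that it integrates to one and has variance $\beta^{-1}$.

```python
import numpy as np

def gaussian_pdf(t, mean, beta):
    """Density of N(t | mean, 1/beta), parameterized by the precision beta."""
    return np.sqrt(beta / (2.0 * np.pi)) * np.exp(-0.5 * beta * (t - mean) ** 2)

beta = 4.0   # precision, so the variance is 1/beta = 0.25
mean = 1.0   # stands in for g(x, w) at one fixed input

# Riemann sum on a fine grid that covers the mass of the density.
grid = np.linspace(mean - 10.0, mean + 10.0, 200_001)
dx = grid[1] - grid[0]
density = gaussian_pdf(grid, mean, beta)

total_mass = np.sum(density) * dx                      # close to 1
variance = np.sum((grid - mean) ** 2 * density) * dx   # close to 1/beta
print(total_mass, variance)
```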

Our task here is to approximate the generator in equation (9) with a linear function. When we use the square loss function, the optimal solution for this task is $\mathbb{E}[t|\boldsymbol{x}]$, according to equation (8). For the distribution in equation (10), this is:

$\mathbb{E}[t|\boldsymbol{x}]=g(\boldsymbol{x},\boldsymbol{w})\tag{11}$

Then we set our linear model as:

$y(\boldsymbol{x})=\boldsymbol{w}^T\boldsymbol{x}+b\tag{12}$

and this can be rewritten as:

$y(\boldsymbol{x})= \begin{bmatrix} b&\boldsymbol{w}^T \end{bmatrix} \begin{bmatrix} 1\\ \boldsymbol{x} \end{bmatrix}=\boldsymbol{w}_a^T\boldsymbol{x}_a \tag{13}$

For short, we write $\boldsymbol{w}_a$ and $\boldsymbol{x}_a$ simply as $\boldsymbol{w}$ and $\boldsymbol{x}$. The linear model then becomes:

$y(\boldsymbol{x})=\boldsymbol{w}^T\boldsymbol{x}\tag{14}$
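The step from equation (12) to equation (13) can be illustrated with a short sketch. The data, weights and bias below are random toy values (assumptions, not from the text); the point is only that prepending a constant 1 to each input and $b$ to the weight vector reproduces the same predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 toy inputs with 3 features each
w = rng.normal(size=3)        # toy weight vector
b = 0.7                       # toy bias

# Equation (12): linear model with an explicit bias term.
y_explicit = X @ w + b

# Equation (13): absorb the bias by augmenting inputs and weights.
X_a = np.hstack([np.ones((X.shape[0], 1)), X])   # each row is [1, x]
w_a = np.concatenate([[b], w])                   # [b, w]
y_augmented = X_a @ w_a

print(np.allclose(y_explicit, y_augmented))
```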

To get the prediction, we need to find out what $\boldsymbol{w}$ is. As mentioned above, we treat all parameters as random variables, so the conditional distribution of $\boldsymbol{w}$ is $\mathbb{P}(\boldsymbol{w}|\boldsymbol{t},\beta)$. The inputs $X$ (or $\boldsymbol{x}$) are omitted from the conditioning because their distribution is not of interest here. Bayes' theorem tells us:

$\mathbb{P}(\boldsymbol{w}|\boldsymbol{t},\beta)=\frac{\mathbb{P}( \boldsymbol{t}|\boldsymbol{w},\beta) \mathbb{P}(\boldsymbol{w})} {\mathbb{P}(\boldsymbol{t})}=\frac{\text{Likelihood}\times \text{Prior}}{\text{Evidence}}\tag{15}$

We want to find the $\boldsymbol{w}^{\star}$ that maximises the posterior probability $\mathbb{P}(\boldsymbol{w}|\boldsymbol{t},\beta)$. The evidence $\mathbb{P}(\boldsymbol{t})$ does not depend on $\boldsymbol{w}$, and if we also take the prior $\mathbb{P}(\boldsymbol{w})$ to be constant (a flat prior), then maximising the likelihood $\mathbb{P}(\boldsymbol{t}|\boldsymbol{w},\beta)$ also maximises the posterior probability.

\begin{aligned} \mathbb{P}(\boldsymbol{t}|\boldsymbol{w},\beta)&=\prod_{i=0}^{N}\mathcal{N}(t_i|\boldsymbol{w}^T\boldsymbol{x}_i,\beta^{-1})\\ \ln \mathbb{P}(\boldsymbol{t}|\boldsymbol{w},\beta)&=\sum_{i=0}^{N}\ln \mathcal{N}(t_i|\boldsymbol{w}^T\boldsymbol{x}_i,\beta^{-1})\\ &=\sum_{i=0}^{N}\ln \left[\sqrt{\frac{\beta}{2\pi}}\,\mathrm{e}^{-\frac{1}{2}\beta(t_i-\boldsymbol{w}^T\boldsymbol{x}_i)^2}\right]\\ &=\frac{1}{2}\sum_{i=0}^{N} \ln \beta - \sum_{i=0}^{N} \ln \sqrt{2\pi} - \frac{1}{2}\beta\sum_{i=0}^{N}(t_i-\boldsymbol{w}^T\boldsymbol{x}_i)^2 \end{aligned}\tag{16}

This gives us a wonderful result. In the last line of equation (16), only the term $\frac{1}{2}\beta\sum_{i=0}^{N}(t_i-\boldsymbol{w}^T\boldsymbol{x}_i)^2$ depends on $\boldsymbol{w}$ and can be controlled by us, because $\frac{1}{2}\sum_{i=0}^{N} \ln \beta$ and $- \sum_{i=0}^{N} \ln \sqrt{2\pi}$ are fixed by the assumptions. Since this term enters with a negative sign and $\beta>0$, to maximise the likelihood we just need to minimise:

$\sum_{i=0}^{N}(t_i-\boldsymbol{w}^T\boldsymbol{x}_i)^2\tag{17}$

This is exactly the minimization of the sum of squares, so the optimization problem reduces to the least square problem.
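The equivalence can be illustrated numerically. The sketch below uses synthetic data generated under the Gaussian-noise assumption (the sizes, `w_true` and `beta` are arbitrary illustrative choices, and `log_likelihood` is a hypothetical helper). It computes the least square solution with `np.linalg.lstsq` and checks that random perturbations of it never achieve a higher Gaussian log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed): t = w_true^T x + Gaussian noise with precision beta.
n, beta = 200, 25.0
X = np.hstack([np.ones((n, 1)), rng.uniform(-1, 1, size=(n, 2))])  # augmented inputs
w_true = np.array([0.5, -1.0, 2.0])
t = X @ w_true + rng.normal(0.0, beta ** -0.5, size=n)

def log_likelihood(w):
    """Gaussian log-likelihood of the targets for weights w and fixed beta."""
    residual = t - X @ w
    return (0.5 * n * np.log(beta)
            - 0.5 * n * np.log(2 * np.pi)
            - 0.5 * beta * np.sum(residual ** 2))

# Least square solution: minimizes the sum of squared residuals.
w_ls, *_ = np.linalg.lstsq(X, t, rcond=None)

# Randomly perturbed weights should all have a lower log-likelihood.
perturbed = [w_ls + 0.05 * rng.normal(size=3) for _ in range(100)]
ll_best = log_likelihood(w_ls)
ll_perturbed = max(log_likelihood(w) for w in perturbed)

print(w_ls, ll_best, ll_perturbed)
```

Note that the value of `beta` changes the log-likelihood values but not the location of the maximum over $\boldsymbol{w}$.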

## Least Square Estimation and Maximum Likelihood Estimation

When we assume there is a generator:

$t=g(\boldsymbol{x},\boldsymbol{w})+\epsilon\tag{18}$

behind the data, and $\epsilon$ has a zero-mean Gaussian distribution with any precision $\beta$, maximum likelihood estimation reduces to least square estimation. This holds not only for linear regression, because we made no assumption about what $g(\boldsymbol{x},\boldsymbol{w})$ is.

However, when $\epsilon$ follows a distribution other than the Gaussian, the least square estimate is in general no longer the maximum likelihood solution.
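One way to see this is with Laplace-distributed noise, which is an example chosen here for illustration and not discussed in the original text. For the Laplace density the log-likelihood is a constant minus the sum of absolute residuals, so maximum likelihood minimizes absolute errors rather than squared errors. The sketch below assumes the simplest possible model, a constant $t=\mu+\epsilon$, where the least square estimate is the sample mean and the Laplace maximum likelihood estimate is the sample median.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model: t = mu_true + eps, with eps ~ Laplace(0, scale).
mu_true, scale = 2.0, 1.0
t = mu_true + rng.laplace(0.0, scale, size=1001)

def laplace_log_likelihood(mu):
    """Log-likelihood of the data under Laplace noise centered at mu."""
    return -t.size * np.log(2 * scale) - np.sum(np.abs(t - mu)) / scale

mu_least_squares = np.mean(t)      # minimizes the sum of squared residuals
mu_max_likelihood = np.median(t)   # minimizes the sum of absolute residuals

print(laplace_log_likelihood(mu_least_squares),
      laplace_log_likelihood(mu_max_likelihood))
```

The median never has a lower Laplace log-likelihood than the mean on the same data, because it minimizes the sum of absolute residuals.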

1. Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.