Keywords: maximum likelihood estimation
Square Loss Function for Regression
To any input $\mathbf{x}$, our goal in a regression task is to give a prediction $y(\mathbf{x})$ to approximate the target $t$, where the function $y$ is the chosen hypothesis. The difference between $y(\mathbf{x})$ and $t$ could be called ‘error’ or, more precisely, ‘loss’: in an approximation task an ‘error’ occurs by chance and always exists, so ‘loss’ is the better word for this difference. The loss can be written generally as a function $L(t, y(\mathbf{x}))$. Intuitively, the smaller the loss, the better the approximation. So the expectation of the loss:

$$\mathbb{E}[L] = \iint L(t, y(\mathbf{x}))\, p(\mathbf{x}, t)\, \mathrm{d}\mathbf{x}\, \mathrm{d}t \tag{1}$$

should be small.
From the probability viewpoint, the input vector $\mathbf{x}$, the target $t$ and the parameters in the function $y$ are all random variables, so the expectation of the loss function exists.
Consider the square error loss function $L(t, y(\mathbf{x})) = (y(\mathbf{x}) - t)^2$; it is one way to present the difference between our prediction and the target. Substituting this loss function into equation (1), we have:

$$\mathbb{E}[L] = \iint (y(\mathbf{x}) - t)^2\, p(\mathbf{x}, t)\, \mathrm{d}\mathbf{x}\, \mathrm{d}t \tag{2}$$
To minimize this functional, we can use the Euler–Lagrange equation, the fundamental theorem of calculus, and Fubini’s theorem.
Fubini’s theorem tells us that we can change the order of integration:

$$\mathbb{E}[L] = \int \left[ \int (y(\mathbf{x}) - t)^2\, p(\mathbf{x}, t)\, \mathrm{d}t \right] \mathrm{d}\mathbf{x} \tag{3}$$
According to the Euler–Lagrange approach, we first create a new function $F$:

$$F(\mathbf{x}, y) = \int (y(\mathbf{x}) - t)^2\, p(\mathbf{x}, t)\, \mathrm{d}t \tag{4}$$
The Euler–Lagrange equation is used to minimize equation (2):

$$\frac{\partial F}{\partial y} - \frac{\mathrm{d}}{\mathrm{d}\mathbf{x}} \frac{\partial F}{\partial y'} = 0 \tag{5}$$
Because there is no $y'$ component in the function $F$, the equation:

$$\frac{\partial F}{\partial y} = 0$$

becomes the necessary condition for minimizing equation (2):

$$\frac{\partial F}{\partial y} = 2 \int (y(\mathbf{x}) - t)\, p(\mathbf{x}, t)\, \mathrm{d}t = 0 \tag{6}$$
What we want to find to minimize the loss is the predictor $y(\mathbf{x})$, so rearranging equation (6) we get a predictor that minimizes the square loss function:

$$y(\mathbf{x}) = \frac{\int t\, p(\mathbf{x}, t)\, \mathrm{d}t}{p(\mathbf{x})} = \int t\, p(t \mid \mathbf{x})\, \mathrm{d}t = \mathbb{E}[t \mid \mathbf{x}] \tag{7}$$
To minimize the expectation of the square loss function, we find that the expectation of $t$ given $\mathbf{x}$ is the optimum solution to the task, which means the solution is $y(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}]$. The expectation of $t$ given $\mathbf{x}$ is also called the regression function.
A small summary: $\mathbb{E}[t \mid \mathbf{x}]$ is a good prediction of $t$:

$$y(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}] \tag{8}$$
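The result above can be checked numerically. The following is a minimal sketch (the distribution and all names are of our own choosing): among constant predictions $c$, the one minimizing the average square loss $(c - t)^2$ lands on the sample mean, the empirical counterpart of $\mathbb{E}[t \mid \mathbf{x}]$.

```python
import numpy as np

# Hypothetical demo: the constant c minimizing the average square loss
# (c - t)^2 over samples of t is the sample mean of t.
rng = np.random.default_rng(0)
t = rng.gamma(shape=2.0, scale=1.5, size=10_000)   # any skewed target works

candidates = np.linspace(0.0, 10.0, 1001)          # grid of constant predictors
avg_square_loss = ((candidates[:, None] - t[None, :]) ** 2).mean(axis=1)
best = candidates[np.argmin(avg_square_loss)]

print(f"sample mean       : {t.mean():.3f}")
print(f"loss-minimizing c : {best:.3f}")           # lands on the mean
```

Because the target distribution is skewed, the minimizer still tracks the mean, not the mode or the median; the square loss singles out the mean specifically.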
Maximum Likelihood Estimation
Generally, we assume that there is a generator behind the data:

$$t = f(\mathbf{x}) + \varepsilon \tag{9}$$
where the function $f$ is a deterministic function, $t$ is the target variable and $\varepsilon$ is a zero-mean Gaussian random variable with precision $\beta$, which is the inverse variance. Because of the properties of the Gaussian distribution, $t$ has a Gaussian distribution with mean (expectation) $f(\mathbf{x})$ and precision $\beta$. Recalling the standard form of the Gaussian distribution:

$$\mathcal{N}(t \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(t - \mu)^2}{2\sigma^2} \right\} \tag{10}$$

we have $p(t \mid \mathbf{x}) = \mathcal{N}(t \mid f(\mathbf{x}), \beta^{-1})$.
Our task here is to approximate the generator in equation (9) with a linear function. When we use the square loss function, the optimum solution for this task is $\mathbb{E}[t \mid \mathbf{x}]$ according to equation (8), and by equation (10) the solution is:

$$y(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}] = f(\mathbf{x}) \tag{11}$$
Then we set our linear model as:

$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \cdots + w_D x_D \tag{12}$$
and this can be transformed as:

$$y(\mathbf{x}, \mathbf{w}) = (w_0, w_1, \ldots, w_D)\, (1, x_1, \ldots, x_D)^{\mathrm{T}} \tag{13}$$
For short, we just write $(w_0, w_1, \ldots, w_D)^{\mathrm{T}}$ and $(1, x_1, \ldots, x_D)^{\mathrm{T}}$ as $\mathbf{w}$ and $\mathbf{x}$. Then the linear model becomes:

$$y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^{\mathrm{T}} \mathbf{x} \tag{14}$$
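The shorthand above is the usual bias-absorption trick, sketched below with names of our own choosing: prepending a constant 1 to every input turns $w_0 + w_1 x_1 + \cdots + w_D x_D$ into a single dot product $\mathbf{w}^{\mathrm{T}} \mathbf{x}$.

```python
import numpy as np

def predict(X, w):
    """X: (N, D) raw inputs; w: (D + 1,) weights with w[0] as the bias w0."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend the constant 1
    return X_aug @ w                                  # one dot product per row

X = np.array([[1.0, 2.0], [3.0, 4.0]])
w = np.array([0.5, 2.0, -1.0])   # w0 = 0.5, w1 = 2.0, w2 = -1.0
print(predict(X, w))             # same as 0.5 + 2*x1 - 1*x2, i.e. [0.5 2.5]
```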
To get the prediction we need to find out what $\mathbf{w}$ is. As we mentioned above, we consider all the parameters as random variables; then the conditional distribution of $t$ is $p(t \mid \mathbf{x}, \mathbf{w}, \beta)$. Here $\mathbf{x}$ and $\beta$ could be omitted, for their distributions will not concern us. Bayes’ theorem tells us:

$$p(\mathbf{w} \mid t) = \frac{p(t \mid \mathbf{w})\, p(\mathbf{w})}{p(t)} \tag{15}$$
We want to find the $\mathbf{w}$ that maximizes the posterior probability $p(\mathbf{w} \mid t)$. Because $p(\mathbf{w})$ and $p(t)$ are treated as constants, the $\mathbf{w}$ that maximizes the likelihood $p(t \mid \mathbf{w})$ also maximizes the posterior probability.
This gives us a wonderful result. Over a data set of $N$ independent observations $\{(\mathbf{x}_n, t_n)\}_{n=1}^{N}$, the log-likelihood is:

$$\ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi - \frac{\beta}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^{\mathrm{T}} \mathbf{x}_n \right)^2 \tag{16}$$

Only the last part of equation (16) has the component $\mathbf{w}$ that can be controlled by us, because $\beta$ and $N$ are decided by the assumptions. In other words, to maximize the likelihood, we just need to minimize:

$$\sum_{n=1}^{N} \left( t_n - \mathbf{w}^{\mathrm{T}} \mathbf{x}_n \right)^2 \tag{17}$$
This is just minimizing the sum of squares, so the optimization problem goes back to the least squares problem.
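This equivalence can be seen on synthetic data. The sketch below (all names and the data generator are assumptions of ours) fits $\mathbf{w}$ by least squares and then checks that no nearby $\mathbf{w}$ achieves a higher value of the log-likelihood in equation (16).

```python
import numpy as np

# Synthetic data from the assumed generator: t = w^T x + Gaussian noise.
rng = np.random.default_rng(1)
N, beta = 500, 4.0                      # precision beta -> noise variance 1/beta
w_true = np.array([1.0, -2.0, 0.5])     # first entry is the bias w0

X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
t = X @ w_true + rng.normal(scale=beta ** -0.5, size=N)

w_ls, *_ = np.linalg.lstsq(X, t, rcond=None)   # least squares estimate

def log_likelihood(w):
    """Eq. (16) as a function of w (the beta and N terms are constant in w)."""
    return (N / 2) * np.log(beta) - (N / 2) * np.log(2 * np.pi) \
        - (beta / 2) * np.sum((t - X @ w) ** 2)

# Any perturbation of w_ls lowers the likelihood: w_ls is the ML estimate.
for _ in range(5):
    w_other = w_ls + rng.normal(scale=0.05, size=3)
    assert log_likelihood(w_other) < log_likelihood(w_ls)

print(np.round(w_ls, 2))   # close to w_true
```

The likelihood depends on $\mathbf{w}$ only through the sum of squares, so the least squares solution and the maximum likelihood estimate coincide exactly, not just approximately.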
Least Square Estimation and Maximum Likelihood Estimation
When we assume there is a generator:

$$t = f(\mathbf{x}) + \varepsilon \tag{18}$$
behind the data, where $\varepsilon$ has a zero-mean Gaussian distribution with any precision $\beta$, maximum likelihood estimation finally converts to least square estimation. This works not only for linear regression, because we have made no assumption about what $f$ is.
However, when $\varepsilon$ has a distribution that is not Gaussian, least square estimation will no longer be the optimum solution.
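A sketch of this failure mode, under the assumption of Laplace (double-exponential) noise: the Laplace log-likelihood of a location $c$ is $-\sum_n |t_n - c|$ up to constants, so its maximizer is the sample median, while least squares still returns the sample mean. For Laplace noise the median is the noticeably better estimator.

```python
import numpy as np

# Repeatedly estimate a known location under Laplace noise with both the
# mean (least squares answer) and the median (maximum likelihood answer),
# and compare their mean squared estimation errors.
rng = np.random.default_rng(2)
trials, n, true_loc = 2000, 200, 3.0

samples = true_loc + rng.laplace(scale=1.0, size=(trials, n))
mse_mean = np.mean((samples.mean(axis=1) - true_loc) ** 2)          # least squares
mse_median = np.mean((np.median(samples, axis=1) - true_loc) ** 2)  # max likelihood

print(f"MSE of the mean   (least squares)     : {mse_mean:.5f}")
print(f"MSE of the median (maximum likelihood): {mse_median:.5f}")
```

Asymptotically the mean has twice the variance of the median under Laplace noise, which is roughly what the printed numbers show.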