Newton's Method

Keywords: Newton’s method

Newton’s Method[1]

The Taylor series gives us conditions for minimum points based on both the first-order and second-order terms. The first-order approximation of a performance index function produced a powerful algorithm for locating minimum points in the whole parameter space, which we call the ‘steepest descent algorithm’. Now we want to look into the second-order approximation of a function to find out whether it can also guide us to the minimum points. The approximation of $F(\boldsymbol{x}_{k+1})$ is:

$$\begin{aligned} F(\boldsymbol{x}_{k+1})&=F(\boldsymbol{x}_k+\Delta \boldsymbol{x}_k)\\ &\approx F(\boldsymbol{x}_k)+\boldsymbol{g}^T_k\Delta \boldsymbol{x}_k+\frac{1}{2}\Delta \boldsymbol{x}^T_kA_k\Delta \boldsymbol{x}_k \end{aligned}\tag{1}$$

Here $\boldsymbol{g}_k$ and $A_k$ denote the gradient and the Hessian matrix of $F$ evaluated at $\boldsymbol{x}_k$. The principle that lets the Taylor approximation help us locate minimum points is that the stationary point of the approximation is close to the stationary point of the function: the closer the approximation is to the function, the closer their stationary points are. Equation (1) is a quadratic function of $\Delta\boldsymbol{x}_k$, so its gradient with respect to $\Delta\boldsymbol{x}_k$ has the standard form of the gradient of a quadratic function:

$$\nabla F(\boldsymbol{x}_{k+1})\approx A_k\Delta\boldsymbol{x}_k+\boldsymbol{g}_k\tag{2}$$

To get the stationary point, we set equation (2) equal to $0$:

$$\begin{aligned} A_k\Delta \boldsymbol{x}_k+\boldsymbol{g}_k&=0\\ \Delta \boldsymbol{x}_k&=-A_k^{-1}\boldsymbol{g}_k \end{aligned}\tag{3}$$

Then, following the iterative algorithm framework, we update $\boldsymbol{x}_k$; the resulting rule is known as Newton’s method:

$$\boldsymbol{x}_{k+1}=\boldsymbol{x}_k+\Delta \boldsymbol{x}_k=\boldsymbol{x}_k-A_k^{-1}\boldsymbol{g}_k\tag{4}$$
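
As a concrete illustration of equation (4), here is a minimal NumPy sketch (my own code, not from the reference; the names `newton_step`, `newton`, `grad`, and `hess` are assumptions):

```python
import numpy as np

def newton_step(x, grad, hess):
    """One Newton update, x_{k+1} = x_k - A_k^{-1} g_k (equation (4)).

    grad(x) returns the gradient g_k and hess(x) the Hessian A_k at x.
    Solving the linear system A_k * dx = g_k avoids forming the inverse explicitly.
    """
    return x - np.linalg.solve(hess(x), grad(x))

def newton(x0, grad, hess, tol=1e-8, max_iter=50):
    """Repeat Newton steps until the gradient is numerically zero."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        if np.linalg.norm(grad(x)) < tol:
            break
        x = newton_step(x, grad, hess)
    return x
```

The quadratic example below applies exactly this update.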

Now it is time to look at an example:

$$F(\boldsymbol{x})=x^2_1+25x_2^2\tag{5}$$

The gradient of equation (5) is:

$$\nabla F(\boldsymbol{x})=\begin{bmatrix}\frac{\partial}{\partial x_1}F(\boldsymbol{x})\\ \frac{\partial}{\partial x_2}F(\boldsymbol{x})\end{bmatrix}=\begin{bmatrix}2x_1\\ 50x_2\end{bmatrix}\tag{6}$$

The Hessian matrix is:

$$\nabla^2 F(\boldsymbol{x})=\begin{bmatrix}\frac{\partial^2}{\partial x_1^2}F(\boldsymbol{x})&\frac{\partial^2}{\partial x_1 \partial x_2}F(\boldsymbol{x})\\ \frac{\partial^2}{\partial x_2\partial x_1}F(\boldsymbol{x})&\frac{\partial^2}{\partial x_2^2}F(\boldsymbol{x})\end{bmatrix}=\begin{bmatrix}2&0\\ 0&50\end{bmatrix}\tag{7}$$

Then we update $\boldsymbol{x}$ as in equation (4), starting from the initial point $\boldsymbol{x}_0=\begin{bmatrix}1\\1\end{bmatrix}$:

$$\begin{aligned} \boldsymbol{x}_1&=\boldsymbol{x}_0-A_0^{-1}\boldsymbol{g}_0\\ &=\begin{bmatrix}1\\1\end{bmatrix}-\begin{bmatrix}0.5&0\\0&0.02\end{bmatrix}\begin{bmatrix}2\\50\end{bmatrix}\\ &=\begin{bmatrix}0\\0\end{bmatrix} \end{aligned}\tag{8}$$

We then test $\boldsymbol{x}_{1}=\begin{bmatrix}0\\0\end{bmatrix}$ against the termination condition; since the gradient there is zero, the algorithm stops.
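
The one-step result in equation (8) is easy to check numerically. A small sketch (my own, using NumPy; the variable names are assumptions):

```python
import numpy as np

# Quadratic example of equation (5): F(x) = x1^2 + 25*x2^2
grad = lambda x: np.array([2 * x[0], 50 * x[1]])       # gradient, equation (6)
hess = lambda x: np.array([[2.0, 0.0], [0.0, 50.0]])   # Hessian, equation (7)

x0 = np.array([1.0, 1.0])
x1 = x0 - np.linalg.solve(hess(x0), grad(x0))  # Newton update, equation (4)
print(x1)  # [0. 0.] -- the minimum, reached in a single step as in equation (8)
```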

This one-step convergence is not a coincidence: Newton’s method always finds the minimum of a quadratic function in one step, because the second-order Taylor approximation of a quadratic function is just the function itself written in another form. The two therefore share the same minimum, and the stationary point of a quadratic function can be computed analytically. However, when $F(\boldsymbol{x})$ is not quadratic, the method may:

  1. not converge in one step, and
  2. not be guaranteed to converge at all; whether it converges depends on both the function and the initial guess.

Let’s consider a function that is not quadratic:

$$F(\boldsymbol{x})=(x_2-x_1)^4+8x_1x_2-x_1+x_2+3\tag{9}$$
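
For later reference (these expressions are not written out in the text above, but follow directly from differentiating equation (9)), the gradient and Hessian are:

$$\nabla F(\boldsymbol{x})=\begin{bmatrix}-4(x_2-x_1)^3+8x_2-1\\ 4(x_2-x_1)^3+8x_1+1\end{bmatrix},\qquad \nabla^2 F(\boldsymbol{x})=\begin{bmatrix}12(x_2-x_1)^2 & -12(x_2-x_1)^2+8\\ -12(x_2-x_1)^2+8 & 12(x_2-x_1)^2\end{bmatrix}$$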

This function has a local minimum, a global minimum, and a saddle point.

We start with the initial point $\boldsymbol{x}_0=\begin{bmatrix}1.5\\0\end{bmatrix}$.

Then we keep updating the position $\boldsymbol{x}_k$ with equation (4), and we have:

[Figure: the Newton iterations from this initial point; the right side of the figure shows the second-order Taylor approximation at each step.]

If we initialize the process at the point $\boldsymbol{x}_0=\begin{bmatrix}0.75\\0.75\end{bmatrix}$, it looks like this:

[Figure: the Newton iterations from this initial point; the right side of the figure shows the second-order Taylor approximation at each step.]

This time the algorithm converges to the saddle point.
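
To make these two runs concrete, here is a small sketch of the iterations (my own code; the helper names `grad9`, `hess9`, and `newton9` are assumptions, and the derivatives are the ones written out after equation (9)):

```python
import numpy as np

def grad9(x):
    """Gradient of equation (9)."""
    d = x[1] - x[0]
    return np.array([-4 * d**3 + 8 * x[1] - 1,
                      4 * d**3 + 8 * x[0] + 1])

def hess9(x):
    """Hessian of equation (9)."""
    h = 12 * (x[1] - x[0]) ** 2
    return np.array([[h, -h + 8],
                     [-h + 8, h]])

def newton9(x0, tol=1e-8, max_iter=50):
    """Apply the Newton update of equation (4) until the gradient vanishes."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad9(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess9(x), g)
    return x

print(newton9([1.5, 0.0]))    # ends near ( 0.55, -0.55), a minimum (the global one)
print(newton9([0.75, 0.75]))  # ends near (-0.13,  0.13), the saddle point
```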

Newton’s method converges quickly in many applications, because an analytic function can be accurately approximated by a quadratic in a small neighborhood of a strong minimum. However, Newton’s method cannot distinguish between local minima, global minima, and saddle points, because the quadratic model it builds has only a single stationary point. When we use a quadratic approximation, we see only that quadratic and throw away the parts of the original function that the quadratic cannot capture; what we see through the approximation is therefore only a small region containing one stationary point, and we cannot tell what kind of stationary point it is. As a result, Newton’s method can also produce very unpredictable results.
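
Since the quadratic model alone cannot tell these stationary points apart, one practical check (my own addition, not part of the original text) is to inspect the eigenvalues of the true Hessian at the point the iteration returns, reusing the `hess9` helper from the sketch above:

```python
import numpy as np

def classify(x, hess):
    """Label a stationary point by the signs of the Hessian eigenvalues."""
    eig = np.linalg.eigvalsh(hess(x))
    if np.all(eig > 0):
        return "minimum"
    if np.all(eig < 0):
        return "maximum"
    return "saddle point"

# Assumes hess9 from the previous sketch is in scope.
print(classify(np.array([0.55, -0.55]), hess9))   # minimum
print(classify(np.array([-0.13, 0.13]), hess9))   # saddle point
```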

In the example above, if the initial point is $\boldsymbol{x}_0=\begin{bmatrix}1.15\\0.75\end{bmatrix}$, which is far from any minimum, the algorithm still converges to a minimum point,

[Figure: the Newton iterations from this initial point; the right side of the figure shows the second-order Taylor approximation at each step.]

while the initial point $\boldsymbol{x}_0=\begin{bmatrix}0.75\\0.75\end{bmatrix}$, which is closer to the minimum, converges to the saddle point.
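
Running the same sketch from this third initial point (again assuming the `newton9` helper defined earlier is in scope) illustrates the unpredictability:

```python
# Assumes newton9 from the earlier sketch is in scope.
print(newton9([1.15, 0.75]))  # ends near (-0.42, 0.42), a (local) minimum
```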

Conclusion

Newton’s method:

  1. is faster than steepest descent;
  2. is quite complex;
  3. may oscillate or diverge (steepest descent is guaranteed to converge when the learning rate is suitable);
  4. has a variation that is suited to neural networks;
  5. requires computing and storing the Hessian matrix and its inverse, which can easily exhaust our computational resources;
  6. degenerates to the steepest descent method when $A_k=A_k^{-1}=I$ (see the short derivation below).
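
The last point follows directly from equation (4): substituting $A_k=A_k^{-1}=I$ gives

$$\boldsymbol{x}_{k+1}=\boldsymbol{x}_k-I\,\boldsymbol{g}_k=\boldsymbol{x}_k-\boldsymbol{g}_k$$

which is the steepest descent update with a learning rate of $1$.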

References


  1. Demuth, H.B., Beale, M.H., De Jesús, O. and Hagan, M.T., 2014. Neural Network Design. Martin Hagan.