Steepest Descent Method

Keywords: steepest descent

Direction Based Algorithm and a Variation[1]

‘An Introduction to Performance Optimization’ had given us a brief introduction to the optimization algorithm. And this post describes a typical direction(pk\boldsymbol{p}_k) based algorithm and its variation gives an algorithm which is a step-length(αk\alpha_k) based algorithm.

Steepest Descent

To find the minimum points of a performance index by an iterative algorithm, we want to decrease the value of performance index step by step which looks like going down from the mountain. And the key point of this algorithm is every step is going down. Every iteration makes the performance index decrease:

F(xk+1)<F(xk)(1)F(\boldsymbol{x}_{k+1})<F(\boldsymbol{x}_{k})\tag{1}

Our mission is to find the direction pk\boldsymbol{p}_k with a relatively short step length αk\alpha_k which leads us downhill.

The first-order Taylor series of an iterative step is:

F(xk+1)=F(xk+Δxk)F(xk)+gTΔxk(2)F(\boldsymbol{x}_{k+1})=F(\boldsymbol{x}_{k}+\Delta \boldsymbol{x}_k)\approx F(\boldsymbol{x}_{k})+\boldsymbol{g}^T\Delta\boldsymbol{x}_k\tag{2}

where gk\boldsymbol{g}_k is the gradient at position x\boldsymbol{x} of the performance index F(x)F(\boldsymbol{x}) which means:

gk=F(x)x=xk(3)\boldsymbol{g}_k = \nabla F(\boldsymbol{x})\bigg |_{\boldsymbol{x}=\boldsymbol{x}_k}\tag{3}

From equation(1) and equation(2) for the purpose F(xk+1)<F(xk)F(\boldsymbol{x}_{k+1})<F(\boldsymbol{x}_{k}), we need:

gTΔxk<0(4)\boldsymbol{g}^T\Delta\boldsymbol{x}_k<0\tag{4}

( Δxk\Delta\boldsymbol{x}_k , the change of xk\boldsymbol{x}_k, can also be represented by step length αk\alpha_k and direction p\boldsymbol{p}, then the equation(4) has a equivalent form:

αkgTΔpk<0(5)\alpha_k\boldsymbol{g}^T\Delta\boldsymbol{p}_k<0\tag{5}

In the previous article, we have seen how to find the greatest value of gTΔpk\boldsymbol{g}^T\Delta\boldsymbol{p}_k then we can find the smallest value of gTΔpk\boldsymbol{g}^T\Delta\boldsymbol{p}_k in the same way. Then we got the smallest value this item is:

gTg(6)-\boldsymbol{g}^T\boldsymbol{g}\tag{6}

which means that the deepest direction we would search is:

Δpk=g(7)\Delta\boldsymbol{p}_k=-\boldsymbol{g}\tag{7}

According the iterative optimization algorithm frame work’An Introduction to Performance Optimization’, the second step is :

xk+1=xkαkgk(8)\boldsymbol{x}_{k+1}=\boldsymbol{x}_{k}-\alpha_k\boldsymbol{g}_k\tag{8}

The step length is also called learning rate. And the choice of αk\alpha_k can be:

  1. Minimizing F(x)F(\boldsymbol{x}) with αk\alpha_k by minimizing along the line: xkαkgk\boldsymbol{x}_k-\alpha_k\boldsymbol{g}_k
  2. Fixed αk\alpha_k, such as αk=0.002\alpha_k=0.002
  3. Predetermined, like αk=1k\alpha_k=\frac{1}{k}

An example:

F(x)=x12+25x22(9)F(\boldsymbol{x})=x_1^2+25x_2^2\tag{9}

  1. start at point x=[0.50.5]x=\begin{bmatrix}0.5\\0.5\end{bmatrix}
  2. gradient of F(x)F(\boldsymbol{x}) is F(x)=[Fx1Fx2]=[2x150x2]\nabla F(\boldsymbol{x})=\begin{bmatrix}\frac{\partial F}{\partial x_1}\\\frac{\partial F}{\partial x_2}\end{bmatrix}=\begin{bmatrix}2x_1\\50x_2\end{bmatrix}
  3. g0=F(x)x=x0=[125]\boldsymbol{g}_0=\nabla F(\boldsymbol{x})\bigg|_{\boldsymbol{x}=\boldsymbol{x}_0}=\begin{bmatrix}1\\25\end{bmatrix}
  4. set α=0.01\alpha = 0.01
  5. update: x1=x00.01g0=[0.50.5]0.01[125]=[0.490.25]\boldsymbol{x}_1=\boldsymbol{x}_0-0.01\boldsymbol{g}_0=\begin{bmatrix}0.5\\0.5\end{bmatrix}-0.01\begin{bmatrix}1\\25\end{bmatrix}=\begin{bmatrix}0.49\\0.25\end{bmatrix}
  6. update: x2=x10.01g1=[0.490.25]0.01[0.9812.5]=[0.48020.125]\boldsymbol{x}_2=\boldsymbol{x}_1-0.01\boldsymbol{g}_1=\begin{bmatrix}0.49\\0.25\end{bmatrix}-0.01\begin{bmatrix}0.98\\12.5\end{bmatrix}=\begin{bmatrix}0.4802\\0.125\end{bmatrix}
  7. go on updating until the smallest point is achieved.

The whole trajectory looks like:

and the learning rate, step length is a constant in this algorithm, however, we can test 2 different values and watch their behavior. When we set α=0.01\alpha=0.01, we have:

and when we set α=0.02\alpha=0.02, we have:

The gif illustrates these:

  1. At first several steps, descent speed is faster than later steps
  2. A greater learning rate seems to have a higher speed
  3. This algorithm can converge to the minimum point.

The second point gives us a new idea, what will the algorithms do when we have a relatively bigger learning rate. To be on the safe side we select a not so big learning rate α=0.05\alpha=0.05, we have:

this algorithm diverges, which means it would never stop at the minimum and get farther and farther as steps going on. So we have to take care of the value of learning rate, a small learning rate can slow down the algorithm but a big one can break up the algorithm.

Stable Learning Rates

To have a fast speed and converge to the minimu, we need to study the learning rate α\alpha. To simplify the problem, we start with supposing the performance index is a quadratic funtion:

F(x)=12xTAx+dTx+c(10)F(\boldsymbol{x})=\frac{1}{2}\boldsymbol{x}^TA\boldsymbol{x}+\boldsymbol{d}^T\boldsymbol{x}+c\tag{10}

and we have already known its gradient is:

F(x)=Ax+d(11)\nabla F(\boldsymbol{x})=A \boldsymbol{x}+\boldsymbol{d}\tag{11}

we take equation(11) into update step equation(8), we have:

xk+1=xkαgk=xkα(Axk+d)(12)\boldsymbol{x}_{k+1}=\boldsymbol{x}_k-\alpha\boldsymbol{g}_k=\boldsymbol{x}_k-\alpha(A \boldsymbol{x}_k+\boldsymbol{d})\tag{12}

or an equivelant form:

xk+1=(IαA)xkαd(13)\boldsymbol{x}_{k+1}=(I-\alpha A)\boldsymbol{x}_k- \alpha\boldsymbol{d}\tag{13}

In the linear algebra course or other courses, this equation is called ‘linear dynamic system’. To make the system stable, the eigenvalues of IαAI-\alpha A are less than one in magnitude. And IαAI-\alpha A has the same eigenvectors with AA. Let [λ1,λ2,,λn][\lambda_1,\lambda_2,\cdots,\lambda_n] be the eigenvalues of AA and let [z1,z2,,zn][\boldsymbol{z}_1,\boldsymbol{z}_2,\cdots,\boldsymbol{z}_n] be the eigenvectors of AA

[IαA]zi=ziαAzi=ziαλizi=(1αλi)zi(14)[I-\alpha A]\boldsymbol{z}_i=\boldsymbol{z}_i-\alpha A\boldsymbol{z}_i=\boldsymbol{z}_i-\alpha \lambda_i\boldsymbol{z}_i=(1-\alpha\lambda_i)\boldsymbol{z}_i\tag{14}

IαAI-\alpha A has the same eigenvectors with AA and has the eigenvalues: [1αλ1,1αλ2,,1αλn][1-\alpha\lambda_1,1-\alpha\lambda_2,\cdots,1-\alpha\lambda_n]

Concerning the equation(13) and eigenvalues of IαAI-\alpha A to stabilize the system which here is the steepest descent algorithm, we need:

1αλi<11<1αλi<12<αλi<0(15)\begin{aligned} &|1-\alpha \lambda_i|&<1\\ -1&<1-\alpha \lambda_i&<1\\ -2&<-\alpha \lambda_i&<0\\ \end{aligned}\tag{15}

for α>0\alpha>0

2α>λi>0(16)\begin{aligned} \frac{2}{\alpha}&> \lambda_i&>0\\ \end{aligned}\tag{16}

frome equation(16), we finally have

2λi>α(17)\frac{2}{\lambda_i}>\alpha\tag{17}

which implies:

2λmax>α(18)\frac{2}{\lambda_{\text{max}}}>\alpha\tag{18}

This gives us the maximum stable learning rate is inversely proportional to the maximum curvature(direction along with eigenvector according to the maximum eigenvalue λmax\lambda_\text{max}) of the quadratic function.

Let’s go back to the example, the Hessian matrix of the equation(9) is:

A=[20050](19)A=\begin{bmatrix} 2&0\\0&50 \end{bmatrix}\tag{19}

its eigenvectors and eigenvalues are:

{λ1=2,z1=[10]},{λ2=50,z2=[01]}(20)\{\lambda_1=2,\boldsymbol{z}_1=\begin{bmatrix}1\\0\end{bmatrix}\},\{\lambda_2=50,\boldsymbol{z}_2=\begin{bmatrix}0\\1\end{bmatrix}\}\tag{20}

taking λmax=λ2=50\lambda_\text{max}=\lambda_2=50 into equation(18), we get:

αmax<250=0.04(21)\alpha_\text{max}<\frac{2}{50}=0.04\tag{21}

so, let’s check the behavior of the algorithm when α=0.039\alpha=0.039 and α=0.041\alpha=0.041:

  1. set α=0.039\alpha=0.039 we have:
  2. set α=0.041\alpha=0.041 we have:

Up to now, both concepts and implement have been built to prove the correctness of the algorithm. And we also have the following tips:

  1. The algorithm tends to converge most quickly in the direction of the eigenvector corresponding to the largest eigenvalue
  2. The algorithm tends to converge most slowly in the direction of the eigenvector corresponding to the smallest eigenvalue
  3. Do not overshoot the minimum point for the too-long step(learning rate α\alpha)
  4. If the distinction between λmin\lambda_\text{min} and λmax\lambda_\text{max}, the algorithm tends to converge slowly.

Minimizing along a Line

The section above gives us the upper bound of α\alpha, but we have three kinds of strategies of selecting a α\alpha and the first one is "Minimizing F(x)F(\boldsymbol{x}) with αk\alpha_k by minimizing along the line: xkαkgk\boldsymbol{x}_k-\alpha_k\boldsymbol{g}_k". What shall we do with this kind of α\alpha?

To arbitrary functions, the stationary point along a direction can be calculated by:

dFdαk(xk+αkpk)=F(x)Tx=xkpk+αkpkTF2(x)Tx=xkpk=0(22)\begin{aligned} &\frac{dF}{d\alpha_k}(\boldsymbol{x}_k+\alpha_k\boldsymbol{p}_k)\\ &=\nabla F(\boldsymbol{x})^T\bigg|_{\boldsymbol{x}=\boldsymbol{x}_k}\boldsymbol{p}_k+\alpha_k\boldsymbol{p}_k^T\nabla F^2(\boldsymbol{x})^T\bigg|_{\boldsymbol{x}=\boldsymbol{x}_k}\boldsymbol{p}_k\\ &=0 \end{aligned}\tag{22}

and then

α=F(x)Tx=xkpkpkTF2(x)x=xkpk=gkTpkpkTAkpk(23)\alpha=-\frac{\nabla F(\boldsymbol{x})^T\bigg|_{\boldsymbol{x}=\boldsymbol{x}_k}\boldsymbol{p}_k}{\boldsymbol{p}_k^T\nabla F^2(\boldsymbol{x})\bigg|_{\boldsymbol{x}=\boldsymbol{x}_k}\boldsymbol{p}_k}=-\frac{\boldsymbol{g}^T_k\boldsymbol{p}_k}{\boldsymbol{p}_k^TA_k\boldsymbol{p}_k}\tag{23}

where: AkA_k is the Hessian matrix of old guess xk\boldsymbol{x}_k:

Ak=2F(x)x=xk(24)A_k=\nabla^2F(\boldsymbol{x})\bigg|_{\boldsymbol{x}=\boldsymbol{x}_k}\tag{24}

Here we look at an example:

F(x)=12xT[2112]x(25)F(x)=\frac{1}{2}\boldsymbol{x}^T\begin{bmatrix} 2&1\\1&2 \end{bmatrix}\boldsymbol{x}\tag{25}

with the initial guess

x0=[0.80.25](26)\boldsymbol{x}_0=\begin{bmatrix}0.8\\-0.25\end{bmatrix}\tag{26}

the gradient of the function is

F(x)=[2x1+x2x1+2x2](27)\nabla F(\boldsymbol{x})=\begin{bmatrix}2x_1+x_2\\x_1+2x_2\end{bmatrix}\tag{27}

the initial direction of the algorithm is

p0=g0=F(x)x=xk=[1.350.3](28)\boldsymbol{p}_0=-\boldsymbol{g}_0=-\nabla F(\boldsymbol{x})\bigg|_{\boldsymbol{x}=\boldsymbol{x}_k}=\begin{bmatrix}-1.35\\-0.3\end{bmatrix}\tag{28}

and take equation(28) and A=[2112]A=\begin{bmatrix}2&1\\1&2\end{bmatrix}into equation(23):

α0=[1.350.3][2112][1.350.3][2112][1.350.3]=0.413(29)\alpha_0=\frac{\begin{bmatrix}1.35&0.3\end{bmatrix}\begin{bmatrix}2&1\\1&2\end{bmatrix}}{\begin{bmatrix}1.35&0.3\end{bmatrix}\begin{bmatrix}2&1\\1&2\end{bmatrix}\begin{bmatrix}1.35\\0.3\end{bmatrix}}=0.413\tag{29}

and take all these data into equation(8):

x1=x0α0g0=[0.80.25]0.413[1.350.3]=[0.240.37](30)\boldsymbol{x}_1=\boldsymbol{x}_0-\alpha_0\boldsymbol{g}_0=\begin{bmatrix}0.8\\-0.25\end{bmatrix}-0.413\begin{bmatrix}1.35\\0.3\end{bmatrix}=\begin{bmatrix}0.24\\-0.37\end{bmatrix}\tag{30}

What is going on is repeating the steps: equation(26) to equation(30) until some terminative conditions are achieved. It works like:

All the new directions in the steps are othogonal to their last direction:

pk+1Tpk=gk+1Tpk=0(31)\boldsymbol{p}_{k+1}^T\boldsymbol{p}_k=\boldsymbol{g}_{k+1}^T\boldsymbol{p}_k=0\tag{31}

what we need now is to proof is gk+1Tpk=0\boldsymbol{g}_{k+1}^T\boldsymbol{p}_k=0, with the chain rule of equation(22):

dFdαk(xk+1)=dFdαk(xk+αkpk)=F(x)Tx=xk+1ddαk(xk+αkpk)=F(x)Tx=xk+1pk=0(32)\begin{aligned} \frac{dF}{d\alpha_k}(\boldsymbol{x}_{k+1})&=\frac{dF}{d\alpha_k}(\boldsymbol{x}_k+\alpha_k\boldsymbol{p}_k)\\ &=\nabla F(\boldsymbol{x})^T\bigg|_{\boldsymbol{x}=\boldsymbol{x}_{k+1}}\frac{d}{d\alpha_k}(\boldsymbol{x}_k+\alpha_k\boldsymbol{p}_k)\\ &=\nabla F(\boldsymbol{x})^T\bigg|_{\boldsymbol{x}=\boldsymbol{x}_{k+1}}\boldsymbol{p}_k\\ &=0 \end{aligned}\tag{32}

equation(32) gives us: a new direction is always orthogonal to the last step direction at the minimum point of the function along the last step direction. This is also an inspiration for another algorithm called conjugate direction.

References


  1. Demuth, H.B., Beale, M.H., De Jess, O. and Hagan, M.T., 2014. Neural network design. Martin Hagan. ↩︎