# Steepest Descent Method

**Keywords:** steepest descent

## Direction Based Algorithm and a Variation^{[1]}

‘An Introduction to Performance Optimization’ had given us a brief introduction to the optimization algorithm. And this post describes a typical direction($\boldsymbol{p}_k$) based algorithm and its variation gives an algorithm which is a step-length($\alpha_k$) based algorithm.

## Steepest Descent

To find the minimum points of a performance index by an iterative algorithm, we want to decrease the value of performance index step by step which looks like going down from the mountain. And the key point of this algorithm is every step is going down. Every iteration makes the performance index decrease:

$F(\boldsymbol{x}_{k+1})<F(\boldsymbol{x}_{k})\tag{1}$

Our mission is to find the direction $\boldsymbol{p}_k$ with a relatively short step length $\alpha_k$ which leads us downhill.

The first-order Taylor series of an iterative step is:

$F(\boldsymbol{x}_{k+1})=F(\boldsymbol{x}_{k}+\Delta \boldsymbol{x}_k)\approx F(\boldsymbol{x}_{k})+\boldsymbol{g}^T\Delta\boldsymbol{x}_k\tag{2}$

where $\boldsymbol{g}_k$ is the gradient at position $\boldsymbol{x}$ of the performance index $F(\boldsymbol{x})$ which means:

$\boldsymbol{g}_k = \nabla F(\boldsymbol{x})\bigg |_{\boldsymbol{x}=\boldsymbol{x}_k}\tag{3}$

From equation(1) and equation(2) for the purpose $F(\boldsymbol{x}_{k+1})<F(\boldsymbol{x}_{k})$, we need:

$\boldsymbol{g}^T\Delta\boldsymbol{x}_k<0\tag{4}$

( $\Delta\boldsymbol{x}_k$ , the change of $\boldsymbol{x}_k$, can also be represented by step length $\alpha_k$ and direction $\boldsymbol{p}$, then the equation(4) has a equivalent form:

$\alpha_k\boldsymbol{g}^T\Delta\boldsymbol{p}_k<0\tag{5}$

In the previous article, we have seen how to find the greatest value of $\boldsymbol{g}^T\Delta\boldsymbol{p}_k$ then we can find the smallest value of $\boldsymbol{g}^T\Delta\boldsymbol{p}_k$ in the same way. Then we got the smallest value this item is:

$-\boldsymbol{g}^T\boldsymbol{g}\tag{6}$

which means that the deepest direction we would search is:

$\Delta\boldsymbol{p}_k=-\boldsymbol{g}\tag{7}$

According the iterative optimization algorithm frame work’An Introduction to Performance Optimization’, the second step is :

$\boldsymbol{x}_{k+1}=\boldsymbol{x}_{k}-\alpha_k\boldsymbol{g}_k\tag{8}$

The step length is also called learning rate. And the choice of $\alpha_k$ can be:

- Minimizing $F(\boldsymbol{x})$ with $\alpha_k$ by minimizing along the line: $\boldsymbol{x}_k-\alpha_k\boldsymbol{g}_k$
- Fixed $\alpha_k$, such as $\alpha_k=0.002$
- Predetermined, like $\alpha_k=\frac{1}{k}$

An example:

$F(\boldsymbol{x})=x_1^2+25x_2^2\tag{9}$

- start at point $x=\begin{bmatrix}0.5\\0.5\end{bmatrix}$
- gradient of $F(\boldsymbol{x})$ is $\nabla F(\boldsymbol{x})=\begin{bmatrix}\frac{\partial F}{\partial x_1}\\\frac{\partial F}{\partial x_2}\end{bmatrix}=\begin{bmatrix}2x_1\\50x_2\end{bmatrix}$
- $\boldsymbol{g}_0=\nabla F(\boldsymbol{x})\bigg|_{\boldsymbol{x}=\boldsymbol{x}_0}=\begin{bmatrix}1\\25\end{bmatrix}$
- set $\alpha = 0.01$
- update: $\boldsymbol{x}_1=\boldsymbol{x}_0-0.01\boldsymbol{g}_0=\begin{bmatrix}0.5\\0.5\end{bmatrix}-0.01\begin{bmatrix}1\\25\end{bmatrix}=\begin{bmatrix}0.49\\0.25\end{bmatrix}$
- update: $\boldsymbol{x}_2=\boldsymbol{x}_1-0.01\boldsymbol{g}_1=\begin{bmatrix}0.49\\0.25\end{bmatrix}-0.01\begin{bmatrix}0.98\\12.5\end{bmatrix}=\begin{bmatrix}0.4802\\0.125\end{bmatrix}$
- go on updating until the smallest point is achieved.

The whole trajectory looks like:

and the learning rate, step length is a constant in this algorithm, however, we can test 2 different values and watch their behavior. When we set $\alpha=0.01$, we have:

and when we set $\alpha=0.02$, we have:

The gif illustrates these:

- At first several steps, descent speed is faster than later steps
- A greater learning rate seems to have a higher speed
- This algorithm can converge to the minimum point.

The second point gives us a new idea, what will the algorithms do when we have a relatively bigger learning rate. To be on the safe side we select a not so big learning rate $\alpha=0.05$, we have:

this algorithm diverges, which means it would never stop at the minimum and get farther and farther as steps going on. So we have to take care of the value of learning rate, a small learning rate can slow down the algorithm but a big one can break up the algorithm.

## Stable Learning Rates

To have a fast speed and converge to the minimu, we need to study the learning rate $\alpha$. To simplify the problem, we start with supposing the performance index is a quadratic funtion:

$F(\boldsymbol{x})=\frac{1}{2}\boldsymbol{x}^TA\boldsymbol{x}+\boldsymbol{d}^T\boldsymbol{x}+c\tag{10}$

and we have already known its gradient is:

$\nabla F(\boldsymbol{x})=A \boldsymbol{x}+\boldsymbol{d}\tag{11}$

we take equation(11) into update step equation(8), we have:

$\boldsymbol{x}_{k+1}=\boldsymbol{x}_k-\alpha\boldsymbol{g}_k=\boldsymbol{x}_k-\alpha(A \boldsymbol{x}_k+\boldsymbol{d})\tag{12}$

or an equivelant form:

$\boldsymbol{x}_{k+1}=(I-\alpha A)\boldsymbol{x}_k- \alpha\boldsymbol{d}\tag{13}$

In the linear algebra course or other courses, this equation is called ‘linear dynamic system’. To make the system stable, the eigenvalues of $I-\alpha A$ are less than one in magnitude. And $I-\alpha A$ has the same eigenvectors with $A$. Let $[\lambda_1,\lambda_2,\cdots,\lambda_n]$ be the eigenvalues of $A$ and let $[\boldsymbol{z}_1,\boldsymbol{z}_2,\cdots,\boldsymbol{z}_n]$ be the eigenvectors of $A$

$[I-\alpha A]\boldsymbol{z}_i=\boldsymbol{z}_i-\alpha A\boldsymbol{z}_i=\boldsymbol{z}_i-\alpha \lambda_i\boldsymbol{z}_i=(1-\alpha\lambda_i)\boldsymbol{z}_i\tag{14}$

$I-\alpha A$ has the same eigenvectors with $A$ and has the eigenvalues: $[1-\alpha\lambda_1,1-\alpha\lambda_2,\cdots,1-\alpha\lambda_n]$

Concerning the equation(13) and eigenvalues of $I-\alpha A$ to stabilize the system which here is the steepest descent algorithm, we need:

$\begin{aligned} &|1-\alpha \lambda_i|&<1\\ -1&<1-\alpha \lambda_i&<1\\ -2&<-\alpha \lambda_i&<0\\ \end{aligned}\tag{15}$

for $\alpha>0$

$\begin{aligned} \frac{2}{\alpha}&> \lambda_i&>0\\ \end{aligned}\tag{16}$

frome equation(16), we finally have

$\frac{2}{\lambda_i}>\alpha\tag{17}$

which implies:

$\frac{2}{\lambda_{\text{max}}}>\alpha\tag{18}$

This gives us the maximum stable learning rate is inversely proportional to the maximum curvature(direction along with eigenvector according to the maximum eigenvalue $\lambda_\text{max}$) of the quadratic function.

Let’s go back to the example, the Hessian matrix of the equation(9) is:

$A=\begin{bmatrix} 2&0\\0&50 \end{bmatrix}\tag{19}$

its eigenvectors and eigenvalues are:

$\{\lambda_1=2,\boldsymbol{z}_1=\begin{bmatrix}1\\0\end{bmatrix}\},\{\lambda_2=50,\boldsymbol{z}_2=\begin{bmatrix}0\\1\end{bmatrix}\}\tag{20}$

taking $\lambda_\text{max}=\lambda_2=50$ into equation(18), we get:

$\alpha_\text{max}<\frac{2}{50}=0.04\tag{21}$

so, let’s check the behavior of the algorithm when $\alpha=0.039$ and $\alpha=0.041$:

- set $\alpha=0.039$ we have:
- set $\alpha=0.041$ we have:

Up to now, both concepts and implement have been built to prove the correctness of the algorithm. And we also have the following tips:

- The algorithm tends to converge most quickly in the direction of the eigenvector corresponding to the largest eigenvalue
- The algorithm tends to converge most slowly in the direction of the eigenvector corresponding to the smallest eigenvalue
- Do not overshoot the minimum point for the too-long step(learning rate $\alpha$)
- If the distinction between $\lambda_\text{min}$ and $\lambda_\text{max}$, the algorithm tends to converge slowly.

## Minimizing along a Line

The section above gives us the upper bound of $\alpha$, but we have three kinds of strategies of selecting a $\alpha$ and the first one is "Minimizing $F(\boldsymbol{x})$ with $\alpha_k$ by minimizing along the line: $\boldsymbol{x}_k-\alpha_k\boldsymbol{g}_k$". What shall we do with this kind of $\alpha$?

To arbitrary functions, the stationary point along a direction can be calculated by:

$\begin{aligned} &\frac{dF}{d\alpha_k}(\boldsymbol{x}_k+\alpha_k\boldsymbol{p}_k)\\ &=\nabla F(\boldsymbol{x})^T\bigg|_{\boldsymbol{x}=\boldsymbol{x}_k}\boldsymbol{p}_k+\alpha_k\boldsymbol{p}_k^T\nabla F^2(\boldsymbol{x})^T\bigg|_{\boldsymbol{x}=\boldsymbol{x}_k}\boldsymbol{p}_k\\ &=0 \end{aligned}\tag{22}$

and then

$\alpha=-\frac{\nabla F(\boldsymbol{x})^T\bigg|_{\boldsymbol{x}=\boldsymbol{x}_k}\boldsymbol{p}_k}{\boldsymbol{p}_k^T\nabla F^2(\boldsymbol{x})\bigg|_{\boldsymbol{x}=\boldsymbol{x}_k}\boldsymbol{p}_k}=-\frac{\boldsymbol{g}^T_k\boldsymbol{p}_k}{\boldsymbol{p}_k^TA_k\boldsymbol{p}_k}\tag{23}$

where: $A_k$ is the Hessian matrix of old guess $\boldsymbol{x}_k$:

$A_k=\nabla^2F(\boldsymbol{x})\bigg|_{\boldsymbol{x}=\boldsymbol{x}_k}\tag{24}$

Here we look at an example:

$F(x)=\frac{1}{2}\boldsymbol{x}^T\begin{bmatrix} 2&1\\1&2 \end{bmatrix}\boldsymbol{x}\tag{25}$

with the initial guess

$\boldsymbol{x}_0=\begin{bmatrix}0.8\\-0.25\end{bmatrix}\tag{26}$

the gradient of the function is

$\nabla F(\boldsymbol{x})=\begin{bmatrix}2x_1+x_2\\x_1+2x_2\end{bmatrix}\tag{27}$

the initial direction of the algorithm is

$\boldsymbol{p}_0=-\boldsymbol{g}_0=-\nabla F(\boldsymbol{x})\bigg|_{\boldsymbol{x}=\boldsymbol{x}_k}=\begin{bmatrix}-1.35\\-0.3\end{bmatrix}\tag{28}$

and take equation(28) and $A=\begin{bmatrix}2&1\\1&2\end{bmatrix}$into equation(23):

$\alpha_0=\frac{\begin{bmatrix}1.35&0.3\end{bmatrix}\begin{bmatrix}2&1\\1&2\end{bmatrix}}{\begin{bmatrix}1.35&0.3\end{bmatrix}\begin{bmatrix}2&1\\1&2\end{bmatrix}\begin{bmatrix}1.35\\0.3\end{bmatrix}}=0.413\tag{29}$

and take all these data into equation(8):

$\boldsymbol{x}_1=\boldsymbol{x}_0-\alpha_0\boldsymbol{g}_0=\begin{bmatrix}0.8\\-0.25\end{bmatrix}-0.413\begin{bmatrix}1.35\\0.3\end{bmatrix}=\begin{bmatrix}0.24\\-0.37\end{bmatrix}\tag{30}$

What is going on is repeating the steps: equation(26) to equation(30) until some terminative conditions are achieved. It works like:

All the new directions in the steps are othogonal to their last direction:

$\boldsymbol{p}_{k+1}^T\boldsymbol{p}_k=\boldsymbol{g}_{k+1}^T\boldsymbol{p}_k=0\tag{31}$

what we need now is to proof is $\boldsymbol{g}_{k+1}^T\boldsymbol{p}_k=0$, with the chain rule of equation(22):

$\begin{aligned} \frac{dF}{d\alpha_k}(\boldsymbol{x}_{k+1})&=\frac{dF}{d\alpha_k}(\boldsymbol{x}_k+\alpha_k\boldsymbol{p}_k)\\ &=\nabla F(\boldsymbol{x})^T\bigg|_{\boldsymbol{x}=\boldsymbol{x}_{k+1}}\frac{d}{d\alpha_k}(\boldsymbol{x}_k+\alpha_k\boldsymbol{p}_k)\\ &=\nabla F(\boldsymbol{x})^T\bigg|_{\boldsymbol{x}=\boldsymbol{x}_{k+1}}\boldsymbol{p}_k\\ &=0 \end{aligned}\tag{32}$

equation(32) gives us: a new direction is always orthogonal to the last step direction at the minimum point of the function along the last step direction. This is also an inspiration for another algorithm called conjugate direction.

## References

Demuth, H.B., Beale, M.H., De Jess, O. and Hagan, M.T., 2014. Neural network design. Martin Hagan. ↩︎