We have learned ‘steepest descent method’ and “Newton’s method”. They have advantages and limits at the same time. The main advantage of Newton’s method is the speed, it converges quickly. And the main advantage of the steepest descent method guarantees to converge to a local minimum. But the limits of Newton’s method is that it needs too much resource of both computation and storage when the number of parameters is large. And the speed of the steepest descent method is too slow. What we are going to research is what we can do to mix them up to a new algorithm who needs fewer resources but has a high speed. In other words, we use first-order derivatives but still have quadratic termination.

Recall steepest descent method whose search direction at consecutive iterations are orthogonal and for a quadratic function, the trajectory of the method is zig-zag.

Our target here is simplifyed to a quadratic function to have a insight into the process of the new method:

$F(\boldsymbol{x})=\frac{1}{2}\boldsymbol{x}^TA\boldsymbol{x}+\boldsymbol{d}\boldsymbol{x}+c\tag{1}$

Before we goint to the method, we need recall the linear algebra concept “conjugate” firstly:

A set of vectors $\{\boldsymbol{p}_k\}$ for $k=1,2,\cdots,n$ are mutually conjugate with respect to a positive definite matrix $A$ if and only if:

$\boldsymbol{p}^T_kA\boldsymbol{p}_j=0\tag{2}$

where $j\neq k$

for the vectors in a conjugate set is linear independent, so if there are $n$ vectors in a conjugate set, they can span a $n$-dimansion space. And to a certain space, there are inifinity numbers of conjugate sets. For instance, eigenvectors of $A$ in equation(1) are $\{\boldsymbol{z}_1,\cdots,\boldsymbol{z}_n\}$ and eigenvalues are $\{\lambda_1,\cdots,\lambda_n\}$ and we have:

$\boldsymbol{z}_k^TA\boldsymbol{z}_j=\lambda_j\boldsymbol{z}_k^T\boldsymbol{z}_j\tag{3}$

when the matrix $A$ is positive definite, eigenvectors are mutually orthogonal. And then equation(3) equal to $0$ when $k\neq j$

‘Quadratic function’ told us the eigenvectors are principal axes, and searching along these eigenvectors can finally minimize the quadratic function.

The calculation of a Hessian matrix should be avoided in our new method because that needs too much time to compute and too much space to restore the temporary data. So this trajectory along eigenvectors would not be used. Then an idea comes to us that other conjugate sets may also give a trajectory to the minimum. And mathematicians did proof our guess is correct and searching along $n$ vectors conjugate set $\{\boldsymbol{p_1},\dots,\boldsymbol{p}_n\}$ of a matrix $A$ can converge to the minimum in at most $n$ searches.

Basing on this theory, what we should do next is find a conjugate set of a matrix without the use of the Hessian matrix. We restate that:

$\nabla F(\boldsymbol{x})=\boldsymbol{g}=A\boldsymbol{x}+\boldsymbol{d}\tag{4}$

and

$\nabla^2 F(\boldsymbol{x})=A\tag{5}$

in the steepest descent method, each step we always search along the negative direction of gradient $\boldsymbol{g}$. Then the change between the consecutive step is:

$\Delta \boldsymbol{g}_k = \boldsymbol{g}_{k+1}-\boldsymbol{g}_k\tag{6}$

Insert equation(4) into equation(6):

\begin{aligned} \Delta \boldsymbol{g}_k &= \boldsymbol{g}_{k+1}-\boldsymbol{g}_k\\ &=A\boldsymbol{x}_{k+1}+\boldsymbol{d} -A\boldsymbol{x}_{k}+\boldsymbol{d}\\ &=A\Delta\boldsymbol{x} \end{aligned} \tag{7}

equation(7) is the key point in this new algorithm because this equation replace $A\Delta\boldsymbol{x}$ by $\Delta \boldsymbol{g}_k$ to avoid the calculation about matrix $A$. Andfor in the post we update the variables in each step by $\boldsymbol{x}_{k+1} =\boldsymbol{x}_k +\alpha\boldsymbol{p}_k$ then we have:

$\Delta\boldsymbol{x} =\boldsymbol{x}_{k+1} -\boldsymbol{x}_k =\alpha\boldsymbol{p}_k\tag{8}$

equation(2) is also established as:

$\alpha\boldsymbol{p}^T_kA\boldsymbol{p}_j=0\tag{9}$

and take equation(8) into equation(9):

$\alpha\boldsymbol{p}^T_kA\boldsymbol{p}_j=\Delta\boldsymbol{x}^TA\boldsymbol{p}_j=0\tag{10}$

Then the key point equation(7) would be used to replace the $\Delta\boldsymbol{x}^TA$ in equation(10):

$\Delta\boldsymbol{x}^TA\boldsymbol{p}_j=\boldsymbol{g}^T\boldsymbol{p}_j=0\tag{11}$

where $j\neq k$

Till now, we have all of what we need to build a method acting like a second-order method with just the amount of calculation of first-order method. And the procedure of method is:

1. initialize $\boldsymbol{p}_0=-\boldsymbol{g}_0$
2. construct a vector, the next direction, $\boldsymbol{p}_k$ orthogonal to $\{\Delta\boldsymbol{g}_0,\Delta\boldsymbol{g}_1,\cdots,\Delta\boldsymbol{g}_{k-1}\}$(somehow it looks like Gram-Schimidt method)
3. $\boldsymbol{p}_k=-\boldsymbol{g}_k+\beta_k\boldsymbol{p}_{k-1}$
4. $\beta_k$ is a scalar and can be calculated in three kinds of ways:
1. $\beta_k=\frac{\Delta\boldsymbol{g}_{k-1}^T\boldsymbol{g}_k}{\Delta\boldsymbol{g}_{k-1}^T\boldsymbol{p}_{k-1}}$
2. $\beta_k=\frac{\boldsymbol{g}_{k}^T\boldsymbol{g}_k}{\boldsymbol{g}_{k-1}^T\boldsymbol{p}_{k-1}}$
3. $\beta_k=\frac{\Delta\boldsymbol{g}_{k-1}^T\boldsymbol{g}_k}{\boldsymbol{g}_{k-1}^T\boldsymbol{g}_{k-1}}$

A simple example:

$F(\boldsymbol{x})=\frac{1}{2}\boldsymbol{x}^T\begin{bmatrix}2&1\\1&2\end{bmatrix}\boldsymbol{x}$

## References

1. Demuth, H.B., Beale, M.H., De Jess, O. and Hagan, M.T., 2014. Neural network design. Martin Hagan. ↩︎