Conjugate Gradient

Keywords: conjugate gradient

Conjugate Gradient[1]

We have learned the ‘steepest descent method’ and “Newton’s method”. Each has advantages and limits. The main advantage of Newton’s method is its speed: it converges quickly. The main advantage of the steepest descent method is that it is guaranteed to converge to a local minimum. But the limit of Newton’s method is that it needs too many resources, in both computation and storage, when the number of parameters is large, and the limit of the steepest descent method is that it is too slow. What we investigate here is how to mix them into a new algorithm that needs fewer resources but still converges fast. In other words, we want to use only first-order derivatives and still have quadratic termination.

Recall that in the steepest descent method the search directions at consecutive iterations are orthogonal, so for a quadratic function the trajectory of the method is a zig-zag.

Our target here is simplified to a quadratic function, to get an insight into the process of the new method:

$$F(\boldsymbol{x})=\frac{1}{2}\boldsymbol{x}^TA\boldsymbol{x}+\boldsymbol{d}^T\boldsymbol{x}+c\tag{1}$$

Before we go into the method, we first need to recall the linear algebra concept of “conjugate” vectors:

A set of vectors $\{\boldsymbol{p}_k\}$ for $k=1,2,\cdots,n$ is mutually conjugate with respect to a positive definite matrix $A$ if and only if:

$$\boldsymbol{p}^T_kA\boldsymbol{p}_j=0\tag{2}$$

where $j\neq k$.
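
To make the definition concrete, here is a minimal numpy check; the matrix is the one from the example at the end of this post, and the second direction is built with a Gram-Schmidt-like step under the inner product $\boldsymbol{u}^TA\boldsymbol{v}$, which is just one convenient way to produce a conjugate pair:

```python
import numpy as np

# A positive definite matrix (the one from the example at the end of the post).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

p1 = np.array([1.0, 0.0])
# Build a second direction conjugate to p1 by removing its A-weighted component
# (a Gram-Schmidt-like step using the inner product u^T A v).
v = np.array([0.0, 1.0])
p2 = v - (p1 @ A @ v) / (p1 @ A @ p1) * p1

print(p1 @ A @ p2)   # 0.0  -> p1 and p2 are conjugate with respect to A
print(p1 @ p2)       # -0.5 -> conjugate does not mean orthogonal
```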

The vectors in a conjugate set are linearly independent, so if there are $n$ vectors in a conjugate set, they span an $n$-dimensional space. And for a given space, there are infinitely many conjugate sets. For instance, let the eigenvectors of $A$ in equation (1) be $\{\boldsymbol{z}_1,\cdots,\boldsymbol{z}_n\}$ and the eigenvalues be $\{\lambda_1,\cdots,\lambda_n\}$; then we have:

$$\boldsymbol{z}_k^TA\boldsymbol{z}_j=\lambda_j\boldsymbol{z}_k^T\boldsymbol{z}_j\tag{3}$$

Since the matrix $A$ is symmetric and positive definite, its eigenvectors are mutually orthogonal, so equation (3) equals $0$ when $k\neq j$: the eigenvectors form one conjugate set.
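
A quick numerical confirmation of this fact, again using the matrix from the final example as an arbitrary but convenient choice:

```python
import numpy as np

# Eigenvectors of a symmetric positive definite matrix are orthogonal,
# and by equation (3) they are therefore also conjugate with respect to A.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, Z = np.linalg.eigh(A)   # columns of Z are the eigenvectors z_1, z_2

z1, z2 = Z[:, 0], Z[:, 1]
print(z1 @ z2)       # ~0: orthogonal
print(z1 @ A @ z2)   # ~0: conjugate, as equation (3) predicts
```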

The post ‘Quadratic function’ told us that these eigenvectors are the principal axes of the function’s contours, and searching along them one after another finally minimizes the quadratic function.

However, calculating the Hessian matrix should be avoided in our new method, because it takes too much time to compute and too much space to store. So the trajectory along the eigenvectors will not be used. This leads to the idea that other conjugate sets may also give a trajectory to the minimum. Mathematicians did prove that this guess is correct: searching along a set of $n$ mutually conjugate vectors $\{\boldsymbol{p}_1,\dots,\boldsymbol{p}_n\}$ of a matrix $A$ converges to the minimum in at most $n$ searches.

Based on this theory, what we should do next is to find a conjugate set of the matrix without using the Hessian matrix. We restate that:

$$\nabla F(\boldsymbol{x})=\boldsymbol{g}=A\boldsymbol{x}+\boldsymbol{d}\tag{4}$$

and

$$\nabla^2 F(\boldsymbol{x})=A\tag{5}$$

In the steepest descent method, at each step we search along the negative direction of the gradient $\boldsymbol{g}$. The change of the gradient between consecutive steps is:

$$\Delta \boldsymbol{g}_k = \boldsymbol{g}_{k+1}-\boldsymbol{g}_k\tag{6}$$

Inserting equation (4) into equation (6):

$$\begin{aligned} \Delta \boldsymbol{g}_k &= \boldsymbol{g}_{k+1}-\boldsymbol{g}_k\\ &=(A\boldsymbol{x}_{k+1}+\boldsymbol{d}) -(A\boldsymbol{x}_{k}+\boldsymbol{d})\\ &=A\Delta\boldsymbol{x}_k \end{aligned}\tag{7}$$
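
A small numerical sanity check of equation (7); the matrix, linear term, and pair of points below are arbitrary choices for illustration:

```python
import numpy as np

# For a quadratic, the change in gradient equals A times the change in x,
# so A never has to be formed explicitly once consecutive gradients are known.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
d = np.array([1.0, -1.0])

def grad(x):
    # gradient of F(x) = 1/2 x^T A x + d^T x + c, see equation (4)
    return A @ x + d

x_k  = np.array([0.5, -0.3])
x_k1 = np.array([-0.2, 0.7])       # x_{k+1}

delta_g = grad(x_k1) - grad(x_k)   # equation (6)
delta_x = x_k1 - x_k

print(delta_g)       # [-0.4  1.3]
print(A @ delta_x)   # the same vector, confirming equation (7)
```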

Equation (7) is the key point of the new algorithm, because it replaces $A\Delta\boldsymbol{x}_k$ with $\Delta \boldsymbol{g}_k$ and so avoids any calculation involving the matrix $A$. And since, as in the earlier posts, we update the variables at each step by $\boldsymbol{x}_{k+1} =\boldsymbol{x}_k +\alpha\boldsymbol{p}_k$, we have:

$$\Delta\boldsymbol{x}_k =\boldsymbol{x}_{k+1} -\boldsymbol{x}_k =\alpha\boldsymbol{p}_k\tag{8}$$

Multiplying equation (2) by the step size $\alpha$, the conjugacy condition still holds:

$$\alpha\boldsymbol{p}^T_kA\boldsymbol{p}_j=0\tag{9}$$

Substituting equation (8) into equation (9):

$$\alpha\boldsymbol{p}^T_kA\boldsymbol{p}_j=\Delta\boldsymbol{x}_k^TA\boldsymbol{p}_j=0\tag{10}$$

Then the key equation (7) can be used to replace $\Delta\boldsymbol{x}_k^TA$ in equation (10), since $A$ is symmetric and therefore $\Delta\boldsymbol{x}_k^TA=(A\Delta\boldsymbol{x}_k)^T=\Delta\boldsymbol{g}_k^T$:

$$\Delta\boldsymbol{x}_k^TA\boldsymbol{p}_j=\Delta\boldsymbol{g}_k^T\boldsymbol{p}_j=0\tag{11}$$

where $j\neq k$.

Up to now, we have everything we need to build a method that acts like a second-order method with only the amount of computation of a first-order method. The procedure of the method is:

  1. Initialize the first search direction with steepest descent: $\boldsymbol{p}_0=-\boldsymbol{g}_0$
  2. At step $k$, construct the next direction $\boldsymbol{p}_k$ orthogonal to $\{\Delta\boldsymbol{g}_0,\Delta\boldsymbol{g}_1,\cdots,\Delta\boldsymbol{g}_{k-1}\}$ (somewhat like the Gram-Schmidt process):
  3. $\boldsymbol{p}_k=-\boldsymbol{g}_k+\beta_k\boldsymbol{p}_{k-1}$
  4. $\beta_k$ is a scalar and can be calculated in three ways:
    1. $\beta_k=\frac{\Delta\boldsymbol{g}_{k-1}^T\boldsymbol{g}_k}{\Delta\boldsymbol{g}_{k-1}^T\boldsymbol{p}_{k-1}}$ (Hestenes-Stiefel)
    2. $\beta_k=\frac{\boldsymbol{g}_{k}^T\boldsymbol{g}_k}{\boldsymbol{g}_{k-1}^T\boldsymbol{g}_{k-1}}$ (Fletcher-Reeves)
    3. $\beta_k=\frac{\Delta\boldsymbol{g}_{k-1}^T\boldsymbol{g}_k}{\boldsymbol{g}_{k-1}^T\boldsymbol{g}_{k-1}}$ (Polak-Ribière)
  5. Update $\boldsymbol{x}_{k+1}=\boldsymbol{x}_k+\alpha\boldsymbol{p}_k$, where the step size $\alpha$ is found by a line search along $\boldsymbol{p}_k$, and repeat steps 2–4 until convergence; a minimal sketch of the full loop is given below.
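
Here is that sketch in numpy, written for the quadratic function (1) with a symmetric positive definite $A$. The function name, arguments, and stopping rule are implementation choices rather than part of the procedure above; for a quadratic the exact line search has the closed form $\alpha_k=-\frac{\boldsymbol{g}_k^T\boldsymbol{p}_k}{\boldsymbol{p}_k^TA\boldsymbol{p}_k}$, while for a general function a numerical line search would replace that line so that $A$ never appears:

```python
import numpy as np

def conjugate_gradient(A, d, x0, beta_rule="fletcher-reeves", tol=1e-10):
    """Minimize F(x) = 1/2 x^T A x + d^T x + c for symmetric positive definite A.

    A sketch of the procedure above; the name, arguments and stopping rule
    are implementation choices, not part of the original text.
    """
    x = np.asarray(x0, dtype=float)
    g = A @ x + d                       # equation (4)
    p = -g                              # step 1: p_0 = -g_0
    for _ in range(len(x)):             # at most n searches for an n-dim quadratic
        if np.linalg.norm(g) < tol:
            break
        # exact line search for a quadratic; a general function would use
        # a numerical line search here so that A is never needed
        alpha = -(g @ p) / (p @ A @ p)
        x = x + alpha * p               # x_{k+1} = x_k + alpha * p_k
        g_new = A @ x + d
        dg = g_new - g                  # equation (6)
        if beta_rule == "hestenes-stiefel":
            beta = (dg @ g_new) / (dg @ p)
        elif beta_rule == "polak-ribiere":
            beta = (dg @ g_new) / (g @ g)
        else:                           # "fletcher-reeves"
            beta = (g_new @ g_new) / (g @ g)
        p = -g_new + beta * p           # p_k = -g_k + beta_k * p_{k-1}
        g = g_new
    return x
```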

A simple example:

$$F(\boldsymbol{x})=\frac{1}{2}\boldsymbol{x}^T\begin{bmatrix}2&1\\1&2\end{bmatrix}\boldsymbol{x}$$
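
Below is a short numerical trace of the method on this quadratic, starting from an assumed initial point $\boldsymbol{x}_0=[0.8,\ -0.25]^T$ (the example does not specify one, so this choice is only for illustration). With exact line searches and the Fletcher-Reeves $\beta_k$, the minimum at the origin is reached in exactly two iterations, as the conjugate-direction theory promises for $n=2$:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
x = np.array([0.8, -0.25])          # assumed starting point, not given in the text

g = A @ x                           # gradient of F(x) = 1/2 x^T A x (d = 0 here)
p = -g                              # p_0 = -g_0
directions = [p]

for k in range(2):                  # n = 2, so at most two searches are needed
    alpha = -(g @ p) / (p @ A @ p)  # exact line search along p_k for a quadratic
    x = x + alpha * p
    g_new = A @ x
    beta = (g_new @ g_new) / (g @ g)    # Fletcher-Reeves beta_k
    p = -g_new + beta * p
    g = g_new
    directions.append(p)
    print(f"x_{k+1} =", np.round(x, 6))

# x_1 ≈ [0.242963, -0.373787]
# x_2 ≈ [0, 0]  -> the minimum of this quadratic
print("p_0^T A p_1 =", round(directions[0] @ A @ directions[1], 12))  # ~0
```

The last printed value checks that the two search directions produced by the method are indeed conjugate with respect to $A$, which is exactly the property required by equation (2).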

References


  1. Demuth, H.B., Beale, M.H., De Jesús, O. and Hagan, M.T., 2014. Neural Network Design. Martin Hagan. ↩︎