Widrow-Hoff Learning (Part I)

Keywords: Widrow-Hoff learning, ADALINE, Adaptive LInear NEuron, LMS, performance learning

ADALINE, LMS, and Widrow-Hoff learning[1]

Performance learning was discussed in ‘Performance Surfaces and Optimum Points’, but we have not yet applied it to a neural network. In this post, we look at an important application of performance learning: Widrow-Hoff learning. The network it trains was invented by Bernard Widrow and his graduate student Marcian Hoff in 1960, at almost the same time as the perceptron discussed in ‘Perceptron Learning Rule’. Widrow-Hoff learning is an approximate steepest descent algorithm, and the performance index used in the learning rule is the mean square error.

The perceptron was discussed because it is still used today; it is a basic building block of modern neural networks. Widrow-Hoff learning deserves discussion because it is:

  1. widely used in signal processing applications, and
  2. the precursor to the backpropagation algorithm, which is central to current deep learning research.

ADALINE (Adaptive LInear NEuron) and its learning rule, the LMS (Least Mean Square) algorithm, were introduced in the paper ‘Adaptive switching circuits’. The only distinction between the perceptron and ADALINE is the transfer function: the perceptron uses a hard limit, while ADALINE's is linear. They share the same inherent limitation: both can solve only linearly separable problems. The LMS algorithm, however, is more powerful than the perceptron learning rule. As we saw in ‘Perceptron Learning Rule’, the perceptron algorithm always produces a decision boundary that passes through or near a training point (an example from the training set, sometimes also called a pattern), whereas LMS, by minimizing the mean square error, tends to move the boundary away from the training points. LMS had great success in signal processing, but attempts to adapt it to multilayer networks did not work well. Dr. Widrow left neural network research after the 1960s and returned to it in the 1980s. Backpropagation is a descendant of the LMS algorithm.

ADALINE Network

The abbreviated notation of the ADALINE network is:

It can be notated mathematically as:

\begin{aligned} \boldsymbol{a}&=\text{purelin}(W\boldsymbol{p} + \boldsymbol{b})\\ &= W\boldsymbol{p}+\boldsymbol{b} \end{aligned}\tag{1}

This is already a simple model, but we would like to use an even simpler one: a neuron with two inputs. Its abbreviated notation is:

whose mathematical notation is:

\begin{aligned} a&= W\boldsymbol{p}+b\\ &=w_{1,1}p_1+w_{1,2}p_2+b \end{aligned}\tag{2}

and the decision boundary of this 2-input neuron is the line $w_{1,1}p_1+w_{1,2}p_2+b=0$:

$\boldsymbol{w}$ is the vector consisting of the weights of the neuron, and it always points toward the region where $w_{1,1}p_1+w_{1,2}p_2+b>0$.

This is the simplest ADALINE neural network architecture. Next, we investigate how to modify its parameters: its learning rule.
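The two-input ADALINE above can be sketched in a few lines of code. This is a minimal illustration; the weight and bias values are hypothetical, not taken from the text:

```python
import numpy as np

# A hypothetical 2-input ADALINE: a = purelin(Wp + b) = Wp + b,
# since the linear transfer function passes its argument through unchanged.
W = np.array([[1.0, 1.0]])   # weight matrix, 1 neuron x 2 inputs (illustrative)
b = -1.0                     # bias (illustrative)

def adaline(p):
    """Output of the single linear neuron for input vector p."""
    return (W @ p)[0] + b

# Decision boundary: w11*p1 + w12*p2 + b = 0, here the line p1 + p2 = 1.
# The weight vector [1, 1] points into the region where the output is positive.
a_above = adaline(np.array([2.0, 1.0]))   # 2 + 1 - 1 = 2 > 0
a_below = adaline(np.array([0.0, 0.0]))   # 0 + 0 - 1 = -1 < 0
```

Unlike the perceptron, the output is not thresholded, so it carries a graded error signal that the learning rule below can exploit.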

Mean Square Error

LMS is a supervised learning algorithm, so it needs a training set:

\{\boldsymbol{p}_1,t_1\},\{\boldsymbol{p}_2,t_2\},\cdots,\{\boldsymbol{p}_Q,t_Q\}\tag{3}

Learning uses the discrepancy between the network's output for a given input $\boldsymbol{p}_i$ and its corresponding target $t_i$.

The quantity used in LMS is the mean square error (MSE), the expected squared difference between output and target. This is the performance index of the ADALINE neural network.

To keep the derivation tidy, we lump the parameters, including the bias, into a single vector:

\boldsymbol{x}=\begin{bmatrix} {}_1\boldsymbol{w}\\ b \end{bmatrix}\tag{4}

and augment the input correspondingly:

\boldsymbol{z}=\begin{bmatrix} \boldsymbol{p}\\ 1 \end{bmatrix}\tag{5}

$\boldsymbol{x}$ is used to denote the parameters to stay consistent with the posts on performance optimization (‘Performance Surfaces and Optimum Points’),
and equation (1) becomes:

a=W\boldsymbol{p}+b=\boldsymbol{x}^T\boldsymbol{z}\tag{6}
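As a quick sanity check of the lumped notation of equations (4)-(6), the two forms of the output agree. The weight, bias, and input values below are illustrative:

```python
import numpy as np

# Fold the bias into the parameter vector, x = [w; b], and append a 1
# to the input, z = [p; 1], as in equations (4) and (5).
w = np.array([1.0, -2.0])   # _1w, the neuron's weight vector (illustrative)
b = 0.5
p = np.array([3.0, 1.0])

x = np.append(w, b)         # x = [w1, w2, b]
z = np.append(p, 1.0)       # z = [p1, p2, 1]

a_plain = w @ p + b         # _1w^T p + b
a_lumped = x @ z            # x^T z, equation (6): same number
```

Folding the bias in this way lets the whole derivation that follows treat weights and bias as one parameter vector.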

The mean square error of the ADALINE network is:

F(\boldsymbol{x})=\mathbb {E}[e^2]=\mathbb {E}[(t-a)^2]=\mathbb {E} [(t-\boldsymbol{x}^T\boldsymbol{z})^2]\tag{7}

where $\mathbb{E}$ denotes the expectation, in a generalized sense that includes the time average for deterministic signals. The index is called MSE because it is the mean (average) of the square of the error.

Expanding the square:

\mathbb{E}[(t-\boldsymbol{x}^T\boldsymbol{z})^2]=\mathbb{E}[t^2]-2\boldsymbol{x}^T\mathbb{E}[t\boldsymbol{z}]+\boldsymbol{x}^T\mathbb{E}[\boldsymbol{z}\boldsymbol{z}^T]\boldsymbol{x}\tag{8}

where $\boldsymbol{x}$ is not a random variable, so it can be moved outside the expectations. Equation (8) can then be simplified to:

F(\boldsymbol{x})=C-2\boldsymbol{x}^T\boldsymbol{h}+\boldsymbol{x}^TR\boldsymbol{x}\tag{9}

where:

  • $\boldsymbol{h}=\mathbb{E}[t\boldsymbol{z}]$; in statistical terms, this is the cross-correlation between the input and the target.
  • $R=\mathbb{E}[\boldsymbol{z}\boldsymbol{z}^T]$ is the input correlation matrix, whose diagonal terms are the mean square values of the inputs.
  • $C=\mathbb{E}[t^2]$ is a constant.

Equation (9) is a quadratic function, and we can rewrite it in the standard form:

F(\boldsymbol{x})=c+ \boldsymbol{d}^T\boldsymbol{x}+\frac{1}{2}\boldsymbol{x}^TA\boldsymbol{x}\tag{10}

where:

  • $c=C$
  • $\boldsymbol{d}=-2\boldsymbol{h}$
  • $A=2R=2\mathbb{E}[\boldsymbol{z}\boldsymbol{z}^T]$ is a positive definite or positive semidefinite matrix. So the mean square error function has:
    1. a strong minimum if $R$ is positive definite, or
    2. a weak minimum or no minimum if $R$ is positive semidefinite.

The stationary point is where the gradient vanishes:

\nabla F(\boldsymbol{x})=-2\boldsymbol{h}+2R\boldsymbol{x}=\boldsymbol{0}\tag{11}

And finally we get the minimum point of the MSE:

\boldsymbol{x}^{\star}=R^{-1}\boldsymbol{h}\tag{12}

Because $R$ is built from the inputs, the inputs determine $R$, and $R$ determines the minimum point $\boldsymbol{x}^{\star}$ of the MSE. In other words, even when we have no direct knowledge of the Hessian matrix of the performance index (here $A=2R$), the inputs can be used to form an approximation of it.
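Equation (12) can be checked numerically. The sketch below uses a hypothetical four-point training set, estimates $R$ and $\boldsymbol{h}$ by averaging over the examples (time averages standing in for the expectations), and solves for the MSE minimum:

```python
import numpy as np

# Toy training set (chosen by hand): targets equal the first input, t = p1.
P = np.array([[ 1.0,  1.0],
              [ 1.0, -1.0],
              [-1.0,  1.0],
              [-1.0, -1.0]])          # inputs p_q
t = np.array([1.0, 1.0, -1.0, -1.0])  # targets t_q

Z = np.hstack([P, np.ones((4, 1))])   # augmented inputs z_q = [p_q; 1]
R = Z.T @ Z / 4                       # R = E[z z^T], averaged over the set
h = Z.T @ t / 4                       # h = E[t z]
x_star = np.linalg.solve(R, h)        # x* = R^{-1} h, equation (12)
```

For this data the minimum works out to $\boldsymbol{x}^{\star}=[1,0,0]^T$, i.e. weights $[1,0]$ and bias $0$, exactly the parameters that reproduce the targets with zero error.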

LMS Algorithm

LMS is short for least mean square; it is the algorithm for searching for the minimum of the performance index.

When $\boldsymbol{h}$ and $R$ are known, the stationary point can be found directly. If $R^{-1}$ is impractical to compute, we can use the ‘steepest descent algorithm’. However, in the most common case, both $\boldsymbol{h}$ and $R$ are unknown or inconvenient to compute. For this case we use approximate steepest descent, in which we work with an estimated gradient.

We also replace the expectation of the squared error with the squared error at iteration $k$:

\hat{F}(\boldsymbol{x})=(t(k)-a(k))^2=e^2(k)\tag{13}

and the approximation of the gradient is:

\hat{\nabla}F=\nabla e^2(k)\tag{14}

This is the key point of the algorithm, and the estimate is also known as the ‘stochastic gradient’. When this approximation is used in a gradient descent algorithm, the method is referred to as

  • on-line learning
  • incremental learning

which means the parameters are updated as each input is presented.

In $\nabla e^2(k)$, the first $R$ components ($R$ here denoting the number of inputs) are the derivatives with respect to the weights. So we have:

[\nabla e^2(k)]_j=\frac{\partial e^2(k)}{\partial w_{1,j}}=2 e(k)\frac{\partial e(k)}{\partial w_{1,j}}\tag{15}

for $j=1,2,\cdots,R$, and similarly for the bias:

[\nabla e^2(k)]_{R+1}=\frac{\partial e^2(k)}{\partial b}=2 e(k)\frac{\partial e(k)}{\partial b}\tag{16}

Now consider $\frac{\partial e(k)}{\partial w_{1,j}}$, where the error is the difference between the target and the output of an ADALINE neuron with $R$ weights:

\begin{aligned} \frac{\partial e(k)}{\partial w_{1,j}}&=\frac{\partial[t(k)-a(k)]}{\partial w_{1,j}}\\ &=\frac{\partial}{\partial w_{1,j}}\left[t(k)-\left({}_1\boldsymbol{w}^T\boldsymbol{p}(k)+b\right)\right]\\ &=\frac{\partial}{\partial w_{1,j}}\left[t(k)-\left(\sum_{i=1}^R w_{1,i}p_i(k)+b\right)\right]\\ &=-p_j(k) \end{aligned}\tag{17}

and similarly:

\frac{\partial e(k)}{\partial b} = -1\tag{18}

Since $\boldsymbol{z}$ is composed of $p_j(k)$ and $1$, as mentioned above, equation (14) can be rewritten as:

\hat{\nabla}F=\nabla e^2(k)=-2e(k)\boldsymbol{z}(k)\tag{19}

Equation (19) gives a compact form for the gradient of the squared error, which in turn approximates the gradient of the mean squared error.

Then we substitute this approximation of $\nabla F$ into the steepest descent algorithm:

\boldsymbol{x}_{k+1}=\boldsymbol{x}_{k}-\alpha \hat{\nabla} F(\boldsymbol{x})\bigg |_{\boldsymbol{x}=\boldsymbol{x}_k}\tag{20}

and substitute $\hat{\nabla} F = -2e(k)\boldsymbol{z}(k)$ into equation (20):

\boldsymbol{x}_{k+1}=\boldsymbol{x}_{k}+2\alpha e(k)\boldsymbol{z}(k)\tag{21}

This is the LMS algorithm, also known as the delta rule or the Widrow-Hoff learning algorithm. For a network with multiple neurons, it has the matrix form:

W(k+1)=W(k)+2\alpha \boldsymbol{e}(k) \boldsymbol{p}^T(k)\tag{22}

with the analogous update $\boldsymbol{b}(k+1)=\boldsymbol{b}(k)+2\alpha \boldsymbol{e}(k)$ for the bias vector, since the bias column of $\boldsymbol{z}$ is constant at $1$.
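The single-neuron update of equation (21) can be sketched as follows, cycling over a hypothetical training set (the same toy data idea as earlier; the learning rate is chosen well below the stability bound derived in the next section):

```python
import numpy as np

P = np.array([[ 1.0,  1.0],
              [ 1.0, -1.0],
              [-1.0,  1.0],
              [-1.0, -1.0]])          # inputs p(k) (illustrative)
t = np.array([1.0, 1.0, -1.0, -1.0])  # targets t(k): here t = p1
alpha = 0.1                           # learning rate

x = np.zeros(3)                       # x = [w1, w2, b], start at zero
for _ in range(100):                  # repeated passes over the data
    for p_k, t_k in zip(P, t):
        z = np.append(p_k, 1.0)       # z(k) = [p(k); 1]
        e = t_k - x @ z               # e(k) = t(k) - a(k)
        x = x + 2 * alpha * e * z     # equation (21)
# x approaches the MSE minimum x* = R^{-1} h
```

Each update uses only the current input and error, which is exactly the on-line character of the algorithm noted above.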

Analysis of Convergence

We have analyzed the convergence of the steepest descent algorithm: it converges when

\alpha<\frac{2}{\lambda_{\text{max}}}\tag{23}

where $\lambda_{\text{max}}$ is the largest eigenvalue of the Hessian matrix $A$. For LMS, $\boldsymbol{x}(k)$ is a function of $\boldsymbol{z}(k-1),\cdots,\boldsymbol{z}(0)$. We assume that $\boldsymbol{z}(k-1),\cdots,\boldsymbol{z}(0)$ are statistically independent and that $\boldsymbol{x}(k)$ is statistically independent of $\boldsymbol{z}(k)$.

What we are using is an approximation of the steepest descent algorithm, which is guaranteed to converge to $\boldsymbol{x}^{\star}=R^{-1}\boldsymbol{h}$ for equation (9); what we need to do is prove that LMS converges to $\boldsymbol{x}^{\star}=R^{-1}\boldsymbol{h}$ as well.

The update procedure of LMS is:

\boldsymbol{x}_{k+1}=\boldsymbol{x}_{k}+2\alpha e(k)\boldsymbol{z}(k)\tag{24}

Taking the expectation of both sides of equation (24):

\mathbb{E}[\boldsymbol{x}_{k+1}]=\mathbb{E}[\boldsymbol{x}_{k}]+2\alpha \mathbb{E}[e(k)\boldsymbol{z}(k)]\tag{25}

Substituting $t(k)-\boldsymbol{x}_k^T\boldsymbol{z}(k)$ for the error $e(k)$ gives:

\mathbb{E}[\boldsymbol{x}_{k+1}]=\mathbb{E}[\boldsymbol{x}_{k}]+2\alpha \mathbb{E}[(t(k)-\boldsymbol{x}_k^T\boldsymbol{z}(k))\boldsymbol{z}(k)] \tag{26}

Substituting $\boldsymbol{z}^T(k)\boldsymbol{x}_k$ for $\boldsymbol{x}^T_k\boldsymbol{z}(k)$ and rearranging terms:

\mathbb{E}[\boldsymbol{x}_{k+1}]=\mathbb{E}[\boldsymbol{x}_{k}]+2\alpha \{\mathbb{E}[t(k)\boldsymbol{z}(k)]-\mathbb{E}[\boldsymbol{z}(k)\boldsymbol{z}^T(k)\boldsymbol{x}_k]\} \tag{27}

Because $\boldsymbol{x}_k$ is independent of $\boldsymbol{z}(k)$:

\mathbb{E}[\boldsymbol{x}_{k+1}]=\mathbb{E}[\boldsymbol{x}_{k}]+2\alpha \{\boldsymbol{h}-R\mathbb{E}[\boldsymbol{x}_k]\} \tag{28}

and this can also be written as:

\mathbb{E}[\boldsymbol{x}_{k+1}]=[I-2\alpha R]\mathbb{E}[\boldsymbol{x}_{k}]+2\alpha \boldsymbol{h} \tag{29}

This is a dynamic system; for it to be stable, every eigenvalue of $I-2\alpha R$ must be greater than $-1$ and less than $1$. This matrix has the same eigenvectors as $R$, and its eigenvalues are $1-2\alpha \lambda_i$, so:

\begin{aligned} 1>1-2\alpha\lambda_i>&-1\\ 0<\alpha<&\frac{1}{\lambda_i} \end{aligned}\tag{30}

Since equation (30) must hold for every eigenvalue $\lambda_i$ of $R$, it is equivalent to:

\alpha<\frac{1}{\lambda_\text{max}}\tag{31}

Because $A=2R$, so that $\lambda_{\text{max}}(A)=2\lambda_{\text{max}}(R)$, equation (31) gives the same condition as the ‘steepest descent algorithm’.
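On the toy data used earlier, the bound of equation (31) is easy to compute. A sketch (`eigvalsh` applies because $R$ is symmetric):

```python
import numpy as np

Z = np.array([[ 1.0,  1.0, 1.0],
              [ 1.0, -1.0, 1.0],
              [-1.0,  1.0, 1.0],
              [-1.0, -1.0, 1.0]])      # augmented inputs z_q (illustrative)
R = Z.T @ Z / 4                        # input correlation matrix E[z z^T]
lam_max = np.linalg.eigvalsh(R).max()  # largest eigenvalue of R
alpha_bound = 1.0 / lam_max            # LMS is stable in the mean below this
```

For this particular data $R$ is the identity, so the bound is $\alpha<1$; the learning rate $\alpha=0.1$ used in the LMS sketch above sits comfortably inside it.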

Under this condition, when the system reaches steady state, we have:

\mathbb{E}[\boldsymbol{x}_{\text{ss}}]=[I-2\alpha R]\mathbb{E}[\boldsymbol{x}_{\text{ss}}] +2 \alpha \boldsymbol{h} \tag{32}

or

\mathbb{E}[\boldsymbol{x}_{\text{ss}}]=R^{-1}\boldsymbol{h}=\boldsymbol{x}^{\star} \tag{33}

Equation (33) shows that the LMS solution, built from one input at a time, converges in expectation to the same solution we would obtain if the Hessian matrix were known.
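Iterating the mean dynamics of equation (29) directly confirms the steady state of equation (33). The correlation matrix and cross-correlation vector below are illustrative, not drawn from the text:

```python
import numpy as np

R = np.array([[2.0, 0.0],
              [0.0, 0.5]])            # illustrative input correlation matrix
h = np.array([1.0, 1.0])              # illustrative cross-correlation E[t z]
alpha = 0.2                           # below 1/lambda_max = 0.5, so stable

m = np.zeros(2)                       # m_k = E[x_k]
for _ in range(200):
    # equation (29): E[x_{k+1}] = (I - 2*alpha*R) E[x_k] + 2*alpha*h
    m = (np.eye(2) - 2 * alpha * R) @ m + 2 * alpha * h

x_ss = np.linalg.solve(R, h)          # equation (33): R^{-1} h = [0.5, 2.0]
```

After 200 iterations `m` matches `x_ss` to machine precision; with `alpha` above the bound of equation (31), the same loop would diverge instead.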

References


  1. Demuth, H.B., Beale, M.H., De Jess, O. and Hagan, M.T., 2014. Neural network design. Martin Hagan. ↩︎