## ADALINE, LMS, and Widrow-Hoff learning[1]

Performance learning was discussed in ‘Performance Surfaces and Optimum Points’, but we have not yet used it in a neural network. In this post we look at an important application of performance learning: a network invented by Bernard Widrow and his graduate student Marcian Hoff in 1960, at almost the same time as the perceptron discussed in ‘Perceptron Learning Rule’. The associated learning rule is called Widrow-Hoff learning. It is an approximate steepest descent algorithm, and the performance index used in the learning rule is the mean square error.

The perceptron was discussed because it is still in use today; it is a basic building block of modern neural networks. Widrow-Hoff learning deserves discussion because it is:

1. widely used in signal processing applications, and
2. the precursor to the backpropagation algorithm, a central part of current deep learning research.

ADALINE (ADAptive LInear NEuron) and its learning rule, the LMS (Least Mean Square) algorithm, were introduced in the paper ‘Adaptive switching circuits’. The only distinction between the perceptron and ADALINE is the transfer function, which is hard-limiting in the perceptron and linear in ADALINE. They share the same inherent limitation: they can only solve linearly separable problems. The LMS algorithm, however, is more powerful than the perceptron learning rule. As we saw in ‘Perceptron Learning Rule’, the perceptron algorithm always produces a decision boundary that passes through or near a training point (an example from the training set, sometimes also called a pattern), whereas LMS minimizes the mean square error and tends to move the boundary away from the training patterns. LMS had great success in signal processing, but attempts to adapt it to multilayer networks did not work well. Dr. Widrow left neural network research after the 1960s and returned to it in the 1980s. Backpropagation is a descendant of the LMS algorithm.

The abbreviated notation of the ADALINE network is:

It can be written mathematically as:

\begin{aligned} \boldsymbol{a}&=\text{purelin}(W\boldsymbol{p} + \boldsymbol{b})\\ &= W\boldsymbol{p}+\boldsymbol{b} \end{aligned}\tag{1}

This is already a simple model, but we will use an even simpler one: a single neuron with two inputs. Its abbreviated notation is:

whose mathematical notation is:

\begin{aligned} a&= {}_1\boldsymbol{w}^T\boldsymbol{p}+b\\ &=w_{1,1}p_1+w_{1,2}p_2+b \end{aligned}\tag{2}

and the decision boundary of this 2-input neuron is the line where

$a=w_{1,1}p_1+w_{1,2}p_2+b=0$

${}_1\boldsymbol{w}$ is the vector consisting of the neuron's weights, and it always points toward the region where $w_{1,1}p_1+w_{1,2}p_2+b>0$
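
As a quick illustration, here is a minimal sketch of a 2-input ADALINE with a purelin (identity) transfer function; the weights and bias are made up for the demo. Points on opposite sides of the boundary ${}_1\boldsymbol{w}^T\boldsymbol{p}+b=0$ produce outputs of opposite sign:

```python
import numpy as np

# Illustrative (made-up) weights and bias for a 2-input ADALINE
w = np.array([1.0, 1.0])
b = -1.0

def adaline(p):
    """Linear neuron output: a = w^T p + b (purelin transfer function)."""
    return w @ p + b

# Points on either side of the decision boundary w^T p + b = 0
print(adaline(np.array([1.0, 1.0])))  # positive: the side w points toward
print(adaline(np.array([0.0, 0.0])))  # negative: the other side
```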

This is the simplest ADALINE network architecture. Next we investigate how to modify its parameters, i.e. its learning rule.

## Mean Square Error

LMS is a supervised learning algorithm, so it needs a training set:

$\{\boldsymbol{p}_1,t_1\},\{\boldsymbol{p}_2,t_2\},\cdots,\{\boldsymbol{p}_Q,t_Q\}\tag{3}$

For each input $\boldsymbol{p}_i$, we compare the network output with the corresponding target $t_i$ and use that information to update the parameters.

The information used in LMS is the mean square error (MSE), the expected squared difference between output and target. This is the performance index of the ADALINE network.

To simplify the calculation, we lump the parameters, including the bias, into a single vector:

$\boldsymbol{x}=\begin{bmatrix} {}_1\boldsymbol{w}\\ b \end{bmatrix}\tag{4}$

and augment the input accordingly:

$\boldsymbol{z}=\begin{bmatrix} \boldsymbol{p}\\ 1 \end{bmatrix}\tag{5}$

$\boldsymbol{x}$ is used to denote the parameters to keep the notation consistent with the posts on performance optimization (‘Performance Surfaces and Optimum Points’),
and equation (1) becomes:

$a={}_1\boldsymbol{w}^T\boldsymbol{p}+b=\boldsymbol{x}^T\boldsymbol{z}\tag{6}$
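
A tiny numeric check, with made-up weights and inputs, that the lumped form $\boldsymbol{x}^T\boldsymbol{z}$ reproduces ${}_1\boldsymbol{w}^T\boldsymbol{p}+b$:

```python
import numpy as np

# Hypothetical 2-input example
w = np.array([2.0, -1.0]); b = 0.5
p = np.array([1.0, 3.0])

x = np.append(w, b)      # x = [w; b]  (equation 4)
z = np.append(p, 1.0)    # z = [p; 1]  (equation 5)

a_direct = w @ p + b     # 2*1 + (-1)*3 + 0.5 = -0.5
a_lumped = x @ z         # same value via equation (6)
assert np.isclose(a_direct, a_lumped)
```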

The expression for the ADALINE network mean square error is:

$F(\boldsymbol{x})=\mathbb {E}[e^2]=\mathbb {E}[(t-a)^2]=\mathbb {E} [(t-\boldsymbol{x}^T\boldsymbol{z})^2]\tag{7}$

where the symbol $\mathbb{E}$ denotes expectation, used here in a generalized sense that includes time averages for deterministic signals. The expectation (average, or mean) of the square of the error is taken, hence the name MSE. Expanding equation (7):

$\mathbb{E}[(t-\boldsymbol{x}^T\boldsymbol{z})^2]=\mathbb{E}[t^2]-2\boldsymbol{x}^T\mathbb{E}[t\boldsymbol{z}]+\boldsymbol{x}^T\mathbb{E}[\boldsymbol{z}\boldsymbol{z}^T]\boldsymbol{x}\tag{8}$

Here $\boldsymbol{x}$ is not a random variable, so it can be taken outside the expectations, and equation (8) can be simplified to

$F(\boldsymbol{x})=C-2\boldsymbol{x}^T\boldsymbol{h}+\boldsymbol{x}^TR\boldsymbol{x}\tag{9}$

where:

• $\boldsymbol{h}=\mathbb{E}[t\boldsymbol{z}]$ is, in statistical terms, the cross-correlation between the input and the target.
• $R=\mathbb{E}[\boldsymbol{z}\boldsymbol{z}^T]$ is the input correlation matrix, whose diagonal entries are the mean square values of the input elements.
• $C=\mathbb{E}[t^2]$ is a constant.

Equation (9) is a quadratic function, and we can rewrite it in the standard form:

$F(\boldsymbol{x})=c+ \boldsymbol{d}^T\boldsymbol{x}+\frac{1}{2}\boldsymbol{x}^TA\boldsymbol{x}\tag{10}$

where:

• $c=C$
• $\boldsymbol{d}=-2\boldsymbol{h}$
• $A=2R=2\mathbb{E}[\boldsymbol{z}\boldsymbol{z}^T]$ is a positive definite or positive semidefinite matrix. This means the mean square error function has:
1. a unique strong minimum if $R$ is positive definite, or
2. a weak minimum or no minimum (depending on $\boldsymbol{h}$) if $R$ is positive semidefinite.

The stationary point then satisfies:

$\nabla F(\boldsymbol{x})=-2\boldsymbol{h}+2R\boldsymbol{x}=0\tag{11}$

And finally we get the minimum point of MSE:

$\boldsymbol{x}^{\star}=R^{-1}\boldsymbol{h}\tag{12}$

Because $R$ is built from the inputs, the inputs determine $R$, and $R$ (together with $\boldsymbol{h}$) determines the minimum of the MSE, $\boldsymbol{x}^{\star}$. In other words, even when the Hessian matrix of the performance index is unknown, the inputs can be used to form an estimate of it, since $A=2R$.
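
A small sketch of this idea, using a synthetic training set whose targets come from an assumed linear rule: estimating $R$ and $\boldsymbol{h}$ from samples and solving $R\boldsymbol{x}^{\star}=\boldsymbol{h}$ recovers the underlying parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: targets generated by an assumed linear rule [w1, w2, b]
true_x = np.array([1.0, -2.0, 0.5])
P = rng.normal(size=(200, 2))                # 200 random input vectors
Z = np.hstack([P, np.ones((200, 1))])        # augmented inputs z = [p; 1]
t = Z @ true_x                               # noiseless targets

# Sample estimates of R = E[z z^T] and h = E[t z]
R = Z.T @ Z / len(Z)
h = Z.T @ t / len(Z)

x_star = np.linalg.solve(R, h)               # minimum of the MSE (equation 12)
print(x_star)                                # recovers true_x
```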

## LMS Algorithm

LMS is short for least mean square; it is the algorithm for searching for the minimum of the performance index.

When $\boldsymbol{h}$ and $R$ are known, the stationary point can be found directly. If $R^{-1}$ is impractical to compute, we can use the ‘steepest descent algorithm’. In the most common case, however, both $\boldsymbol{h}$ and $R$ are unknown or inconvenient to compute. For this case we use approximate steepest descent with an estimated gradient.

We also replace the expectation of the squared error with the squared error at iteration $k$:

$\hat{F}(\boldsymbol{x})=(t(k)-a(k))^2=e^2(k)\tag{13}$

and the approximation of the gradient is:

$\hat{\nabla}F=\nabla e^2(k)\tag{14}$

This is the key point of the algorithm, and the estimate is also known as the ‘stochastic gradient’. When this approximation is used in a gradient descent algorithm, the procedure is referred to as

• on-line learning
• incremental learning

which means parameters are updated as each input is presented.

In $\nabla e^2(k)$, the first $R$ elements are derivatives with respect to the network weights, while the last element is the derivative with respect to the bias. So we have:

$[\nabla e^2(k)]_j=\frac{\partial e^2(k)}{\partial w_{1,j}}=2 e(k)\frac{\partial e(k)}{\partial w_{1,j}}\tag{15}$

for $j=1,2,\cdots, R$, and similarly

$[\nabla e^2(k)]_{R+1}=\frac{\partial e^2(k)}{\partial b}=2 e(k)\frac{\partial e(k)}{\partial b}\tag{16}$

Now consider $\frac{\partial e(k)}{\partial w_{1,j}}$, where the error is the difference between the target and the output of an ADALINE neuron with $R$ weights:

\begin{aligned} \frac{\partial e(k)}{\partial w_{1,j}}&=\frac{\partial[t(k)-a(k)]}{\partial w_{1,j}}\\ &=\frac{\partial}{\partial w_{1,j}}\left[t(k)-({}_1\boldsymbol{w}^T\boldsymbol{p}(k)+b)\right]\\ &=\frac{\partial}{\partial w_{1,j}}\left[t(k)-\left(\sum_{i=1}^Rw_{1,i}p_i(k)+b\right)\right]\\ &=-p_j(k) \end{aligned}\tag{17}

and similarly:

$\frac{\partial e(k)}{\partial b} = -1\tag{18}$

Since $\boldsymbol{z}(k)$ is composed of $p_j(k)$ and $1$, as mentioned above, equation (14) can be rewritten as:

$\hat{\nabla}F=\nabla e^2(k)=-2e(k)\boldsymbol{z}(k)\tag{19}$

Equation (19) gives a compact form for the gradient of the squared error, which serves as an approximation to the gradient of the mean squared error.
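
A quick finite-difference sanity check of equation (19), on arbitrary made-up values: the analytic gradient $-2e(k)\boldsymbol{z}(k)$ matches a central-difference estimate of $\nabla e^2(k)$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)                    # parameters [w; b] for a 3-input neuron
z = np.append(rng.normal(size=3), 1.0)    # augmented input [p; 1]
t = 0.7                                   # arbitrary target

def sq_err(x):
    """Squared error e^2 = (t - x^T z)^2 (equation 13)."""
    return (t - x @ z) ** 2

e = t - x @ z
analytic = -2 * e * z                     # equation (19)

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (sq_err(x + eps * np.eye(4)[j]) - sq_err(x - eps * np.eye(4)[j])) / (2 * eps)
    for j in range(4)
])
assert np.allclose(analytic, numeric, atol=1e-5)
```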

Then we substitute this approximation of $\nabla F$ into the steepest descent algorithm:

$\boldsymbol{x}_{k+1}=\boldsymbol{x}_{k}-\alpha \hat{\nabla} F(\boldsymbol{x})\bigg |_{\boldsymbol{x}=\boldsymbol{x}_k}\tag{20}$

and substituting $\hat{\nabla} F = -2e(k)\boldsymbol{z}(k)$ into equation (20):

$\boldsymbol{x}_{k+1}=\boldsymbol{x}_{k}+2\alpha e(k)\boldsymbol{z}(k)\tag{21}$

This is the LMS algorithm, also known as the delta rule or the Widrow-Hoff learning algorithm. For a network with multiple neurons, the weight update takes the matrix form:

$W(k+1)=W(k)+2\alpha \boldsymbol{e}(k) \boldsymbol{p}^T(k)\tag{22}$
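
A minimal sketch of the LMS loop of equation (21), training on one synthetic input at a time; the target rule, learning rate, and number of iterations are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed linear rule generating the targets: t = true_x^T z
true_x = np.array([1.0, -2.0, 0.5])
x = np.zeros(3)                          # initial parameters [w; b]
alpha = 0.05                             # learning rate (below 1/lambda_max of R)

for k in range(2000):
    p = rng.normal(size=2)
    z = np.append(p, 1.0)                # z(k) = [p(k); 1]
    t = true_x @ z                       # noiseless target t(k)
    e = t - x @ z                        # e(k) = t(k) - a(k)
    x = x + 2 * alpha * e * z            # equation (21)

print(x)   # close to true_x
```

Note that the parameters are updated after every single input presentation, which is exactly the on-line (incremental) learning described above.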

## Analysis of Convergence

We have analyzed the convergence of the steepest descent algorithm: it converges when

$\alpha<\frac{2}{\lambda_{\text{max}}}\tag{23}$

where $\lambda_{\text{max}}$ is the largest eigenvalue of the Hessian matrix $A$.

For LMS, $\boldsymbol{x}_k$ is a function of $\boldsymbol{z}(k-1),\cdots, \boldsymbol{z}(0)$. We assume that $\boldsymbol{z}(k-1),\cdots, \boldsymbol{z}(0)$ are statistically independent and that $\boldsymbol{x}_k$ is statistically independent of $\boldsymbol{z}(k)$.

Because what we are using is an approximation of the steepest descent algorithm, which is guaranteed to converge to $\boldsymbol{x}^{\star}=R^{-1}\boldsymbol{h}$ for equation (9), what we need to do is prove that LMS converges to $\boldsymbol{x}^{\star}=R^{-1}\boldsymbol{h}$ as well.

The update procedure of LMS is:

$\boldsymbol{x}_{k+1}=\boldsymbol{x}_{k}+2\alpha e(k)\boldsymbol{z}(k)\tag{24}$

Taking the expectation of both sides of equation (24):

$\mathbb{E}[\boldsymbol{x}_{k+1}]=\mathbb{E}[\boldsymbol{x}_{k}]+2\alpha \mathbb{E}[e(k)\boldsymbol{z}(k)]\tag{25}$

Substituting $t(k)-\boldsymbol{x}_k^T\boldsymbol{z}(k)$ for the error $e(k)$ gives:

$\mathbb{E}[\boldsymbol{x}_{k+1}]=\mathbb{E}[\boldsymbol{x}_{k}]+2\alpha \mathbb{E}[(t(k)-\boldsymbol{x}_k^T\boldsymbol{z}(k))\boldsymbol{z}(k)] \tag{26}$

Substituting $\boldsymbol{z}^T(k)\boldsymbol{x}_k$ for $\boldsymbol{x}^T_k\boldsymbol{z}(k)$ and rearranging terms:

$\mathbb{E}[\boldsymbol{x}_{k+1}]=\mathbb{E}[\boldsymbol{x}_{k}]+2\alpha \{\mathbb{E}[t(k)\boldsymbol{z}(k)]-\mathbb{E}[\boldsymbol{z}(k)\boldsymbol{z}^T(k)\boldsymbol{x}_k]\} \tag{27}$

because $\boldsymbol{x}_k$ is independent of $\boldsymbol{z}(k)$:

$\mathbb{E}[\boldsymbol{x}_{k+1}]=\mathbb{E}[\boldsymbol{x}_{k}]+2\alpha \{\boldsymbol{h}-R\mathbb{E}[\boldsymbol{x}_k]\} \tag{28}$

and this can also be written as:

$\mathbb{E}[\boldsymbol{x}_{k+1}]=[I-2\alpha R]\mathbb{E}[\boldsymbol{x}_{k}]+2\alpha \boldsymbol{h} \tag{29}$

This is a dynamic system, and for it to be stable the eigenvalues of $I-2\alpha R$ must lie strictly between $-1$ and $1$. This matrix has the same eigenvectors as $R$, and its eigenvalues are $1-2\alpha \lambda_i$, so:

\begin{aligned} 1>1-2\alpha\lambda_i>&-1\\ 0<\alpha<&\frac{1}{\lambda_i} \end{aligned}\tag{30}

Equation (30) must hold for every eigenvalue $\lambda_i$, so it is equivalent to:

$\alpha<\frac{1}{\lambda_\text{max}}\tag{31}$

Since $A=2R$, equation (31) gives the same condition as the ‘steepest descent algorithm’: $\alpha < 2/\lambda_{\text{max}}(A) = 1/\lambda_{\text{max}}(R)$.
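
A small demonstration of the bound, iterating equation (29) with an assumed correlation matrix $R$ and cross-correlation $\boldsymbol{h}$: a step size just below $1/\lambda_{\text{max}}$ converges to $\boldsymbol{x}^{\star}$, while one just above it diverges.

```python
import numpy as np

# Assumed (made-up) input correlation matrix and cross-correlation vector
R = np.array([[3.0, 1.0], [1.0, 2.0]])
h = np.array([1.0, 1.0])
lam_max = np.linalg.eigvalsh(R).max()

def run(alpha, steps=500):
    """Iterate the mean dynamics of equation (29) from E[x_0] = 0."""
    Ex = np.zeros(2)
    for _ in range(steps):
        Ex = (np.eye(2) - 2 * alpha * R) @ Ex + 2 * alpha * h
    return Ex

x_star = np.linalg.solve(R, h)

stable = run(0.9 / lam_max)                  # below the bound: converges to x*
assert np.allclose(stable, x_star, atol=1e-6)

unstable = run(1.1 / lam_max, steps=200)     # above the bound: blows up
assert np.linalg.norm(unstable) > np.linalg.norm(x_star)
```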

Under this condition, when the system reaches steady state, we have:

$\mathbb{E}[\boldsymbol{x}_{\text{ss}}]=[I-2\alpha R]\mathbb{E}[\boldsymbol{x}_{\text{ss}}] +2 \alpha \boldsymbol{h} \tag{32}$

or

$\mathbb{E}[\boldsymbol{x}_{\text{ss}}]=R^{-1}\boldsymbol{h}=\boldsymbol{x}^{\star} \tag{33}$

Equation (33) shows that the LMS solution, obtained by presenting inputs one at a time, converges in expectation to the same solution we would obtain if the Hessian matrix were known.

## References

1. Demuth, H.B., Beale, M.H., De Jesús, O. and Hagan, M.T., 2014. Neural Network Design. Martin Hagan. ↩︎