The Backpropagation Algorithm (Part I)

Keywords: backpropagation, BP

Architecture[1]

We have seen that a three-layer network is flexible in approximating functions: a network with three or more layers can approximate any function as closely as we want. However, another problem then arises: how to train such networks. This problem almost killed neural networks in the 1970s, until the backpropagation (BP for short) algorithm was found to be an efficient way to train multiple-layer networks.

A 3-layer network is also used in this post because it is the simplest multiple-layer network. Its abbreviated notation is:

and a shorter way to represent its architecture is:

$$R - S^1 - S^2 - S^3 \tag{1}$$

Because the three-layer network has only three layers, it is not too large to denote mathematically, and it can be written as:

$$\boldsymbol{a}=f^3(W^3\cdot f^2(W^2\cdot f^1(W^1\cdot \boldsymbol{p}+\boldsymbol{b}^1)+\boldsymbol{b}^2)+\boldsymbol{b}^3)\tag{2}$$

but this becomes unwieldy when we have a 10-layer or 100-layer network. Instead we can use a short recurrence that describes the whole operation of an $M$-layer network:

$$\boldsymbol{a}^{m+1}=f^{m+1}(W^{m+1}\boldsymbol{a}^{m}+\boldsymbol{b}^{m+1})\tag{3}$$

for $m = 0, 1, 2, \cdots, M-1$, where $M$ is the number of layers in the network. And:

  • $\boldsymbol{a}^0=\boldsymbol{p}$ is its input
  • $\boldsymbol{a}=\boldsymbol{a}^M$ is its output
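The recurrence in equation (3) translates almost directly into code. Below is a minimal numpy sketch of the forward propagation; the names (`logsig`, `forward`) and the sample 1-2-1 parameters are illustrative assumptions, not fixed by the text:

```python
import numpy as np

def logsig(n):
    """Log-sigmoid transfer function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-n))

def forward(p, weights, biases, transfers):
    """Equation (3): a^{m+1} = f^{m+1}(W^{m+1} a^m + b^{m+1}), a^0 = p."""
    a = p
    for W, b, f in zip(weights, biases, transfers):
        a = f(W @ a + b)
    return a

# A hypothetical 1-2-1 network: logsig hidden layer, linear output layer
weights = [np.array([[-0.27], [-0.41]]), np.array([[0.09, -0.17]])]
biases = [np.array([[-0.48], [-0.13]]), np.array([[0.48]])]
transfers = [logsig, lambda n: n]

a = forward(np.array([[1.0]]), weights, biases, transfers)  # a ≈ [[0.446]]
```

The loop makes clear why the short recurrence matters: the same two lines handle a 3-layer or a 100-layer network.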

Performance Index

We now have a network, and ‘performance learning’ is the kind of learning rule used here, so we need a definite performance index for the 3-layer network. MSE is used as the performance index, the same as in the LMS algorithm of the posts ‘Widrow-Hoff Learning (Part I)’ and ‘Widrow-Hoff Learning (Part II)’. And a set of examples of proper network behavior is needed:

$$\{\boldsymbol{p}_1,\boldsymbol{t}_1\},\{\boldsymbol{p}_2,\boldsymbol{t}_2\},\cdots, \{\boldsymbol{p}_Q,\boldsymbol{t}_Q\}\tag{4}$$

where $\boldsymbol{p}_q$ is an input and $\boldsymbol{t}_q$ is the corresponding target output.

Recall that in the steepest descent algorithm we have a definite objective function, which the algorithm minimizes iteratively. BP is a generalization of the LMS algorithm, and both of them try to minimize the mean square error. However, the true function underlying the data is unknown, so at each iteration an approximation of the error, calculated from our observations, is used in the algorithm. What we finally get is a trained neural network that fits the training set but is not guaranteed to fit the original task. So a good training set, one that represents the original task as closely as possible, is necessary.

To make the transition from the steepest descent algorithm to LMS and BP easier to understand, we collect the weights and biases of the network, $w$ and $b$, into a single vector $\boldsymbol{x}$. Then the performance index is:

$$F(\boldsymbol{x})=\mathbb E[e^2]=\mathbb E[(t-a)^2]\tag{5}$$

When the network has multiple outputs, this generalizes to:

$$F(\boldsymbol{x})=\mathbb E[\boldsymbol{e}^T\boldsymbol{e}]=\mathbb E[(\boldsymbol{t}-\boldsymbol{a})^T(\boldsymbol{t}-\boldsymbol{a})]\tag{6}$$

During an iteration of the LMS algorithm, the MSE (mean square error) is approximated by the SE (squared error):

$$\hat{F}(\boldsymbol{x})=(\boldsymbol{t}-\boldsymbol{a})^T(\boldsymbol{t}-\boldsymbol{a})=\boldsymbol{e}^T\boldsymbol{e}\tag{7}$$

where the expectation is replaced by the value computed from the current input, output, and target.

Reviewing the ‘steepest descent algorithm’, the update based on the approximate MSE, which is also called stochastic gradient descent, is:

$$\begin{aligned} w^m_{i,j}(k+1)&=w^m_{i,j}(k)-\alpha \frac{\partial \hat{F}}{\partial w^m_{i,j}}\\ b^m_{i}(k+1)&=b^m_{i}(k)-\alpha \frac{\partial \hat{F}}{\partial b^m_{i}} \end{aligned}\tag{8}$$

where $\alpha$ is the learning rate.

However, the steepest descent algorithm seems unable to work on a multiple-layer network, because we cannot calculate the partial derivatives for the hidden layers directly. So we should recall another mathematical tool: the chain rule.

Chain Rule

Calculus

When $f$ is an explicit function of $n$ and $n$ is an explicit function of $w$, we can calculate the partial derivative $\frac{\partial f}{\partial w}$ by:

$$\frac{\partial f}{\partial w}=\frac{\partial f}{\partial n}\frac{\partial n}{\partial w}\tag{9}$$

The whole process looks like a chain. Let’s look at a simple example: when we have $f(n)=e^n$ and $n=2w$, we have $f(n(w))=e^{2w}$. We can easily calculate the derivative $\frac{\partial f}{\partial w}=\frac{\partial e^{2w}}{\partial w}=2e^{2w}$. And when the chain rule is used, we have:

$$\frac{\partial f(n(w))}{\partial w}=\frac{\partial e^n}{\partial n}\frac{\partial n}{\partial w}=\frac{\partial e^n}{\partial n}\frac{\partial 2w}{\partial w}=e^n\cdot 2=2e^{2w}\tag{10}$$

which is the same as what we got directly.
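This agreement can also be checked numerically with a central finite difference. A small illustrative script (the evaluation point `w = 0.3` is an arbitrary choice):

```python
import math

# f(n(w)) = e^{2w}: the composed function from the example
f = lambda w: math.exp(2 * w)

# chain rule result: df/dn * dn/dw = e^n * 2, evaluated at n = 2w
analytic = lambda w: math.exp(2 * w) * 2

w, h = 0.3, 1e-6
numeric = (f(w + h) - f(w - h)) / (2 * h)  # central difference
# numeric and analytic(w) agree to high precision
```

Finite-difference checks like this are a standard way to validate gradient code, including backpropagation implementations.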

When the chain rule is applied to the partial derivatives on the right-hand side of equation (8), we get the way to calculate the derivatives for the weights of the hidden layers:

$$\begin{aligned} \frac{\partial \hat{F}}{\partial w^m_{i,j}}&=\frac{\partial \hat{F}}{\partial n^m_i}\cdot \frac{\partial n^m_i}{\partial w^m_{i,j}}\\ \frac{\partial \hat{F}}{\partial b^m_{i}}&=\frac{\partial \hat{F}}{\partial n^m_i}\cdot \frac{\partial n^m_i}{\partial b^m_{i}} \end{aligned}\tag{11}$$

From the abbreviated notation, we know that $n^m_i=\sum^{S^{m-1}}_{j=1}w^m_{i,j}a^{m-1}_{j}+b^m_i$. Then equation (11) can be written as:

$$\begin{aligned} \frac{\partial \hat{F}}{\partial w^m_{i,j}}&=\frac{\partial \hat{F}}{\partial n^m_i}\cdot \frac{\partial \left(\sum^{S^{m-1}}_{j=1}w^m_{i,j}a^{m-1}_{j}+b^m_i\right)}{\partial w^m_{i,j}}=\frac{\partial \hat{F}}{\partial n^m_i}\cdot a^{m-1}_j\\ \frac{\partial \hat{F}}{\partial b^m_{i}}&=\frac{\partial \hat{F}}{\partial n^m_i}\cdot \frac{\partial \left(\sum^{S^{m-1}}_{j=1}w^m_{i,j}a^{m-1}_{j}+b^m_i\right)}{\partial b^m_{i}}=\frac{\partial \hat{F}}{\partial n^m_i}\cdot 1 \end{aligned}\tag{12}$$

Equation (12) could also be simplified by defining a new concept: sensitivity.

Sensitivity

We define the sensitivity as $s^m_i\equiv \frac{\partial \hat{F}}{\partial n^m_{i}}$, which means the sensitivity of $\hat{F}$ to changes in the $i^{\text{th}}$ element of the net input at layer $m$. Then equation (12) can be simplified as:

$$\begin{aligned} \frac{\partial \hat{F}}{\partial w^m_{i,j}}&=s^m_{i}\cdot a^{m-1}_j\\ \frac{\partial \hat{F}}{\partial b^m_{i}}&=s^m_{i}\cdot 1 \end{aligned}\tag{13}$$

Then the steepest descent algorithm is:

$$\begin{aligned} w^m_{i,j}(k+1)&=w^m_{i,j}(k)-\alpha s^m_{i}\cdot a^{m-1}_j\\ b^m_{i}(k+1)&=b^m_{i}(k)-\alpha s^m_{i}\cdot 1 \end{aligned}\tag{14}$$

This can also be written in a matrix form:

$$\begin{aligned} W^m(k+1)&=W^m(k)-\alpha \boldsymbol{s}^m(\boldsymbol{a}^{m-1})^T\\ \boldsymbol{b}^m(k+1)&=\boldsymbol{b}^m(k)-\alpha \boldsymbol{s}^m\cdot 1 \end{aligned}\tag{15}$$

where:

$$\boldsymbol{s}^m=\frac{\partial \hat{F}}{\partial \boldsymbol{n}^m}=\begin{bmatrix} \frac{\partial \hat{F}}{\partial n^m_1}\\ \frac{\partial \hat{F}}{\partial n^m_2}\\ \vdots\\ \frac{\partial \hat{F}}{\partial n^m_{S^m}} \end{bmatrix}\tag{16}$$

And be careful to distinguish $\boldsymbol{s}^m$, which denotes the sensitivity, from $S^m$, which denotes the number of neurons in layer $m$.

Backpropagating the Sensitivities

Equation (15) is our BP algorithm, but we cannot calculate the sensitivities yet. We can easily calculate the sensitivities of the last layer, just as in LMS. And the inspiration is that we can use the relation between the next layer and the current layer. So let’s look at the Jacobian matrix that relates the next layer’s net input $\boldsymbol{n}^{m+1}$ to the current layer’s net input $\boldsymbol{n}^m$:

$$\frac{\partial \boldsymbol{n}^{m+1}}{\partial \boldsymbol{n}^{m}}= \begin{bmatrix} \frac{\partial n^{m+1}_1}{\partial n^{m}_1} & \frac{\partial n^{m+1}_1}{\partial n^{m}_2} & \cdots & \frac{\partial n^{m+1}_1}{\partial n^{m}_{S^m}}\\ \frac{\partial n^{m+1}_2}{\partial n^{m}_1} & \frac{\partial n^{m+1}_2}{\partial n^{m}_2} & \cdots & \frac{\partial n^{m+1}_2}{\partial n^{m}_{S^m}}\\ \vdots&\vdots&&\vdots\\ \frac{\partial n^{m+1}_{S^{m+1}}}{\partial n^{m}_1} & \frac{\partial n^{m+1}_{S^{m+1}}}{\partial n^{m}_2} & \cdots & \frac{\partial n^{m+1}_{S^{m+1}}}{\partial n^{m}_{S^m}} \end{bmatrix}\tag{17}$$

And the $(i,j)^{\text{th}}$ element of the matrix is:

$$\begin{aligned} \frac{\partial n^{m+1}_i}{\partial n^{m}_j}&=\frac{\partial \left(\sum^{S^m}_{l=1}w^{m+1}_{i,l}a^m_l+b^{m+1}_i\right)}{\partial n^m_j}\\ &= w^{m+1}_{i,j}\frac{\partial a^m_j}{\partial n^m_j}\\ &= w^{m+1}_{i,j}\frac{\partial f^m(n^m_j)}{\partial n^m_j}\\ &= w^{m+1}_{i,j}\dot{f}^m(n^m_j) \end{aligned}\tag{18}$$

where $\sum^{S^m}_{l=1}w^{m+1}_{i,l}a^m_l+b^{m+1}_i$ is the net input of layer $m+1$ and $a^m_l$ is the output of layer $m$. And we define $\dot{f}^m(n^m_j)=\frac{\partial f^m(n^m_j)}{\partial n^m_j}$.

Therefore the Jacobian matrix can be written as:

$$\begin{aligned} \frac{\partial \boldsymbol{n}^{m+1}}{\partial \boldsymbol{n}^{m}} &=W^{m+1}\dot{F}^m(\boldsymbol{n}^m)\\ &=\begin{bmatrix} w^{m+1}_{1,1}\dot{f}^m(n^m_1) & w^{m+1}_{1,2}\dot{f}^m(n^m_2) & \cdots & w^{m+1}_{1,{S^m}}\dot{f}^m(n^m_{S^m})\\ w^{m+1}_{2,1}\dot{f}^m(n^m_1) & w^{m+1}_{2,2}\dot{f}^m(n^m_2) & \cdots & w^{m+1}_{2,{S^m}}\dot{f}^m(n^m_{S^m})\\ \vdots&\vdots&&\vdots\\ w^{m+1}_{S^{m+1},1}\dot{f}^m(n^m_1) & w^{m+1}_{S^{m+1},2}\dot{f}^m(n^m_2) & \cdots & w^{m+1}_{S^{m+1},{S^m}}\dot{f}^m(n^m_{S^m}) \end{bmatrix} \end{aligned} \tag{19}$$

where we have:

$$\dot{F}^m(\boldsymbol{n}^m)= \begin{bmatrix} \dot{f}^m(n^m_1)&0&\cdots&0\\ 0&\dot{f}^m(n^m_2)&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\dot{f}^m(n^m_{S^m}) \end{bmatrix}\tag{20}$$
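For a log-sigmoid layer, the diagonal matrix in equation (20) can be built directly from the layer output, using the identity $\dot{f}(n)=(1-a)a$ with $a=\text{logsig}(n)$. A small numpy sketch (the function names are illustrative assumptions):

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

def F_dot_logsig(n):
    """Diagonal matrix of transfer-function derivatives, equation (20),
    for a logsig layer: uses d/dn logsig(n) = (1 - a) * a, a = logsig(n)."""
    a = logsig(np.asarray(n, dtype=float)).ravel()
    return np.diag((1.0 - a) * a)

F = F_dot_logsig([-0.75, -0.54])  # 2x2 matrix, diagonal ≈ [0.218, 0.233]
```

In practice the diagonal matrix is rarely materialized; multiplying element-wise by the derivative vector gives the same result more cheaply. The explicit `np.diag` form is kept here to mirror equation (20).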

Then the recurrence relation for the sensitivity, using the chain rule in matrix form, is:

$$\begin{aligned} \boldsymbol{s}^m&=\frac{\partial \hat{F}}{\partial \boldsymbol{n}^m}=\left(\frac{\partial \boldsymbol{n}^{m+1}}{\partial \boldsymbol{n}^{m}}\right)^T \frac{\partial \hat{F}}{\partial \boldsymbol{n}^{m+1}}\\ &=\dot{F}^m(\boldsymbol{n}^m)(W^{m+1})^T\boldsymbol{s}^{m+1} \end{aligned}\tag{21}$$

This is why it is called backpropagation: the sensitivities of layer $m$ are calculated from layer $m+1$:

$$\boldsymbol{s}^{M}\to \boldsymbol{s}^{M-1}\to \boldsymbol{s}^{M-2}\to \cdots \to \boldsymbol{s}^{1}\tag{22}$$

Like the LMS algorithm, BP is also an approximation of the steepest descent technique. And the starting point of BP, $s^M_i$, is:

$$\begin{aligned} s^M_i&=\frac{\partial \hat{F}}{\partial n^M_i}=\frac{\partial (\boldsymbol{t}-\boldsymbol{a})^T(\boldsymbol{t}-\boldsymbol{a})}{\partial n^M_i}\\ &=\frac{\partial \sum_{j=1}^{S^M}(t_j-a_j)^2}{\partial n^M_i}\\ &=-2(t_i-a_i)\frac{\partial a_i}{\partial n^M_{i}} \end{aligned}\tag{23}$$

and this is easy to understand because it is just a variation of the LMS algorithm. Since

$$\frac{\partial a_i}{\partial n^M_i}=\frac{\partial f^M(n^M_i)}{\partial n^M_i}=\dot{f}^M(n^M_i)\tag{24}$$

we can write:

$$s^M_i=-2(t_i-a_i)\dot{f}^M(n^M_i)\tag{25}$$

and its matrix form is:

$$\boldsymbol{s}^M=-2\dot{F}^M(\boldsymbol{n}^M)(\boldsymbol{t}-\boldsymbol{a})\tag{26}$$

Summary of BP

  1. Propagate the input forward through the network:
    • $\boldsymbol{a}^0=\boldsymbol{p}$
    • $\boldsymbol{a}^{m+1}=f^{m+1}(W^{m+1}\boldsymbol{a}^m+\boldsymbol{b}^{m+1})$ for $m=0,1,2,\cdots, M-1$
    • $\boldsymbol{a}=\boldsymbol{a}^M$
  2. Propagate the sensitivities backward through the network:
    • $\boldsymbol{s}^M=-2\dot{F}^M(\boldsymbol{n}^M)(\boldsymbol{t}-\boldsymbol{a})$
    • $\boldsymbol{s}^m= \dot{F}^m(\boldsymbol{n}^m)(W^{m+1})^T\boldsymbol{s}^{m+1}$ for $m=M-1,\cdots,2,1$
  3. Finally, update the weights and biases using the approximate steepest descent rule:
    • $W^{m}(k+1)=W^{m}(k)-\alpha \boldsymbol{s}^m(\boldsymbol{a}^{m-1})^T$
    • $\boldsymbol{b}^{m}(k+1)=\boldsymbol{b}^{m}(k)-\alpha \boldsymbol{s}^m$
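The three steps above can be collected into a single training iteration. The sketch below hard-codes a 1-2-1 network with a log-sigmoid hidden layer and a linear output layer, the same architecture as the example in the next section; the function name `bp_step` and the layout of the arguments are assumptions of this sketch:

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

def bp_step(p, t, W1, b1, W2, b2, alpha):
    """One BP iteration for a 1-2-1 network: logsig hidden layer,
    linear output layer. p and t are column vectors (here 1x1)."""
    # 1. Propagate the input forward
    a0 = p
    a1 = logsig(W1 @ a0 + b1)
    a2 = W2 @ a1 + b2                       # linear output layer
    # 2. Propagate the sensitivities backward
    s2 = -2.0 * (t - a2)                    # F_dot of a linear layer is I
    F1 = np.diag(((1.0 - a1) * a1).ravel()) # logsig derivative, eq. (20)
    s1 = F1 @ W2.T @ s2
    # 3. Approximate steepest descent updates
    return (W1 - alpha * s1 @ a0.T, b1 - alpha * s1,
            W2 - alpha * s2 @ a1.T, b2 - alpha * s2)
```

Calling `bp_step` repeatedly, one training pair at a time, implements the stochastic (incremental) version of the algorithm described above.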

An Example of the BP Algorithm

Here we look at a simple example. Consider the function:

$$g(p)=1+\sin\left(\frac{\pi}{4}p\right) \text{ for } -2\leq p\leq 2\tag{27}$$

Then the sampling process should be done: we take sample function values at some selected inputs $p$. For example, we have the set

$$p\in\{-2,-1.8,\cdots ,1.8,2.0\}\tag{28}$$

and the corresponding function values of $g(p)$ are

$$\left\{1+\sin\left(\frac{\pi}{4}\times(-2)\right),\ 1+\sin\left(\frac{\pi}{4}\times(-1.8)\right),\ \cdots,\ 1+\sin\left(\frac{\pi}{4}\times(1.8)\right),\ 1+\sin\left(\frac{\pi}{4}\times(2)\right)\right\} \tag{29}$$
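This sampled training set can be generated with a few lines of numpy (a sketch; the variable names are illustrative, and the grid matches the set in equation (28)):

```python
import numpy as np

# 21 evenly spaced inputs p = -2, -1.8, ..., 2.0
p = np.linspace(-2.0, 2.0, 21)
# targets t = g(p) = 1 + sin(pi/4 * p)
t = 1 + np.sin(np.pi / 4 * p)

training_set = list(zip(p, t))  # the {p_q, t_q} pairs of equation (4)
```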

So we now have training data, and next we need to choose a network architecture. The following network will be used:

This 1-2-1 network, with a log-sigmoid first layer and a linear second layer, is relatively simple. What we should do next is choose initial values for the network weights and biases. They are usually chosen to be small random values; here we choose all of these parameters from a uniform distribution over $[-1,1]$.

And the inputs $\{-2.0,-1.8,-1.6,\cdots,1.8,2.0\}$ are fed into the 1-2-1 network, whose initial weights are:

$$W^1(0)=\begin{bmatrix} -0.27\\-0.41 \end{bmatrix},\quad \boldsymbol{b}^1(0)=\begin{bmatrix} -0.48\\-0.13 \end{bmatrix},\quad W^2(0)=\begin{bmatrix} 0.09&-0.17 \end{bmatrix},\quad \boldsymbol{b}^2(0)=\begin{bmatrix} 0.48 \end{bmatrix}\tag{30}$$

and the outputs of the network are:

Now let’s start the algorithm step by step:

  1. We take the input $\boldsymbol{a}^0=\boldsymbol{p}=1$ as an illustration
    1. The output of the first layer is then

      $$\begin{aligned} \boldsymbol{a}^1&=f^1(W^1\boldsymbol{a}^0+\boldsymbol{b}^1)=\text{logsig}\left(\begin{bmatrix}-0.27\\-0.41\end{bmatrix}\begin{bmatrix}1\end{bmatrix}+\begin{bmatrix}-0.48\\-0.13\end{bmatrix}\right)\\ &=\text{logsig}\left(\begin{bmatrix}-0.75\\-0.54\end{bmatrix}\right)=\begin{bmatrix}\frac{1}{1+e^{0.75}}\\\frac{1}{1+e^{0.54}}\end{bmatrix}=\begin{bmatrix}0.321\\0.368\end{bmatrix} \end{aligned}$$

    2. The output of the second (linear) layer is:

      $$\begin{aligned} \boldsymbol{a}^2&=f^2(W^2\boldsymbol{a}^1+\boldsymbol{b}^2)\\ &=\text{purelin}\left(\begin{bmatrix}0.09&-0.17\end{bmatrix}\begin{bmatrix}0.321\\0.368\end{bmatrix}+\begin{bmatrix}0.48\end{bmatrix}\right)\\ &=\begin{bmatrix}0.446\end{bmatrix} \end{aligned}$$

  2. Because we have generated both the input and target in equations (27)-(29), we can calculate the error:

    $$\begin{aligned} e=t-a&=\left\{1+\sin\left(\frac{\pi}{4}p\right)\right\}-a^2\\ &=\left\{1+\sin\left(\frac{\pi}{4}\times 1\right)\right\}-0.446\\ &=1.261 \end{aligned}$$

  3. We then backpropagate the sensitivities, for which the derivatives of the transfer functions are needed:

    $$\begin{aligned} \dot{f}^1(n)&=\frac{d}{dn}\left(\frac{1}{1+e^{-n}}\right)=\frac{e^{-n}}{(1+e^{-n})^2}\\ &=\left(1-\frac{1}{1+e^{-n}}\right)\left(\frac{1}{1+e^{-n}}\right)=(1-a^1)(a^1) \end{aligned}$$

    and

    $$\dot{f}^2(n)=\frac{d}{dn}(n)=1$$

  4. Now we have all the components required for backpropagation
    1. The starting point is found in the second layer:

      $$\begin{aligned} \boldsymbol{s}^2&=-2\dot{F}^2(\boldsymbol{n}^2)(\boldsymbol{t}-\boldsymbol{a})=-2\begin{bmatrix} \dot{f}^2(n^2) \end{bmatrix}(1.261)\\ &=-2\begin{bmatrix} 1 \end{bmatrix}(1.261)=-2.522 \end{aligned}$$

    2. And the sensitivity of the first layer is:

      $$\begin{aligned} \boldsymbol{s}^1&=\dot{F}^1(\boldsymbol{n}^1)(W^2)^T\boldsymbol{s}^2\\ &=\begin{bmatrix} (1-a^1_1)(a^1_1)&0\\ 0&(1-a^1_2)(a^1_2) \end{bmatrix}\begin{bmatrix} 0.09\\ -0.17 \end{bmatrix}\begin{bmatrix} -2.522 \end{bmatrix}\\ &=\begin{bmatrix} (1-0.321)(0.321)&0\\ 0&(1-0.368)(0.368) \end{bmatrix}\begin{bmatrix} 0.09\\ -0.17 \end{bmatrix}\begin{bmatrix} -2.522 \end{bmatrix}\\ &=\begin{bmatrix} 0.218&0\\ 0&0.233 \end{bmatrix}\begin{bmatrix} -0.227\\ 0.429 \end{bmatrix}=\begin{bmatrix} -0.0495\\ 0.0997 \end{bmatrix} \end{aligned}$$

  5. Finally we update the weights. For simplicity, we will use a learning rate $\alpha = 0.1$:

    $$\begin{aligned} W^2(1)&=W^2(0)-\alpha\boldsymbol{s}^2(\boldsymbol{a}^1)^T\\ &=\begin{bmatrix}0.09&-0.17\end{bmatrix}-0.1\begin{bmatrix}-2.522\end{bmatrix}\begin{bmatrix}0.321&0.368\end{bmatrix}\\ &=\begin{bmatrix}0.171&-0.0772\end{bmatrix} \end{aligned}$$

    $$\begin{aligned} \boldsymbol{b}^2(1)&=\boldsymbol{b}^2(0)-\alpha\boldsymbol{s}^2\cdot 1\\ &=\begin{bmatrix}0.48\end{bmatrix}-0.1\begin{bmatrix}-2.522\end{bmatrix}=\begin{bmatrix}0.732\end{bmatrix} \end{aligned}$$

    $$\begin{aligned} W^1(1)&=W^1(0)-\alpha\boldsymbol{s}^1(\boldsymbol{a}^0)^T\\ &=\begin{bmatrix}-0.27\\-0.41\end{bmatrix}-0.1\begin{bmatrix}-0.0495\\0.0997\end{bmatrix}\begin{bmatrix}1\end{bmatrix}=\begin{bmatrix}-0.265\\-0.420\end{bmatrix} \end{aligned}$$

    $$\begin{aligned} \boldsymbol{b}^1(1)&=\boldsymbol{b}^1(0)-\alpha\boldsymbol{s}^1\cdot 1\\ &=\begin{bmatrix}-0.48\\-0.13\end{bmatrix}-0.1\begin{bmatrix}-0.0495\\0.0997\end{bmatrix}=\begin{bmatrix}-0.475\\-0.140\end{bmatrix} \end{aligned}$$
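The whole first iteration can be checked numerically. The following numpy script (variable names are illustrative) reproduces the hand calculation above:

```python
import numpy as np

logsig = lambda n: 1.0 / (1.0 + np.exp(-n))

# Initial parameters, equation (30)
W1 = np.array([[-0.27], [-0.41]]); b1 = np.array([[-0.48], [-0.13]])
W2 = np.array([[0.09, -0.17]]);    b2 = np.array([[0.48]])
alpha = 0.1

# Forward pass for p = 1
p = np.array([[1.0]])
t = 1 + np.sin(np.pi / 4 * 1.0)
a1 = logsig(W1 @ p + b1)                            # ≈ [0.321, 0.368]
a2 = W2 @ a1 + b2                                   # linear layer, ≈ 0.446
e = t - a2                                          # ≈ 1.261

# Backward pass
s2 = -2.0 * e                                       # ≈ -2.522
s1 = np.diag(((1 - a1) * a1).ravel()) @ W2.T @ s2   # ≈ [-0.0495, 0.0997]

# Parameter updates
W2_new = W2 - alpha * s2 @ a1.T                     # ≈ [0.171, -0.0772]
b2_new = b2 - alpha * s2                            # ≈ [0.732]
W1_new = W1 - alpha * s1 @ p.T                      # ≈ [-0.265, -0.420]
b1_new = b1 - alpha * s1                            # ≈ [-0.475, -0.140]
```

Repeating this iteration over all 21 training pairs, many times, trains the network to approximate $g(p)$.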

References


  1. Demuth, Howard B., Mark H. Beale, Orlando De Jesús, and Martin T. Hagan. Neural Network Design. 2nd ed. Martin Hagan, 2014. ↩︎