ML Notes

View on GitHub

Neural Networks

Perceptron

Perceptron Algorithm

Error Function

Importance of Activation Functions

Sigmoid Activation Function

Softmax Function

One-Hot Encoding

Maximum Likelihood

Cross-Entropy Error Function

Cross-Entropy Motivation

Cross-Entropy Notation

Cross-Entropy Equations

\[\begin{align}\hat{y}_{i}&=\sigma(WX_{i}+b) \\ \hat{y}_{i}&=\sigma(\sum_{j=1}^{n}w_{j}x_{j}+b) \\ \hat{y}_{i}&=\sigma(w_{1}x_{1}+\ldots+w_{n}x_{n}+b)\end{align}\] \[E=-\frac{1}{m}\sum_{i=1}^{m}y_{i}\ln(\hat{y}_{i})+(1-y_{i})\ln(1-\hat{y}_{i})\] \[E(W,b)=-\frac{1}{m}\sum_{i=1}^{m}y_{i}\ln(\sigma(WX_{i}+b))+(1-y_{i})\ln(1-\sigma(WX_{i}+b))\] \[E=-\sum_{i=1}^{m}\sum_{j=1}^{n}y_{ij}\ln(\hat{y}_{ij})\]

Cross-Entropy Explanations

Gradient Descent

Gradient Descent Motivation

Gradient Descent Notation

GD Single Train Sample \(\nabla E_{i}\)

For a single training sample, \(X_{i}\):

\[\begin{align}\nabla E_{i} &= (\frac{\partial}{\partial w_{1}}E_{i},\ldots,\frac{\partial}{\partial w_{n}}E_{i},\frac{\partial}{\partial b}E_{i})\\ &= (\hat{y}_{i}-y_{i})(x_{1},\ldots,x_{n},1)\end{align}\]

Proof.

\[\text{Individual training sample predictions:}\\ \hat{y}_{i}=\sigma(WX_{i}+b)\] \[\text{Individual training sample error:}\\ E_{i}=-y_{i}\ln(\hat{y}_{i})-(1-y_{i})\ln(1-\hat{y}_{i})\] \[\text{Gradient is equal to partial derivatives of error for each weight:}\\ \nabla E_{i} = (\frac{\partial}{\partial w_{1}}E_{i},\ldots,\frac{\partial}{\partial w_{n}}E_{i},\frac{\partial}{\partial b}E_{i})\\\\\] \[\text{Sigmoid function derivative:}\\ \begin{align} \sigma'(x) &=\frac{d}{dx}\left(\frac{1}{1+e^{-x}}\right) \\ &=\frac{e^{-x}}{(1+e^{-x})^{2}} &&\text{(quotient rule)} \\ &=\frac{1}{1+e^{-x}}\cdot \frac{e^{-x}}{1+e^{-x}} \\ &=\sigma(x)(1-\sigma(x))&&\text{(long division)}\\\\ \end{align}\] \[\text{Partial derivative of prediction:}\] \[\begin{align} \frac{\partial}{\partial w_{j}}\hat{y}_{i}&=\frac{\partial}{\partial w_{j}}(\sigma(WX_{i}+b)) &&(\hat{y}_{i}\text{ formula)} \\ &= \sigma(WX_{i}+b)(1-\sigma(WX_{i}+b))\cdot\frac{\partial}{\partial w_{j}}(WX_{i}+b) &&(\sigma'(x) \text{ formula)}\\ &= \hat{y}_{i}(1-\hat{y}_{i})\cdot\frac{\partial}{\partial w_{j}}(WX_{i}+b) \\ &= \hat{y}_{i}(1-\hat{y}_{i})\cdot\frac{\partial}{\partial w_{j}}(w_{1}x_{1}+\ldots+w_{j}x_{j}+\ldots+w_{n}x_{n}+b) \\ &= \hat{y}_{i}(1-\hat{y}_{i})\cdot(0+\ldots+x_{j}+\ldots+0) &&\text{(partial derivative)}\\ &= \hat{y}_{i}(1-\hat{y}_{i})\cdot x_{j}\\\\ \end{align}\] \[\text{Partial derivative of error:}\] \[\begin{align}\frac{\partial}{\partial w_{j}}E_{i}&=\frac{\partial}{\partial w_{j}}(-y_{i}\ln(\hat{y}_{i})-(1-y_{i})\ln(1-\hat{y}_{i})) &&(E_{i}\text{ formula)}\\ &= -y_{i}(\frac{\partial}{\partial w_{j}}(\ln(\hat{y}_{i})))-(1-y_{i})(\frac{\partial}{\partial w_{j}}(\ln(1-\hat{y}_{i})))\\ &= -y_{i}(\frac{1}{\hat{y}_{i}}\cdot\frac{\partial}{\partial w_{j}}(\hat{y}_{i}))-(1-y_{i})(\frac{1}{1-\hat{y}_{i}}\cdot\frac{\partial}{\partial w_{j}}(1-\hat{y}_{i})) &&\text{(chain rule)}\\ &= -y_{i}(\frac{1}{\hat{y}_{i}}\cdot\hat{y}_{i}(1-\hat{y}_{i})x_{j})-(1-y_{i})(\frac{1}{1-\hat{y}_{i}}\cdot(-1)\hat{y}_{i}(1-\hat{y}_{i})x_{j})&&(\frac{\partial}{\partial w_{j}}\hat{y}_{i}\text{ formula)}\\ &= -y_{i}(1-\hat{y}_{i})x_{j}+(1-y_{i})\hat{y}_{i}\cdot x_{j}\\ &= (-y_{i}+y_{i}\hat{y}_{i}+\hat{y}_{i}-y_{i}\hat{y}_{i})x_{j}\\ &= -(y_{i}-\hat{y}_{i})x_{j}\\\\ \end{align}\] \[\text{By a similar proof:}\] \[\frac{\partial}{\partial b}E_{i}=-(y_{i}-\hat{y}_{i})\\\\\] \[\text{Gradient of error:}\] \[\begin{align} \nabla E_{i} &= (\frac{\partial}{\partial w_{1}}E_{i},\ldots,\frac{\partial}{\partial w_{n}}E_{i},\frac{\partial}{\partial b}E_{i})\\ &= \left(-(y_{i}-\hat{y}_{i})x_{1},\ldots,-(y_{i}-\hat{y}_{i})x_{n},-(y_{i}-\hat{y}_{i})\right)\\ &= -(y_{i}-\hat{y}_{i})(x_{1},\ldots,x_{n},1)\\ &= (\hat{y}_{i}-y_{i})(x_{1},\ldots,x_{n},1)\ \ \ \ \blacksquare\\\\ \end{align}\]

GD Overall \(\nabla E\)

For \(m\) training samples:

\[\nabla E = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_{i}-y_{i})(x_{1},\ldots,x_{n},1)\]

Proof.

\[\text{Overall error = average of individual train sample errors:}\] \[E=\frac{1}{m}\sum_{i=1}^{m}E_{i}\] \[\text{Overall gradient of error = average of individual train sample gradients:}\] \[\begin{align}\nabla E &= \frac{1}{m}\sum_{i=1}^{m}\nabla E_{i}\\ &=\frac{1}{m}\sum_{i=1}^{m}(\hat{y}_{i}-y_{i})(x_{1},\ldots,x_{n},1)\ \ \ \ \blacksquare\\\\\end{align}\]

Logistic Regression Algorithm

When Batch Size \(=1\)

  1. Initialize random weights: \(w_{1},\ldots,w_{n},b\)
  2. For every train sample: \(X_{1},\ldots,X_{m}\)
    • Update weights: \(w_{j}\leftarrow w_{j}-\alpha\frac{\partial}{\partial w_{j}}E_{i}\)
    • Update bias: \(b\leftarrow b-\alpha\frac{\partial}{\partial b}E_{i}\)
  3. Repeat until error is small

When Batch Size \(=m\)

  1. Initialize random weights: \(w_{1},\ldots,w_{n},b\)
  2. For every batch:
    • Update weights:\(w_{j}\leftarrow w_{j}-\alpha\frac{1}{m}\sum_{i=1}^{m}\frac{\partial}{\partial w_{j}}E_{i}\)
    • Update bias:\(b\leftarrow b-\alpha\frac{1}{m}\sum_{i=1}^{m}\frac{\partial}{\partial b}E_{i}\)
  3. Repeat until error is small

Neural Networks

NN Notation

NN Forward Propagation Equations

\[\hat{y}_{i}=\sigma(W^{(L-1)}(\sigma(W^{(L-2)}(\ldots(\sigma(W^{(1)}X_{i}))))))\] \[\hat{y}_{i}=\sigma(W^{(3)}(\sigma(W^{(2)}(\sigma(W^{(1)}X_{i})))))\]

NN Backpropagation

Backprop Method Overview

Backprop Intuitive Understanding

Why Understand Backprop?

Sample NN Diagram

Example NN

Calculating \(\frac{\partial}{\partial W^{(1)}_{11}}E\)

For the sample NN:

\[\begin{align} \frac{\partial}{\partial W^{(1)}_{11}}E &= \left(\frac{a_1^{(4)}-y}{a_1^{(4)}(1-a_1^{(4)})}\right) \\ &\phantom{0000} \cdot \left(a_1^{(4)}(1-a_1^{(4)})\right) \\ &\phantom{0000} \cdot \left(\left(W_{11}^{(3)} \cdot a_1^{(3)}(1-a_1^{(3)}) \cdot W_{11}^{(2)}\right) + \left(W_{21}^{(3)} \cdot a_2^{(3)}(1-a_2^{(3)}) \cdot W_{12}^{(2)}\right) + \left(W_{31}^{(3)} \cdot a_3^{(3)}(1-a_3^{(3)}) \cdot W_{13}^{(2)}\right)\right) \\ &\phantom{0000} \cdot \left(a_1^{(2)}(1-a_1^{(2)})\right) \\ &\phantom{0000} \cdot \left(a_1^{(1)}\right) \end{align}\]

Proof.

\[\text{For simplicity, Let:}\] \[\begin{align} x_1=a_1^{(1)},\ x_2&=a_2^{(1)},\ x_3=a_3^{(1)}, \\ \hat{y}&=a_1^{(4)} \end{align}\] \[\text{Equations from sample NN:}\] \[\begin{align} z_1^{(2)} &= W_{11}^{(1)}a_1^{(1)} + W_{21}^{(1)}a_2^{(1)} + W_{31}^{(1)}a_3^{(1)} \\ a_1^{(2)} &= \sigma(z_1^{(2)}) \\ z_1^{(3)} &= W_{11}^{(2)}a_1^{(2)} + W_{21}^{(2)}a_2^{(2)} + W_{31}^{(2)}a_3^{(2)} + W_{41}^{(2)}a_4^{(2)}\\ a_1^{(3)} &= \sigma(z_1^{(3)}) \\ z_2^{(3)} &= W_{12}^{(2)}a_1^{(2)} + W_{22}^{(2)}a_2^{(2)} + W_{32}^{(2)}a_3^{(2)} + W_{42}^{(2)}a_4^{(2)}\\ a_2^{(3)} &= \sigma(z_2^{(3)}) \\ z_3^{(3)} &= W_{13}^{(2)}a_1^{(2)} + W_{23}^{(2)}a_2^{(2)} + W_{33}^{(2)}a_3^{(2)} + W_{43}^{(2)}a_4^{(2)}\\ a_3^{(3)} &= \sigma(z_3^{(3)}) \\ z_1^{(4)} &= W_{11}^{(3)}a_1^{(3)} + W_{21}^{(3)}a_2^{(3)} + W_{31}^{(3)}a_3^{(3)} \\ a_1^{(4)} &= \sigma(z_1^{(4)}) \\ E &= -y\ln(a_1^{(4)})-(1-y)\ln(1-a_1^{(4)})\\\\ \end{align}\] \[\text{Recall chain rule:}\] \[\frac{\partial C}{\partial x} = \frac{\partial A}{\partial x} \cdot \frac{\partial B}{\partial A} \cdot \frac{\partial C}{\partial B}\\\\\] \[\text{The following is incorrect use of chain rule:}\] \[\frac{\partial E}{\partial W_{11}^{(1)}} = \frac{\partial E}{\partial a_1^{(4)}} \cdot \frac{\partial a_1^{(4)}}{\partial z_1^{(4)}} \cdot \frac{\partial z_1^{(4)}}{\partial a_1^{(3)}} \cdot \frac{\partial a_1^{(3)}}{\partial z_1^{(3)}} \cdot \frac{\partial z_1^{(3)}}{\partial a_1^{(2)}} \cdot \frac{\partial a_1^{(2)}}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial W_{11}^{(1)}} \\\\\] \[\begin{align} \text{Incorrect because it only propagates error from }& a_1^{(4)}\to a_1^{(3)}\to a_1^{(2)}\to W_{11}^{(1)} \\ \text{And neglects error that also propagates from }& a_1^{(4)}\to a_2^{(3)}\to a_1^{(2)}\to W_{11}^{(1)} \\ \text{as well as from }& a_1^{(4)}\to a_3^{(3)}\to a_1^{(2)}\to W_{11}^{(1)} \\\\ \end{align}\] \[\text{Correct use of chain rule:}\] \[\frac{\partial E}{\partial W_{11}^{(1)}} = \frac{\partial E}{\partial a_1^{(4)}} \cdot \frac{\partial a_1^{(4)}}{\partial z_1^{(4)}} \cdot \frac{\partial z_1^{(4)}}{\partial a_1^{(2)}} \cdot \frac{\partial a_1^{(2)}}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial W_{11}^{(1)}} \\\\\] \[\text{Calculating partial derivatives:}\] \[\begin{align} \frac{\partial E}{\partial a_1^{(4)}}&=\frac{\partial}{\partial a_1^{(4)}}(-y\ln(a_1^{(4)})-(1-y)\ln(1-a_1^{(4)}))\\ &=-y\cdot\frac{\partial}{\partial a_1^{(4)}}(\ln(a_1^{(4)})-(1-y)\cdot\frac{\partial}{\partial a_1^{(4)}}(\ln(1-a_1^{(4)}))\\ &=-y\cdot\frac{1}{a_1^{(4)}}\cdot 1-(1-y)\cdot\frac{1}{1- a_1^{(4)}}\cdot -1\\ &=\frac{a_1^{(4)}-y}{a_1^{(4)}(1-a_1^{(4)})} \\\\ \end{align}\] \[\begin{align} \frac{\partial a_1^{(4)}}{\partial z_1^{(4)}} &= \frac{\partial }{\partial z_1^{(4)}}(\sigma(z_1^{(4)})) \\ &= \sigma(z_1^{(4)})(1-\sigma(z_1^{(4)})) &&(\sigma'(x)=\sigma(x)(1-\sigma(x)) \\ &= a_1^{(4)}(1-a_1^{(4)}) &&(a_1^{(4)}=\sigma(z_1^{(4)})) \\\\ \end{align}\] \[\begin{align} \frac{\partial z_1^{(4)}}{\partial a_1^{(2)}} &= \frac{\partial }{\partial a_1^{(2)}}(W_{11}^{(3)}a_1^{(3)} + W_{21}^{(3)}a_2^{(3)} + W_{31}^{(3)}a_3^{(3)}) \\ &= \frac{\partial }{\partial a_1^{(2)}}(W_{11}^{(3)}\sigma(z_1^{(3)}) + W_{21}^{(3)}\sigma(z_2^{(3)}) + W_{31}^{(3)}\sigma(z_3^{(3)})) \\ &= W_{11}^{(3)} \cdot \frac{\partial }{\partial a_1^{(2)}}(\sigma(z_1^{(3)})) + W_{21}^{(3)} \cdot \frac{\partial }{\partial a_1^{(2)}}(\sigma(z_2^{(3)})) + W_{31}^{(3)} \cdot \frac{\partial }{\partial a_1^{(2)}}(\sigma(z_3^{(3)})) \\ &= W_{11}^{(3)} \cdot \sigma(z_1^{(3)})(1-\sigma(z_1^{(3)})) \cdot \frac{\partial }{\partial a_1^{(2)}}(z_1^{(3)}) \\ &\phantom{0000} + W_{21}^{(3)} \cdot \sigma(z_2^{(3)})(1-\sigma(z_2^{(3)})) \cdot \frac{\partial }{\partial a_1^{(2)}}(z_2^{(3)}) \\ &\phantom{0000} + W_{31}^{(3)} \cdot \sigma(z_3^{(3)})(1-\sigma(z_3^{(3)})) \cdot \frac{\partial }{\partial a_1^{(2)}}(z_3^{(3)}) \\ &= W_{11}^{(3)} \cdot a_1^{(3)}(1-a_1^{(3)}) \cdot \frac{\partial }{\partial a_1^{(2)}}(z_1^{(3)}) \\ &\phantom{0000} + W_{21}^{(3)} \cdot a_2^{(3)}(1-a_2^{(3)}) \cdot \frac{\partial }{\partial a_1^{(2)}}(z_2^{(3)}) \\ &\phantom{0000} + W_{31}^{(3)} \cdot a_3^{(3)}(1-a_3^{(3)}) \cdot \frac{\partial }{\partial a_1^{(2)}}(z_3^{(3)}) \\ &= W_{11}^{(3)} \cdot a_1^{(3)}(1-a_1^{(3)}) \cdot \frac{\partial }{\partial a_1^{(2)}}(W_{11}^{(2)}a_1^{(2)} + W_{21}^{(2)}a_2^{(2)} + W_{31}^{(2)}a_3^{(2)} + W_{41}^{(2)}a_4^{(2)}) \\ &\phantom{0000} + W_{21}^{(3)} \cdot a_2^{(3)}(1-a_2^{(3)}) \cdot \frac{\partial }{\partial a_1^{(2)}}(W_{12}^{(2)}a_1^{(2)} + W_{22}^{(2)}a_2^{(2)} + W_{32}^{(2)}a_3^{(2)} + W_{42}^{(2)}a_4^{(2)}) \\ &\phantom{0000} + W_{31}^{(3)} \cdot a_3^{(3)}(1-a_3^{(3)}) \cdot \frac{\partial }{\partial a_1^{(2)}}(W_{13}^{(2)}a_1^{(2)} + W_{23}^{(2)}a_2^{(2)} + W_{33}^{(2)}a_3^{(2)} + W_{43}^{(2)}a_4^{(2)}) \\ &= W_{11}^{(3)} \cdot a_1^{(3)}(1-a_1^{(3)}) \cdot \left(\frac{\partial }{\partial a_1^{(2)}}(W_{11}^{(2)}a_1^{(2)}) + 0 + 0 + 0 \right) \\ &\phantom{0000} + W_{21}^{(3)} \cdot a_2^{(3)}(1-a_2^{(3)}) \cdot \left(\frac{\partial }{\partial a_1^{(2)}}(W_{12}^{(2)}a_1^{(2)}) + 0 + 0 + 0 \right) \\ &\phantom{0000} + W_{31}^{(3)} \cdot a_3^{(3)}(1-a_3^{(3)}) \cdot \left(\frac{\partial }{\partial a_1^{(2)}}(W_{13}^{(2)}a_1^{(2)}) + 0 + 0 + 0 \right) \\ &= \left(W_{11}^{(3)} \cdot a_1^{(3)}(1-a_1^{(3)}) \cdot W_{11}^{(2)}\right) + \left(W_{21}^{(3)} \cdot a_2^{(3)}(1-a_2^{(3)}) \cdot W_{12}^{(2)}\right) + \left(W_{31}^{(3)} \cdot a_3^{(3)}(1-a_3^{(3)}) \cdot W_{13}^{(2)}\right) \\\\ \end{align}\] \[\begin{align} \frac{\partial a_1^{(2)}}{\partial z_1^{(2)}} &= \frac{\partial }{\partial z_1^{(2)}}(\sigma(z_1^{(2)})) \\ &= \sigma(z_1^{(2)})(1-\sigma(z_1^{(2)})) \\ &= a_1^{(2)}(1-a_1^{(2)}) \\\\ \end{align}\] \[\begin{align} \frac{\partial z_1^{(2)}}{\partial W_{11}^{(1)}} &= \frac{\partial }{\partial W_{11}^{(1)}}(W_{11}^{(1)}a_1^{(1)} + W_{21}^{(1)}a_2^{(1)} + W_{31}^{(1)}a_3^{(1)}) \\ &= \frac{\partial }{\partial W_{11}^{(1)}}(W_{11}^{(1)}a_1^{(1)}) + 0 + 0 \\ &= a_1^{(1)} \\\\ \end{align}\] \[\text{Using calculated partial derivatives and chain rule:}\] \[\begin{align} \frac{\partial E}{\partial W_{11}^{(1)}} &= \frac{\partial E}{\partial a_1^{(4)}} \cdot \frac{\partial a_1^{(4)}}{\partial z_1^{(4)}} \cdot \frac{\partial z_1^{(4)}}{\partial a_1^{(2)}} \cdot \frac{\partial a_1^{(2)}}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial W_{11}^{(1)}} \\\\ &= \left(\frac{a_1^{(4)}-y}{a_1^{(4)}(1-a_1^{(4)})}\right) \\ &\phantom{0000} \cdot \left(a_1^{(4)}(1-a_1^{(4)})\right) \\ &\phantom{0000} \cdot \left(\left(W_{11}^{(3)} \cdot a_1^{(3)}(1-a_1^{(3)}) \cdot W_{11}^{(2)}\right) + \left(W_{21}^{(3)} \cdot a_2^{(3)}(1-a_2^{(3)}) \cdot W_{12}^{(2)}\right) + \left(W_{31}^{(3)} \cdot a_3^{(3)}(1-a_3^{(3)}) \cdot W_{13}^{(2)}\right)\right) \\ &\phantom{0000} \cdot \left(a_1^{(2)}(1-a_1^{(2)})\right) \\ &\phantom{0000} \cdot \left(a_1^{(1)}\right) \ \ \ \ \blacksquare\\\\ \end{align}\]

Calculating \(\frac{\partial}{\partial W^{(1)}_{21}}E\)

For the sample NN:

\[\begin{align} \frac{\partial}{\partial W^{(1)}_{21}}E &= \left(\frac{a_1^{(4)}-y}{a_1^{(4)}(1-a_1^{(4)})}\right) \\ &\phantom{0000} \cdot \left(a_1^{(4)}(1-a_1^{(4)})\right) \\ &\phantom{0000} \cdot \left(\left(W_{11}^{(3)} \cdot a_1^{(3)}(1-a_1^{(3)}) \cdot W_{11}^{(2)}\right) + \left(W_{21}^{(3)} \cdot a_2^{(3)}(1-a_2^{(3)}) \cdot W_{12}^{(2)}\right) + \left(W_{31}^{(3)} \cdot a_3^{(3)}(1-a_3^{(3)}) \cdot W_{13}^{(2)}\right)\right) \\ &\phantom{0000} \cdot \left(a_1^{(2)}(1-a_1^{(2)})\right) \\ &\phantom{0000} \cdot \left(a_2^{(1)}\right) \end{align}\]

Proof.

\[\text{Virtually identical proof as $\frac{\partial}{\partial W^{(1)}_{11}}E$,} \\ \text{Except for one partial derivative:}\] \[\frac{\partial E}{\partial W_{11}^{(1)}} = \frac{\partial E}{\partial a_1^{(4)}} \cdot \frac{\partial a_1^{(4)}}{\partial z_1^{(4)}} \cdot \frac{\partial z_1^{(4)}}{\partial a_1^{(2)}} \cdot \frac{\partial a_1^{(2)}}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial W_{21}^{(1)}} \\\\\] \[\text{Calculating $\frac{\partial z_1^{(2)}}{\partial W_{21}^{(1)}}$:}\] \[\begin{align} \frac{\partial z_1^{(2)}}{\partial W_{21}^{(1)}} &= \frac{\partial }{\partial W_{21}^{(1)}}(W_{11}^{(1)}a_1^{(1)} + W_{21}^{(1)}a_2^{(1)} + W_{31}^{(1)}a_3^{(1)}) \\ &= 0 + \frac{\partial }{\partial W_{21}^{(1)}}(W_{21}^{(1)}a_2^{(1)}) + 0 \\ &= a_2^{(1)} \\\\ \end{align}\] \[\text{Reusing previously calculated partial derivatives,} \\ \text{and $\frac{\partial z_1^{(2)}}{\partial W_{21}^{(1)}}$, yields desired result} \ \ \ \ \blacksquare\\\\\]

Calculating \(\frac{\partial}{\partial W^{(3)}_{11}}E\)

For the sample NN:

\[\begin{align} \frac{\partial}{\partial W^{(3)}_{11}}E &= \left(\frac{a_1^{(4)}-y}{a_1^{(4)}(1-a_1^{(4)})}\right) \cdot \left(a_1^{(4)}(1-a_1^{(4)})\right) \cdot \left( a_1^{(3)}\right) \end{align}\]

Proof.

\[\text{For simplicity, Let:}\] \[\begin{align} x_1=a_1^{(1)},\ x_2&=a_2^{(1)},\ x_3=a_3^{(1)}, \\ \hat{y}&=a_1^{(4)} \end{align}\] \[\text{Equations from sample NN:}\] \[\begin{align} z_1^{(4)} &= W_{11}^{(3)}a_1^{(3)} + W_{21}^{(3)}a_2^{(3)} + W_{31}^{(3)}a_3^{(3)} \\ a_1^{(4)} &= \sigma(z_1^{(4)}) \\ E &= -y\ln(a_1^{(4)})-(1-y)\ln(1-a_1^{(4)})\\\\ \end{align}\] \[\text{Using chain rule:}\] \[\frac{\partial E}{\partial W_{11}^{(3)}} = \frac{\partial E}{\partial a_1^{(4)}} \cdot \frac{\partial a_1^{(4)}}{\partial z_1^{(4)}} \cdot \frac{\partial z_1^{(4)}}{\partial W_{11}^{(3)}} \\\\\] \[\text{From proof for $\frac{\partial}{\partial W^{(1)}_{11}}E$:}\] \[\frac{\partial E}{\partial a_1^{(4)}}=\frac{a_1^{(4)}-y}{a_1^{(4)}(1-a_1^{(4)})} \\\\\] \[\frac{\partial a_1^{(4)}}{\partial z_1^{(4)}} = a_1^{(4)}(1-a_1^{(4)}) \\\\\] \[\begin{align} \frac{\partial z_1^{(4)}}{\partial W_{11}^{(3)}} &= \frac{\partial }{\partial W_{11}^{(3)}}(W_{11}^{(3)}a_1^{(3)} + W_{21}^{(3)}a_2^{(3)} + W_{31}^{(3)}a_3^{(3)}) \\ &= \frac{\partial }{\partial W_{11}^{(3)}}(W_{11}^{(3)}a_1^{(3)}) + 0 + 0 \\ &= a_1^{(3)} \\\\ \end{align}\] \[\frac{\partial E}{\partial W_{11}^{(3)}} = \left(\frac{a_1^{(4)}-y}{a_1^{(4)}(1-a_1^{(4)})}\right) \cdot \left(a_1^{(4)}(1-a_1^{(4)})\right) \cdot \left( a_1^{(3)}\right) \ \ \ \ \blacksquare\\\\\]

Backprop Algorithm Pseudo-Code

for simplicity, bias units and regularization are left out

Bike Sharing Dataset With Implemented Pseudo-Code

Hyperparameters

Notation (Pythonic)

Training Data (Pythonic)

Object Attributes (Pythonic)

Pseudo-Code (Pythonic)

Notation (Mathematical)

Training Data (Mathematical)

Matrix Visualizations (Mathematical)

\[\begin{align} W^{(l)} &= \begin{bmatrix} W^{(l)}_{11} & W^{(l)}_{21} & W^{(l)}_{31} & \ldots & W^{(l)}_{s_l1} \\ W^{(l)}_{12} & W^{(l)}_{22} & W^{(l)}_{32} \\ W^{(l)}_{13} & W^{(l)}_{23} & W^{(l)}_{33} \\ \vdots & & &\ddots \\ W^{(l)}_{1s_{l+1}} & & & & W^{(l)}_{s_ls_{l+1}} \\ \end{bmatrix} \\\\ X_i &= \begin{bmatrix} x_1 & x_2 & x_3 & \ldots & x_n \end{bmatrix} \\\\ Y_i &= \begin{bmatrix} y_1 & y_2 & y_3 & \ldots & y_c \end{bmatrix} \\\\ a^{(l)} &= \begin{bmatrix} a^{(l)}_1 \\ a^{(l)}_2 \\ a^{(l)}_3 \\ \ldots \\ a^{(l)}_{s_{l}} \end{bmatrix} \\\\ \delta^{(l)} &= \begin{bmatrix} \delta^{(l)}_1 \\ \delta^{(l)}_2 \\ \delta^{(l)}_3 \\ \ldots \\ \delta^{(l)}_{s_{l}} \end{bmatrix} \\\\ \frac{\partial}{\partial W^{(l)}}E &= \begin{bmatrix} \frac{\partial}{\partial W^{(l)}_{11}}E & \frac{\partial}{\partial W^{(l)}_{21}}E & \frac{\partial}{\partial W^{(l)}_{31}}E & \ldots & \frac{\partial}{\partial W^{(l)}_{s_l1}}E \\ \frac{\partial}{\partial W^{(l)}_{12}}E & \frac{\partial}{\partial W^{(l)}_{22}}E & \frac{\partial}{\partial W^{(l)}_{32}}E \\ \frac{\partial}{\partial W^{(l)}_{13}}E & \frac{\partial}{\partial W^{(l)}_{23}}E & \frac{\partial}{\partial W^{(l)}_{33}}E \\ \vdots & & &\ddots \\ \frac{\partial}{\partial W^{(l)}_{1s_{l+1}}}E & & & & \frac{\partial}{\partial W^{(l)}_{s_ls_{l+1}}}E \\ \end{bmatrix} \\\\ \end{align}\]

Pseudo-Code (Mathematical)

“Proof”.

\[\text{WWTP: all matrix multiplications make sense} \\ \text{Take pythonic expression, "plug" in the shapes: }\] \[\begin{align} a[l] &= sigmoid(np.matmul(W[l-1],a[l-1])) \\ (s_l,1) &= sigmoid(np.matmul((s_l,s_{l-1}),(s_{l-1},1))) \\ (s_l,1) &= sigmoid((s_l,1)) \\ (s_l,1) &= (s_l,1) \ \ \ \ \blacksquare\\\\ \end{align}\] \[\begin{align} \delta[l] &= np.matmul(W[l].T,\delta[l+1])*a[l]*(1-a[l]) \\ (s_l,1) &= np.matmul((s_{l+1},s_l).T,(s_{l+1},1))*(s_l,1)*(1-(s_l,1)) \\ (s_l,1) &= np.matmul((s_l,s_{l+1}),(s_{l+1},1))*(s_l,1)*(1-(s_l,1)) \\ (s_l,1) &= (s_l,1)*(s_l,1)*(1-(s_l,1)) \\ (s_l,1) &= (s_l,1)*(s_l,1)*(s_l,1) \\ (s_l,1) &= (s_l,1) \ \ \ \ \blacksquare\\\\ \end{align}\] \[\begin{align} grad\_sum\_W[l] &+= np.matmul(\delta[l+1],a[l].T) \\ (s_{l+1},s_l) &+= np.matmul((s_{l+1},1),(s_l,1).T) \\ (s_{l+1},s_l) &+= np.matmul((s_{l+1},1),(1,s_l)) \\ (s_{l+1},s_l) &+= (s_{l+1},s_l) \ \ \ \ \blacksquare\\\\ \end{align}\]

“Proof”.

\[\text{WWTP: pseudo-code calculates partial derivatives correctly} \\ \text{Informal proof, } \\ \text{use pseudo-code to calculate partial derivatives for sample NN,} \\ \text{check results match with raw calculations.}\\\\\] \[\text{WWTP: Pseudo-Code yields}\] \[\begin{align} \frac{\partial}{\partial W^{(1)}_{11}}E &= \left(\frac{a_1^{(4)}-y}{a_1^{(4)}(1-a_1^{(4)})}\right) \\ &\phantom{0000} \cdot \left(a_1^{(4)}(1-a_1^{(4)})\right) \\ &\phantom{0000} \cdot \left(\left(W_{11}^{(3)} \cdot a_1^{(3)}(1-a_1^{(3)}) \cdot W_{11}^{(2)}\right) + \left(W_{21}^{(3)} \cdot a_2^{(3)}(1-a_2^{(3)}) \cdot W_{12}^{(2)}\right) + \left(W_{31}^{(3)} \cdot a_3^{(3)}(1-a_3^{(3)}) \cdot W_{13}^{(2)}\right)\right) \\ &\phantom{0000} \cdot \left(a_1^{(2)}(1-a_1^{(2)})\right) \\ &\phantom{0000} \cdot \left(a_1^{(1)}\right) \end{align}\] \[\begin{align} \frac{\partial}{\partial W^{(1)}_{21}}E &= \left(\frac{a_1^{(4)}-y}{a_1^{(4)}(1-a_1^{(4)})}\right) \\ &\phantom{0000} \cdot \left(a_1^{(4)}(1-a_1^{(4)})\right) \\ &\phantom{0000} \cdot \left(\left(W_{11}^{(3)} \cdot a_1^{(3)}(1-a_1^{(3)}) \cdot W_{11}^{(2)}\right) + \left(W_{21}^{(3)} \cdot a_2^{(3)}(1-a_2^{(3)}) \cdot W_{12}^{(2)}\right) + \left(W_{31}^{(3)} \cdot a_3^{(3)}(1-a_3^{(3)}) \cdot W_{13}^{(2)}\right)\right) \\ &\phantom{0000} \cdot \left(a_1^{(2)}(1-a_1^{(2)})\right) \\ &\phantom{0000} \cdot \left(a_2^{(1)}\right) \end{align}\] \[\begin{align} \frac{\partial}{\partial W^{(3)}_{11}}E &= \left(\frac{a_1^{(4)}-y}{a_1^{(4)}(1-a_1^{(4)})}\right) \cdot \left(a_1^{(4)}(1-a_1^{(4)})\right) \cdot \left( a_1^{(3)}\right) \\\\ &=(a_1^{(4)}-y)a_1^{(3)} \\\\ \end{align}\] \[\text{Executing Pseudo-Code (Mathematical):} \\ \text{Forward Propagation, calculating activations:}\] \[\begin{align} a^{(1)} &= \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \\\\ a^{(2)} &= \sigma(W^{(1)}a^{(1)}) \\ &=\sigma\left( \begin{bmatrix} W^{(1)}_{11} & W^{(1)}_{21} & W^{(1)}_{31} \\ W^{(1)}_{12} & W^{(1)}_{22} & W^{(1)}_{32} \\ W^{(1)}_{13} & W^{(1)}_{23} & W^{(1)}_{33} \\ W^{(1)}_{14} & W^{(1)}_{24} & W^{(1)}_{34} \\ \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \right) \\ &=\sigma\left( \begin{bmatrix} W^{(1)}_{11}x_1 + W^{(1)}_{21}x_2 + W^{(1)}_{31}x_3 \\ W^{(1)}_{12}x_1 + W^{(1)}_{22}x_2 + W^{(1)}_{32}x_3 \\ W^{(1)}_{13}x_1 + W^{(1)}_{23}x_2 + W^{(1)}_{33}x_3 \\ W^{(1)}_{14}x_1 + W^{(1)}_{24}x_2 + W^{(1)}_{34}x_3 \\ \end{bmatrix} \right) \\ &=\begin{bmatrix} \sigma\left(W^{(1)}_{11}x_1 + W^{(1)}_{21}x_2 + W^{(1)}_{31}x_3\right) \\ \sigma\left(W^{(1)}_{12}x_1 + W^{(1)}_{22}x_2 + W^{(1)}_{32}x_3\right) \\ \sigma\left(W^{(1)}_{13}x_1 + W^{(1)}_{23}x_2 + W^{(1)}_{33}x_3\right) \\ \sigma\left(W^{(1)}_{14}x_1 + W^{(1)}_{24}x_2 + W^{(1)}_{34}x_3\right) \\ \end{bmatrix} \\\\ a^{(3)} &= \sigma(W^{(2)}a^{(2)}) \\ &=\sigma\left( \begin{bmatrix} W^{(2)}_{11} & W^{(2)}_{21} & W^{(2)}_{31} & W^{(2)}_{41} \\ W^{(2)}_{12} & W^{(2)}_{22} & W^{(2)}_{32} & W^{(2)}_{42} \\ W^{(2)}_{13} & W^{(2)}_{23} & W^{(2)}_{33} & W^{(2)}_{43} \\ \end{bmatrix} \begin{bmatrix} a^{(2)}_1 \\ a^{(2)}_2 \\ a^{(2)}_3 \\ a^{(2)}_4 \end{bmatrix} \right) \\ &=\sigma\left( \begin{bmatrix} W^{(2)}_{11}a^{(2)}_1 + W^{(2)}_{21}a^{(2)}_2 + W^{(2)}_{31}a^{(2)}_3 + W^{(2)}_{41}a^{(2)}_4 \\ W^{(2)}_{12}a^{(2)}_1 + W^{(2)}_{22}a^{(2)}_2 + W^{(2)}_{32}a^{(2)}_3 + W^{(2)}_{42}a^{(2)}_4 \\ W^{(2)}_{13}a^{(2)}_1 + W^{(2)}_{23}a^{(2)}_2 + W^{(2)}_{33}a^{(2)}_3 + W^{(2)}_{43}a^{(2)}_4 \\ \end{bmatrix} \right) \\ &=\begin{bmatrix} \sigma\left(W^{(2)}_{11}a^{(2)}_1 + W^{(2)}_{21}a^{(2)}_2 + W^{(2)}_{31}a^{(2)}_3 + W^{(2)}_{41}a^{(2)}_4\right) \\ \sigma\left(W^{(2)}_{12}a^{(2)}_1 + W^{(2)}_{22}a^{(2)}_2 + W^{(2)}_{32}a^{(2)}_3 + W^{(2)}_{42}a^{(2)}_4\right) \\ \sigma\left(W^{(2)}_{13}a^{(2)}_1 + W^{(2)}_{23}a^{(2)}_2 + W^{(2)}_{33}a^{(2)}_3 + W^{(2)}_{43}a^{(2)}_4\right) \\ \end{bmatrix} \\\\ a^{(4)} &= \sigma(W^{(3)}a^{(3)}) \\ &=\sigma\left( \begin{bmatrix} W^{(3)}_{11} & W^{(3)}_{21} & W^{(3)}_{31} \\ \end{bmatrix} \begin{bmatrix} a^{(3)}_1 \\ a^{(3)}_2 \\ a^{(3)}_3 \end{bmatrix} \right) \\ &=\sigma\left( \begin{bmatrix} W^{(3)}_{11}a^{(3)}_1 + W^{(3)}_{21}a^{(3)}_2 + W^{(3)}_{31}a^{(3)}_3 \\ \end{bmatrix} \right) \\ &=\begin{bmatrix} \sigma\left(W^{(3)}_{11}a^{(3)}_1 + W^{(3)}_{21}a^{(3)}_2 + W^{(3)}_{31}a^{(3)}_3\right) \\ \end{bmatrix} \\\\ \end{align}\] \[\text{Backpropagation, calculating errors:}\] \[\begin{align} \delta^{(4)} &= a^{(4)}-Y_i^T \\ &= \begin{bmatrix} \sigma\left(W^{(3)}_{11}a^{(3)}_1 + W^{(3)}_{21}a^{(3)}_2 + W^{(3)}_{31}a^{(3)}_3\right) \\ \end{bmatrix} -\begin{bmatrix}y_1\end{bmatrix}^T \\ &= \begin{bmatrix} \sigma\left(W^{(3)}_{11}a^{(3)}_1 + W^{(3)}_{21}a^{(3)}_2 + W^{(3)}_{31}a^{(3)}_3\right) \\ \end{bmatrix} -\begin{bmatrix}y_1\end{bmatrix} \\ &= \begin{bmatrix} a^{(4)}_1 - y_1 \\ \end{bmatrix}\\\\ &\text{For simplicity and consistency, Let $y=y_1$:} \\ \delta^{(4)} &= a^{(4)}_1 - y \\\\ \delta^{(3)} &= W^{(3)T}\delta^{(4)} \circ a^{(3)} \circ (1-a^{(3)}) \\ &= \begin{bmatrix} W^{(3)}_{11} & W^{(3)}_{21} & W^{(3)}_{31} \\ \end{bmatrix}^T \begin{bmatrix} a^{(4)}_1 - y \\ \end{bmatrix} \circ a^{(3)} \circ (1-a^{(3)}) \\ &= \begin{bmatrix} W^{(3)}_{11} \\ W^{(3)}_{21} \\ W^{(3)}_{31} \\ \end{bmatrix} \begin{bmatrix} a^{(4)}_1 - y \\ \end{bmatrix} \circ a^{(3)} \circ (1-a^{(3)}) \\ &= \begin{bmatrix} W^{(3)}_{11}(a^{(4)}_1 - y) \\ W^{(3)}_{21}(a^{(4)}_1 - y) \\ W^{(3)}_{31}(a^{(4)}_1 - y) \\ \end{bmatrix} \circ a^{(3)} \circ (1-a^{(3)}) \\ &= \begin{bmatrix} W^{(3)}_{11}(a^{(4)}_1 - y) \\ W^{(3)}_{21}(a^{(4)}_1 - y) \\ W^{(3)}_{31}(a^{(4)}_1 - y) \\ \end{bmatrix} \circ \begin{bmatrix}a^{(3)}_1 \\ a^{(3)}_2 \\ a^{(3)}_3 \end{bmatrix} \circ (1 - \begin{bmatrix}a^{(3)}_1 \\ a^{(3)}_2 \\ a^{(3)}_3 \end{bmatrix}) \\ &= \begin{bmatrix} W^{(3)}_{11}(a^{(4)}_1 - y) \\ W^{(3)}_{21}(a^{(4)}_1 - y) \\ W^{(3)}_{31}(a^{(4)}_1 - y) \\ \end{bmatrix} \circ \begin{bmatrix}a^{(3)}_1 \\ a^{(3)}_2 \\ a^{(3)}_3 \end{bmatrix} \circ \begin{bmatrix}1 - a^{(3)}_1 \\ 1 - a^{(3)}_2 \\ 1 - a^{(3)}_3 \end{bmatrix} \\ &= \begin{bmatrix} W^{(3)}_{11}(a^{(4)}_1 - y)a^{(3)}_1(1 - a^{(3)}_1) \\ W^{(3)}_{21}(a^{(4)}_1 - y)a^{(3)}_2(1 - a^{(3)}_2) \\ W^{(3)}_{31}(a^{(4)}_1 - y)a^{(3)}_3(1 - a^{(3)}_3) \\ \end{bmatrix}\\\\ \end{align}\] \[\begin{align} \delta^{(2)} &= W^{(2)T}\delta^{(3)} \circ a^{(2)} \circ (1-a^{(2)}) \\ &= \begin{bmatrix} W^{(2)}_{11} & W^{(2)}_{21} & W^{(2)}_{31} & W^{(2)}_{41} \\ W^{(2)}_{12} & W^{(2)}_{22} & W^{(2)}_{32} & W^{(2)}_{42} \\ W^{(2)}_{13} & W^{(2)}_{23} & W^{(2)}_{33} & W^{(2)}_{43} \\ \end{bmatrix}^T \begin{bmatrix} W^{(3)}_{11}(a^{(4)}_1 - y)a^{(3)}_1(1 - a^{(3)}_1) \\ W^{(3)}_{21}(a^{(4)}_1 - y)a^{(3)}_2(1 - a^{(3)}_2) \\ W^{(3)}_{31}(a^{(4)}_1 - y)a^{(3)}_3(1 - a^{(3)}_3) \\ \end{bmatrix} \circ a^{(2)} \circ (1-a^{(2)}) \\ &= \begin{bmatrix} W^{(2)}_{11} & W^{(2)}_{12} & W^{(2)}_{13} \\ W^{(2)}_{21} & W^{(2)}_{22} & W^{(2)}_{23} \\ W^{(2)}_{31} & W^{(2)}_{32} & W^{(2)}_{33} \\ W^{(2)}_{41} & W^{(2)}_{42} & W^{(2)}_{43} \\ \end{bmatrix} \begin{bmatrix} W^{(3)}_{11}(a^{(4)}_1 - y)a^{(3)}_1(1 - a^{(3)}_1) \\ W^{(3)}_{21}(a^{(4)}_1 - y)a^{(3)}_2(1 - a^{(3)}_2) \\ W^{(3)}_{31}(a^{(4)}_1 - y)a^{(3)}_3(1 - a^{(3)}_3) \\ \end{bmatrix} \circ a^{(2)} \circ (1-a^{(2)}) \\ &= \begin{bmatrix} \left(\left(W^{(3)}_{11}(a^{(4)}_1 - y)a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{11}\right) + \left(W^{(3)}_{21}(a^{(4)}_1 - y)a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{12}\right) + \left(W^{(3)}_{31}(a^{(4)}_1 - y)a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{13}\right)\right) \\ \left(\left(W^{(3)}_{11}(a^{(4)}_1 - y)a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{21}\right) + \left(W^{(3)}_{21}(a^{(4)}_1 - y)a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{22}\right) + \left(W^{(3)}_{31}(a^{(4)}_1 - y)a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{23}\right)\right) \\ \left(\left(W^{(3)}_{11}(a^{(4)}_1 - y)a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{31}\right) + \left(W^{(3)}_{21}(a^{(4)}_1 - y)a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{32}\right) + \left(W^{(3)}_{31}(a^{(4)}_1 - y)a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{33}\right)\right) \\ \left(\left(W^{(3)}_{11}(a^{(4)}_1 - y)a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{41}\right) + \left(W^{(3)}_{21}(a^{(4)}_1 - y)a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{42}\right) + \left(W^{(3)}_{31}(a^{(4)}_1 - y)a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{43}\right)\right) \\ \end{bmatrix} \circ a^{(2)} \circ (1-a^{(2)}) \\\\ &= \begin{bmatrix} (a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{11}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{12}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{13}\right)\right) \\ (a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{21}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{22}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{23}\right)\right) \\ (a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{31}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{32}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{33}\right)\right) \\ (a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{41}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{42}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{43}\right)\right) \\ \end{bmatrix} \circ a^{(2)} \circ (1-a^{(2)}) \\\\ &= \begin{bmatrix} (a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{11}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{12}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{13}\right)\right) \\ (a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{21}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{22}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{23}\right)\right) \\ (a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{31}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{32}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{33}\right)\right) \\ (a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{41}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{42}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{43}\right)\right) \\ \end{bmatrix} \circ \begin{bmatrix}a^{(2)}_1 \\ a^{(2)}_2 \\ a^{(2)}_3 \\ a^{(2)}_4\end{bmatrix} \circ (1-\begin{bmatrix}a^{(2)}_1 \\ a^{(2)}_2 \\ a^{(2)}_3 \\ a^{(2)}_4\end{bmatrix}) \\\\ &= \begin{bmatrix} (a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{11}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{12}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{13}\right)\right)a^{(2)}_1(1-a^{(2)}_1) \\ (a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{21}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{22}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{23}\right)\right)a^{(2)}_2(1-a^{(2)}_2) \\ (a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{31}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{32}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{33}\right)\right)a^{(2)}_3(1-a^{(2)}_3) \\ (a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{41}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{42}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{43}\right)\right)a^{(2)}_4(1-a^{(2)}_4) \\ \end{bmatrix} \\\\ \end{align}\] \[\text{Calculating partial derivatives using errors:}\] \[\begin{align} \frac{\partial}{\partial W^{(3)}}E &= \delta^{(4)}a^{(3)T} \\ &= \begin{bmatrix}a^{(4)}_1 - y \end{bmatrix}\begin{bmatrix}a^{(3)}_1 \\ a^{(3)}_2 \\ a^{(3)}_3\end{bmatrix}^T \\ &= \begin{bmatrix}a^{(4)}_1 - y \end{bmatrix}\begin{bmatrix}a^{(3)}_1 & a^{(3)}_2 & a^{(3)}_3\end{bmatrix} \\ &= \begin{bmatrix}(a^{(4)}_1 - y)a^{(3)}_1 & (a^{(4)}_1 - y)a^{(3)}_2 & (a^{(4)}_1 - y)a^{(3)}_3\end{bmatrix} \\ \end{align}\] \[\begin{align} \frac{\partial}{\partial W^{(2)}}E &= \delta^{(3)}a^{(2)T} \\ &= \begin{bmatrix} W^{(3)}_{11}(a^{(4)}_1 - y)a^{(3)}_1(1 - a^{(3)}_1) \\ W^{(3)}_{21}(a^{(4)}_1 - y)a^{(3)}_2(1 - a^{(3)}_2) \\ W^{(3)}_{31}(a^{(4)}_1 - y)a^{(3)}_3(1 - a^{(3)}_3) \\ \end{bmatrix} \begin{bmatrix} a^{(2)}_1 \\ a^{(2)}_2 \\ a^{(2)}_3 \\ a^{(2)}_4 \end{bmatrix}^T \\ &= \begin{bmatrix} W^{(3)}_{11}(a^{(4)}_1 - y)a^{(3)}_1(1 - a^{(3)}_1) \\ W^{(3)}_{21}(a^{(4)}_1 - y)a^{(3)}_2(1 - a^{(3)}_2) \\ W^{(3)}_{31}(a^{(4)}_1 - y)a^{(3)}_3(1 - a^{(3)}_3) \\ \end{bmatrix} \begin{bmatrix} a^{(2)}_1 & a^{(2)}_2 & a^{(2)}_3 & a^{(2)}_4 \end{bmatrix} \\ &= \begin{bmatrix} \left(W^{(3)}_{11}(a^{(4)}_1 - y)a^{(3)}_1(1 - a^{(3)}_1)a^{(2)}_1\right) & \left(W^{(3)}_{11}(a^{(4)}_1 - y)a^{(3)}_1(1 - a^{(3)}_1)a^{(2)}_2\right) & \left(W^{(3)}_{11}(a^{(4)}_1 - y)a^{(3)}_1(1 - a^{(3)}_1)a^{(2)}_3\right) & \left(W^{(3)}_{11}(a^{(4)}_1 - y)a^{(3)}_1(1 - a^{(3)}_1)a^{(2)}_4\right) \\ \left(W^{(3)}_{21}(a^{(4)}_1 - y)a^{(3)}_2(1 - a^{(3)}_2)a^{(2)}_1\right) & \left(W^{(3)}_{21}(a^{(4)}_1 - y)a^{(3)}_2(1 - a^{(3)}_2)a^{(2)}_2\right) & \left(W^{(3)}_{21}(a^{(4)}_1 - y)a^{(3)}_2(1 - a^{(3)}_2)a^{(2)}_3\right) & \left(W^{(3)}_{21}(a^{(4)}_1 - y)a^{(3)}_2(1 - a^{(3)}_2)a^{(2)}_4\right) \\ \left(W^{(3)}_{31}(a^{(4)}_1 - y)a^{(3)}_3(1 - a^{(3)}_3)a^{(2)}_1\right) & \left(W^{(3)}_{31}(a^{(4)}_1 - y)a^{(3)}_3(1 - a^{(3)}_3)a^{(2)}_2\right) & \left(W^{(3)}_{31}(a^{(4)}_1 - y)a^{(3)}_3(1 - a^{(3)}_3)a^{(2)}_3\right) & \left(W^{(3)}_{31}(a^{(4)}_1 - y)a^{(3)}_3(1 - a^{(3)}_3)a^{(2)}_4\right) \\ \end{bmatrix} \end{align}\] \[\begin{align} \frac{\partial}{\partial W^{(1)}}E &= \delta^{(2)}a^{(1)T} \\ &=\begin{bmatrix} (a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{11}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{12}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{13}\right)\right)a^{(2)}_1(1-a^{(2)}_1) \\ (a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{21}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{22}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{23}\right)\right)a^{(2)}_2(1-a^{(2)}_2) \\ (a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{31}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{32}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{33}\right)\right)a^{(2)}_3(1-a^{(2)}_3) \\ (a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{41}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{42}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{43}\right)\right)a^{(2)}_4(1-a^{(2)}_4) \\ \end{bmatrix} \begin{bmatrix}a^{(1)}_1 \\ a^{(1)}_2 \\ a^{(1)}_3\end{bmatrix}^T \\ &=\begin{bmatrix} (a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{11}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{12}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{13}\right)\right)a^{(2)}_1(1-a^{(2)}_1) \\ (a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{21}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{22}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{23}\right)\right)a^{(2)}_2(1-a^{(2)}_2) \\ (a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{31}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{32}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{33}\right)\right)a^{(2)}_3(1-a^{(2)}_3) \\ (a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{41}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{42}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{43}\right)\right)a^{(2)}_4(1-a^{(2)}_4) \\ \end{bmatrix} \begin{bmatrix}a^{(1)}_1 & a^{(1)}_2 & a^{(1)}_3\end{bmatrix} \\ &=\begin{bmatrix} \left((a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{11}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{12}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{13}\right)\right)a^{(2)}_1(1-a^{(2)}_1)a^{(1)}_1\right) & \left((a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{11}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{12}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{13}\right)\right)a^{(2)}_1(1-a^{(2)}_1)a^{(1)}_2\right) & \left((a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{11}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{12}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{13}\right)\right)a^{(2)}_1(1-a^{(2)}_1)a^{(1)}_3\right) \\ \left((a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{21}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{22}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{23}\right)\right)a^{(2)}_2(1-a^{(2)}_2)a^{(1)}_1\right) & \left((a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{21}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{22}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{23}\right)\right)a^{(2)}_2(1-a^{(2)}_2)a^{(1)}_2\right) & \left((a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{21}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{22}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{23}\right)\right)a^{(2)}_2(1-a^{(2)}_2)a^{(1)}_3\right) \\ \left((a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{31}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{32}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{33}\right)\right)a^{(2)}_3(1-a^{(2)}_3)a^{(1)}_1\right) & \left((a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{31}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{32}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{33}\right)\right)a^{(2)}_3(1-a^{(2)}_3)a^{(1)}_2\right) & \left((a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{31}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{32}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{33}\right)\right)a^{(2)}_3(1-a^{(2)}_3)a^{(1)}_3\right) \\ \left((a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{41}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{42}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{43}\right)\right)a^{(2)}_4(1-a^{(2)}_4)a^{(1)}_1\right) & \left((a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{41}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{42}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{43}\right)\right)a^{(2)}_4(1-a^{(2)}_4)a^{(1)}_2\right) & \left((a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{41}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{42}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{43}\right)\right)a^{(2)}_4(1-a^{(2)}_4)a^{(1)}_3\right) \\ \end{bmatrix} \\\\ \end{align}\] \[\text{According to calculations:}\] \[\begin{align} \frac{\partial}{\partial W^{(3)}_{11}}E &= (a^{(4)}_1 - y)a^{(3)}_1 \\ \frac{\partial}{\partial W^{(1)}_{11}}E &= (a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{11}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{12}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{13}\right)\right)a^{(2)}_1(1-a^{(2)}_1)a^{(1)}_1 \\ \frac{\partial}{\partial W^{(1)}_{21}}E &= (a^{(4)}_1 - y)\left(\left(W^{(3)}_{11}a^{(3)}_1(1 - a^{(3)}_1)W^{(2)}_{11}\right) + \left(W^{(3)}_{21}a^{(3)}_2(1 - a^{(3)}_2)W^{(2)}_{12}\right) + \left(W^{(3)}_{31}a^{(3)}_3(1 - a^{(3)}_3)W^{(2)}_{13}\right)\right)a^{(2)}_1(1-a^{(2)}_1)a^{(1)}_2 \\\\ \end{align}\] \[\text{Therefore the results from pseudo-code match raw calculations} \ \ \ \ \blacksquare\\\\\]

Backprop Clarifying Discrepancies

According to pseudo-code:

\[\text{error} = \text{pred} - \text{label}\] \[\begin{align} \text{weights } &-= \text{learnrate} * \text{partial derivative} \\ &-= \text{learnrate} * (\text{pred} - \text{label})\ldots \\ \end{align}\]

A variation that yields the same results (as seen in Udacity DL course):

\[\text{error} = \text{label} - \text{pred}\] \[\begin{align} \text{weights } &+= \text{learnrate} * \text{partial derivative} \\ &+= \text{learnrate} * (\text{label} - \text{pred})\ldots \\ \end{align}\]

Both versions of pseudo-code yield identical results.

“Proof”.

\[\text{In the end the pseudo-code's purpose,} \\ \text{is to calculate the partial derivative of error with respect to weights,} \\ \text{and this is done using the chain rule, for example:} \\\] \[\begin{align} \frac{\partial E}{\partial \hat{y}}&=\frac{\partial}{\partial \hat{y}}(-y\ln(\hat{y})-(1-y)\ln(1-\hat{y}))\\ &=\frac{\hat{y}-y}{\hat{y}(1-\hat{y})} \\\\ \end{align}\] \[\text{Therefore the following has to be true:} \\ \text{error} = \text{pred} - \text{label}\] \[\text{And to update the weights, we subtract the partial derivative from weight:}\] \[\begin{align} \text{weights } &-= \text{learnrate} * \text{partial derivative} \\ &-= \text{learnrate} * (\text{pred} - \text{label})\ldots \\\\ \end{align}\] \[\text{The previous steps validated the pseudo-code, but not the variation,} \\ \text{The variation is essentially the same, except the negative is "distributed":} \\\] \[\begin{align} \text{weights } &-= \text{learnrate} * \text{partial derivative} \\ &-= \text{learnrate} * (\text{pred} - \text{label})\ldots \\ &+= \text{learnrate} * -1(\text{pred} - \text{label})\ldots \\ &+= \text{learnrate} * (\text{label}-\text{pred})\ldots \\ &\implies \text{error} = \text{label}-\text{pred} \end{align}\] \[\text{And we find the variation yields the same results as the original.} \ \ \ \ \blacksquare\\\\\]

Discovering Backprop Formula From Scratch

  1. Try manually computing single partial derivative (like before using the chain rule) for some sample neural network
  2. Try computing multiple partial derivatives
  3. Note the patterns, transform every part in chain rule into matrix multiplication to calculate all partial derivatives for a layer, for the sample NN
  4. Generalize calculations to NN of any size
    • Notice how coming up with any algorithm requires: performing manually, then generalizing results

NN Toolkit

Creating NN