ML Notes

Recurrent Neural Networks

Backpropagation Through Time (BPTT)

a method for training an RNN at a specific time point \(t\), while taking previous time points into account

given the simple RNN shown in the diagrams above

Sample Calculation

The partial derivatives of the error, at time \(t=3\), with respect to each weight matrix:

\[\begin{align} \frac{\partial E_3}{\partial W_y} &= \frac{\partial E_3}{\partial \bar{y}_3}\frac{\partial \bar{y}_3}{\partial W_y}\\\\ \frac{\partial E_3}{\partial W_s} &= \frac{\partial E_3}{\partial \bar{y}_3}\frac{\partial \bar{y}_3}{\partial \bar{s}_3}\frac{\partial \bar{s}_3}{\partial W_s} +\\ &\phantom{000}\frac{\partial E_3}{\partial \bar{y}_3}\frac{\partial \bar{y}_3}{\partial \bar{s}_3}\frac{\partial \bar{s}_3}{\partial \bar{s}_2}\frac{\partial \bar{s}_2}{\partial W_s}+\\ &\phantom{000}\frac{\partial E_3}{\partial \bar{y}_3}\frac{\partial \bar{y}_3}{\partial \bar{s}_3}\frac{\partial \bar{s}_3}{\partial \bar{s}_2}\frac{\partial \bar{s}_2}{\partial \bar{s}_1}\frac{\partial \bar{s}_1}{\partial W_s}\\\\ \frac{\partial E_3}{\partial W_x} &= \frac{\partial E_3}{\partial \bar{y}_3}\frac{\partial \bar{y}_3}{\partial \bar{s}_3}\frac{\partial \bar{s}_3}{\partial W_x} +\\ &\phantom{000}\frac{\partial E_3}{\partial \bar{y}_3}\frac{\partial \bar{y}_3}{\partial \bar{s}_3}\frac{\partial \bar{s}_3}{\partial \bar{s}_2}\frac{\partial \bar{s}_2}{\partial W_x}+\\ &\phantom{000}\frac{\partial E_3}{\partial \bar{y}_3}\frac{\partial \bar{y}_3}{\partial \bar{s}_3}\frac{\partial \bar{s}_3}{\partial \bar{s}_2}\frac{\partial \bar{s}_2}{\partial \bar{s}_1}\frac{\partial \bar{s}_1}{\partial W_x}\\ \end{align}\]
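This three-term sum can be checked numerically on a scalar toy RNN. A minimal sketch, assuming \(\bar{s}_t = \tanh(W_x x_t + W_s \bar{s}_{t-1})\), \(\bar{y}_t = W_y\bar{s}_t\), and \(E_3 = \frac{1}{2}(\bar{y}_3 - \text{target})^2\); all weight values and inputs below are made up:

```python
import math

def forward(Wx, Ws, Wy, xs, s0=0.0):
    """Run the simple RNN: s_t = tanh(Wx*x_t + Ws*s_{t-1}), y = Wy*s_N."""
    states = [s0]
    for x in xs:
        states.append(math.tanh(Wx * x + Ws * states[-1]))
    return states, Wy * states[-1]

Wx, Ws, Wy = 0.5, 0.8, 1.2           # hypothetical scalar weights
xs, target = [1.0, -0.3, 0.6], 0.4   # three time steps, target at t=3

(s0, s1, s2, s3), y3 = forward(Wx, Ws, Wy, xs)
dE_dy3 = y3 - target                 # from E_3 = (y_3 - target)^2 / 2
dy3_ds3 = Wy
ds3_ds2 = Ws * (1 - s3**2)           # tanh'(a) = 1 - tanh(a)^2
ds2_ds1 = Ws * (1 - s2**2)
# "local" partials of each state w.r.t. Ws (previous state held fixed)
ds3_dWs, ds2_dWs, ds1_dWs = s2 * (1 - s3**2), s1 * (1 - s2**2), s0 * (1 - s1**2)

dE_dWs = dE_dy3 * dy3_ds3 * (ds3_dWs
                             + ds3_ds2 * ds2_dWs
                             + ds3_ds2 * ds2_ds1 * ds1_dWs)

# check against a central finite difference over the full unrolled forward pass
eps = 1e-6
_, yp = forward(Wx, Ws + eps, Wy, xs)
_, ym = forward(Wx, Ws - eps, Wy, xs)
numeric = (0.5 * (yp - target)**2 - 0.5 * (ym - target)**2) / (2 * eps)
```

Each \(\frac{\partial \bar{s}_i}{\partial W_s}\) here is the local partial with \(\bar{s}_{i-1}\) held fixed; the chained \(\frac{\partial \bar{s}_i}{\partial \bar{s}_{i-1}}\) factors carry the dependence through earlier steps, so the sum agrees with the finite difference.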

“Proof.”

Just like in the NN chapter, to calculate the partial derivative of the error with respect to a specific set of weights, simply:

  1. List all equations used to calculate the hidden nodes, output nodes, and error from the inputs
  2. Plug all equations into the error equation, replacing variables
  3. Take the partial derivative of that error with respect to the desired weight
  4. Use derivative properties (especially the chain rule) to simplify

An alternative to the above method is to:

  1. For every neuron (state) where the desired weight contributes to that state, and that state contributes to the error:
    • Use the chain rule to construct the partial derivative of the error with respect to the desired weight, through that neuron (state)
  2. Calculate the overall partial derivative of the error with respect to the desired weight by:
    • Adding every partial derivative calculated in the previous step (accumulating the gradient contributions from all states)

For example:

\[\text{Given the desired weight, } W_s\text{, calculate } \frac{\partial E_3}{\partial W_s}\text{:}\] \[s_3 \text{ is a state where } W_s \text{ contributes to } s_3\text{, and } s_3 \text{ contributes to } E_3\text{; its contribution is} \\ \frac{\partial E_3}{\partial \bar{y}_3}\frac{\partial \bar{y}_3}{\partial \bar{s}_3}\frac{\partial \bar{s}_3}{\partial W_s}\] \[s_2 \text{ is a state where } W_s \text{ contributes to } s_2\text{, and } s_2 \text{ contributes to } E_3\text{; its contribution is} \\ \frac{\partial E_3}{\partial \bar{y}_3}\frac{\partial \bar{y}_3}{\partial \bar{s}_3}\frac{\partial \bar{s}_3}{\partial \bar{s}_2}\frac{\partial \bar{s}_2}{\partial W_s}\] \[s_1 \text{ is a state where } W_s \text{ contributes to } s_1\text{, and } s_1 \text{ contributes to } E_3\text{; its contribution is} \\ \frac{\partial E_3}{\partial \bar{y}_3}\frac{\partial \bar{y}_3}{\partial \bar{s}_3}\frac{\partial \bar{s}_3}{\partial \bar{s}_2}\frac{\partial \bar{s}_2}{\partial \bar{s}_1}\frac{\partial \bar{s}_1}{\partial W_s}\]

\(\text{Accumulating gradient contributions from every state:}\\\\\) \(\begin{align} \frac{\partial E_3}{\partial W_s} &= \frac{\partial E_3}{\partial \bar{y}_3}\frac{\partial \bar{y}_3}{\partial \bar{s}_3}\frac{\partial \bar{s}_3}{\partial W_s} +\\ &\phantom{000}\frac{\partial E_3}{\partial \bar{y}_3}\frac{\partial \bar{y}_3}{\partial \bar{s}_3}\frac{\partial \bar{s}_3}{\partial \bar{s}_2}\frac{\partial \bar{s}_2}{\partial W_s}+\\ &\phantom{000}\frac{\partial E_3}{\partial \bar{y}_3}\frac{\partial \bar{y}_3}{\partial \bar{s}_3}\frac{\partial \bar{s}_3}{\partial \bar{s}_2}\frac{\partial \bar{s}_2}{\partial \bar{s}_1}\frac{\partial \bar{s}_1}{\partial W_s}\ \ \ \ \blacksquare\\\\ \end{align}\)

General Formula

The partial derivatives of the error at an arbitrary time \(t=N\), with respect to each weight matrix:

\[\begin{align} \frac{\partial E_N}{\partial W_y} &= \frac{\partial E_N}{\partial \bar{y}_N}\frac{\partial \bar{y}_N}{\partial W_y}\\\\ \frac{\partial E_N}{\partial W_s} &= \sum_{i=1}^{N}\frac{\partial E_N}{\partial \bar{y}_N}\frac{\partial \bar{y}_N}{\partial \bar{s}_i}\frac{\partial \bar{s}_i}{\partial W_s}\\\\ \frac{\partial E_N}{\partial W_x} &= \sum_{i=1}^{N}\frac{\partial E_N}{\partial \bar{y}_N}\frac{\partial \bar{y}_N}{\partial \bar{s}_i}\frac{\partial \bar{s}_i}{\partial W_x}\\ \end{align}\]
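These summations can be evaluated with a single backward loop that reuses the running chain product \(\frac{\partial E_N}{\partial \bar{s}_i}\). A sketch for a scalar RNN, assuming \(\bar{s}_t = \tanh(W_x x_t + W_s \bar{s}_{t-1})\), \(\bar{y}_N = W_y \bar{s}_N\), squared error, and made-up numbers:

```python
import math

def bptt_grads(Wx, Ws, Wy, xs, target):
    """Evaluate the summation formulas for dE_N/dWs and dE_N/dWx by
    walking backwards, reusing the running chain product dE_N/ds_i."""
    s = [0.0]                                  # s[0] is the initial state
    for x in xs:
        s.append(math.tanh(Wx * x + Ws * s[-1]))
    N = len(xs)
    upstream = (Wy * s[N] - target) * Wy       # dE_N/dy_N * dy_N/ds_N
    grad_ws = grad_wx = 0.0
    for i in range(N, 0, -1):                  # i = N, N-1, ..., 1
        local = 1 - s[i] ** 2                  # tanh'
        grad_ws += upstream * s[i - 1] * local # local ds_i/dWs term
        grad_wx += upstream * xs[i - 1] * local
        upstream *= Ws * local                 # extend chain by ds_i/ds_{i-1}
    return grad_ws, grad_wx
```

Each pass through the loop adds the \(i\)-th term of the sums for \(W_s\) and \(W_x\), then multiplies the upstream factor by \(\frac{\partial \bar{s}_i}{\partial \bar{s}_{i-1}}\) to prepare the next term.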

“Proof.”

Generalizing the results from the sample calculation, it is clear that:

In the general formula for the partial derivative of \(E\) at any time \(N\), the latter two derivatives of the summation term (\(\frac{\partial \bar{y}_N}{\partial \bar{s}_i}\frac{\partial \bar{s}_i}{\partial W_s}\)) can be expanded, using the chain rule, into the partial derivatives of every state with respect to its previous state. \(\blacksquare\)

Generalizing to Other Weights

Theory In Application

To avoid recalculating the partial derivatives for each state over and over (once a time step has passed, its partial derivatives are fixed):

  1. For each time point \(i\) (and corresponding state, \(\bar{s}_i\)):
    1. Calculate \(\frac{\partial \bar{s}_i}{\partial W_s}\)
    2. Calculate \(\frac{\partial \bar{s}_i}{\partial \bar{s}_{i-1}}\)
    3. Store calculations
    4. Multiply/add the stored partial derivatives with the current ones (using the chain rule) to get the appropriate \(\frac{\partial E}{\partial W_s}\) at the current time \(i\)
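The storage scheme above can be sketched as forward accumulation on a scalar RNN: each step computes only the two current local partials, folds them into a stored running derivative, and emits \(\frac{\partial E}{\partial W_s}\) for that step (the state update \(\bar{s}_t = \tanh(W_x x_t + W_s \bar{s}_{t-1})\) and all numbers are assumptions; a full implementation would store matrices):

```python
import math

def online_ws_grads(Wx, Ws, Wy, xs, targets):
    """At each time i, reuse the stored running derivative d s/d Ws instead of
    re-walking all earlier states; emit dE_i/dWs for E_i = (y_i - t_i)^2 / 2."""
    s, ds_total = 0.0, 0.0                    # ds_total accumulates d s_i / d Ws
    grads = []
    for x, t in zip(xs, targets):
        s_prev = s
        s = math.tanh(Wx * x + Ws * s_prev)
        local = s_prev * (1 - s**2)           # step 1: ∂s_i/∂Ws, s_{i-1} held fixed
        chain = Ws * (1 - s**2)               # step 2: ∂s_i/∂s_{i-1}
        ds_total = local + chain * ds_total   # steps 3-4: fold into storage
        grads.append((Wy * s - t) * Wy * ds_total)
    return grads
```

This is the forward-mode counterpart of the backward summation: for the final step it produces the same \(\frac{\partial E_3}{\partial W_s}\) as the three-term sum, without revisiting earlier states.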

Multi-RNN

If you have an RNN feeding into another RNN's memory block, you get a stacked (multi-layer) RNN; training unfolds and backpropagates through both recurrences.

MiniBatch Gradient Descent

Addressing Weaknesses of Simple RNN

Differences Between Simple RNN and LSTM

An LSTM is a simple RNN in which every neuron is replaced with a different internal structure, but the unfolded representation is the same, i.e. one state can still pass information to future states

Simple RNN neuron:

LSTM neuron:

LSTM Advantages:

LSTMs

Basic Architecture

Intuitive Understanding of LSTM (\(\sigma\), \(\tanh\), \(\times\), \(+\))

Learn Gate

  1. compute the information vector (\(N_t\)): combine the vectors of the event (\(E_t\)) and STM (\(STM_{t-1}\)), multiply by weights (\(W_n\)), add bias (\(b_n\)), apply \(\text{tanh()}\)
  2. compute the ignore factor (\(i_t\)): combine the vectors of the event (\(E_t\)) and STM (\(STM_{t-1}\)), multiply by weights (\(W_i\)), add bias (\(b_i\)), apply sigmoid (\(\sigma()\)) to squash between 0 and 1
  3. multiply \(N_t\cdot i_t\) element-wise: to ignore irrelevant information and decide what to keep

Forget Gate

  1. compute the forget factor (\(f_t\)): combine the vectors of the event (\(E_t\)) and STM (\(STM_{t-1}\)), multiply by weights (\(W_f\)), add bias (\(b_f\)), apply sigmoid (\(\sigma()\)) to squash between 0 and 1
  2. multiply \(LTM_{t-1}\cdot f_t\): to decide what LTM to forget

Remember Gate

  1. add the outputs of the Learn and Forget Gates

Use Gate

  1. compute the useful information from the output of the Forget Gate (\(U_t\)): apply a mini-NN to the output of the Forget Gate
  2. compute the useful information from the STM and event (\(V_t\)): apply a mini-NN to \(STM_{t-1}, E_t\)
  3. multiply \(U_t \cdot V_t\) to produce the new STM
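The four gates compose into a single cell update. A scalar sketch of one LSTM step, following the gate descriptions above; the per-gate weight/bias bookkeeping (a dict keyed by gate name) is made up, and real LSTMs use vectors and weight matrices:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(ltm_prev, stm_prev, event, W, b):
    """One scalar LSTM step. W holds hypothetical per-gate weight pairs
    (w_stm, w_event); b holds per-gate biases."""
    def gate_input(g):                  # combine STM_{t-1} and E_t for gate g
        w_stm, w_event = W[g]
        return w_stm * stm_prev + w_event * event + b[g]
    # Learn Gate: candidate information N_t, scaled by the ignore factor i_t
    N_t = math.tanh(gate_input("n"))
    i_t = sigmoid(gate_input("i"))
    learn_out = N_t * i_t
    # Forget Gate: squash f_t into (0, 1), decide what LTM to drop
    f_t = sigmoid(gate_input("f"))
    forget_out = ltm_prev * f_t
    # Remember Gate: new LTM is the kept memory plus the learned information
    ltm = forget_out + learn_out
    # Use Gate: useful part of the forget output times useful part of the input
    U_t = math.tanh(W["u"][0] * forget_out + b["u"])   # mini-NN on forget output
    V_t = sigmoid(gate_input("v"))
    stm = U_t * V_t                     # the new STM (and the cell's output)
    return ltm, stm
```

Note how the Remember Gate output becomes the new \(LTM_t\), while the Use Gate product \(U_t \cdot V_t\) becomes the new \(STM_t\).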

LSTM Variations