Lecture plan

  1. Language modeling
  2. Feed-forward neural networks
  3. Recurrent neural networks

What is a Language Model?

A Language Model is a statistical model that predicts the next word or sequence of words given the preceding context. These models are fundamental in natural language processing (NLP) and are used to understand and generate human language.

Key Points:

  • Predictive Power: Language models can predict the next word based on the context of previous words.
  • Applications: Used in machine translation, speech recognition, text generation, and more.
  • Training Data: Trained on large datasets containing text, learning the structure and nuances of the language.

Types of Models:

  • n-gram Models: Predict the next word from the previous \(n-1\) words (see the bigram sketch below).
  • Neural Network Models: Use deep learning to understand and generate text, such as GPT (Generative Pre-trained Transformer).
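To make the n-gram idea concrete, below is a minimal bigram (\(n = 2\)) sketch in Python; the toy corpus and the helper predict_next are invented for illustration.

    from collections import Counter, defaultdict

    # Toy corpus (invented for illustration)
    corpus = "the cat sat on the mat . the dog sat on the rug .".split()

    # Count how often each word follows each previous word
    bigram_counts = defaultdict(Counter)
    for prev, curr in zip(corpus, corpus[1:]):
        bigram_counts[prev][curr] += 1

    def predict_next(word):
        """Maximum-likelihood distribution over the next word, given the previous word."""
        counts = bigram_counts[word]
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    print(predict_next("the"))  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}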

Deep Learning

What is Deep Learning?

A subfield of machine learning focused on learning representations of data. Exceptionally effective at learning patterns.

Deep learning algorithms attempt to learn (multiple levels of) representation by using a hierarchy of multiple layers.

Feed-forward neural networks

  • A typical multi-layer network consists of an input, hidden and output layer, each fully connected to the next, with activation feeding forward.

  • The weights determine the function computed.

Feed-forward neural networks

\[h = \sigma(W_1x + b_1)\] \[y = \sigma(W_2h + b_2)\]

Feed-forward neural networks

One forward pass
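A minimal NumPy sketch of this forward pass, following the equations \(h = \sigma(W_1x + b_1)\) and \(y = \sigma(W_2h + b_2)\) above; the layer sizes and random weights are arbitrary illustrative choices.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)

    # Arbitrary sizes: 4 inputs, 3 hidden units, 2 outputs
    W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
    W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)

    x = rng.normal(size=4)        # one input vector
    h = sigmoid(W1 @ x + b1)      # hidden layer: h = sigma(W1 x + b1)
    y = sigmoid(W2 @ h + b2)      # output layer: y = sigma(W2 h + b2)
    print(h, y)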

Training

Updating weights

objective/cost function \(J(\theta)\)

Update each element of \(\theta\):

\[\theta^{new}_j = \theta^{old}_j - \alpha \frac{\partial}{\partial \theta^{old}_j} J(\theta)\]

Matrix notation for all parameters ( \(\alpha\): learning rate):

\[\theta^{new} = \theta^{old} - \alpha \nabla_{\theta} J(\theta)\]

Recursively apply the chain rule through each node (backpropagation), as in the sketch below.
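A sketch of one such update for the two-layer network above, with the gradients obtained by applying the chain rule by hand; the squared-error objective, sigmoid activations, and layer sizes are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
    W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
    x, t = rng.normal(size=4), np.array([0.0, 1.0])  # input and target (made up)
    alpha = 0.1                                      # learning rate

    # Forward pass
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    J = 0.5 * np.sum((y - t) ** 2)                   # objective J(theta)

    # Backward pass: chain rule through each node
    delta2 = (y - t) * y * (1 - y)                   # dJ/d(output pre-activation)
    delta1 = (W2.T @ delta2) * h * (1 - h)           # dJ/d(hidden pre-activation)

    # Gradient-descent update: theta_new = theta_old - alpha * grad J(theta)
    W2 -= alpha * np.outer(delta2, h); b2 -= alpha * delta2
    W1 -= alpha * np.outer(delta1, x); b1 -= alpha * delta1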

Notes on training

  • Not guaranteed to converge to zero training error, may converge to local optima or oscillate indefinitely.
  • However, in practice, does converge to low error for many large networks on real data.
  • Many epochs (thousands) may be required, hours or days of training for large networks.
  • To avoid local-minima problems, run several trials starting with different random weights (random restarts).
    • Take results of trial with lowest training set error.
    • Build a committee of results from multiple trials (possibly weighting votes by training set accuracy).

Hidden unit representations

  • Trained hidden units can be seen as newly constructed features that make the target concept linearly separable in the transformed space.
  • On many real domains, hidden units can be interpreted as representing meaningful features such as vowel detectors or edge detectors.
  • However, the hidden layer can also become a distributed representation of the input in which each individual unit is not easily interpretable as a meaningful feature.

Overfitting

 

The learned hypothesis may fit the training data very well, even outliers (noise), but fail to generalize to new examples (test data).

How to avoid overfitting?

Overfitting prevention

  • Running too many epochs can result in over-fitting.

  • Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase validation error.
  • To avoid losing training data for validation:
    • Use internal K-fold CV on the training set to compute the average number of epochs that maximizes generalization accuracy.
    • Train final network on complete training set for this many epochs.

Regularization

Dropout
  • Randomly drop units (along with their connections) during training
  • Each unit retained with fixed probability \(p\), independent of other units
  • Hyper-parameter \(p\) to be chosen (tuned)
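A minimal sketch of dropout applied to a hidden activation vector, using the common “inverted dropout” variant that rescales by \(1/p\) at training time (an implementation choice assumed here, not taken from the slides).

    import numpy as np

    def dropout(h, p, rng, train=True):
        """Keep each unit with probability p; rescale so expected activations are unchanged."""
        if not train:
            return h                        # no dropout at test time
        mask = rng.random(h.shape) < p      # each unit kept independently with probability p
        return h * mask / p

    rng = np.random.default_rng(0)
    h = rng.normal(size=5)                  # some hidden-layer activations
    print(dropout(h, p=0.5, rng=rng))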

L2 regularization (weight decay)
  • Regularization term that penalizes big weights, added to the objective \(J_{reg}(\theta) = J(\theta) + \lambda\sum_k{\theta_k^2}\)
  • Weight decay value determines how dominant regularization is during gradient computation
  • Big weight decay coefficient → big penalty for big weights
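A small sketch of how the L2 penalty enters the objective and its gradient; the base objective \(J\) and the parameter values here are placeholders for illustration.

    import numpy as np

    lam = 0.01                               # weight decay coefficient (lambda)
    theta = np.array([0.5, -2.0, 3.0])       # illustrative parameter vector

    def J(theta):                            # placeholder data-fit objective
        return np.sum((theta - 1.0) ** 2)

    def J_reg(theta):                        # J_reg = J + lambda * sum_k theta_k^2
        return J(theta) + lam * np.sum(theta ** 2)

    # The penalty contributes 2 * lambda * theta to the gradient, so each update
    # also shrinks ("decays") every weight toward zero.
    grad_penalty = 2 * lam * theta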
Early-stopping
  • Use validation error to decide when to stop training
  • Stop when the monitored quantity has not improved for \(n\) subsequent epochs
  • \(n\) is called patience
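A minimal sketch of patience-based early stopping; the per-epoch validation errors are made up here and would in practice come from evaluating the hold-out validation set after each epoch.

    # Illustrative validation errors, one per epoch (invented for this sketch)
    val_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.54]

    patience = 3                  # n: stop after n epochs without improvement
    best_err, best_epoch, wait = float("inf"), 0, 0

    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch, wait = err, epoch, 0   # improvement: reset counter
        else:
            wait += 1                                    # no improvement this epoch
            if wait >= patience:
                print(f"stop at epoch {epoch}; best was epoch {best_epoch}")
                break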

Determining the best number of hidden units

  • Too few hidden units prevents the network from adequately fitting the data.
  • Too many hidden units can result in over-fitting.

  • Use internal cross-validation to empirically determine an optimal number of hidden units.

  • Hyperparameter tuning
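A hedged sketch of choosing the number of hidden units by internal cross-validation; scikit-learn, the candidate sizes, and the synthetic dataset are used purely for illustration and are not prescribed by the lecture.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=300, random_state=0)   # toy dataset

    search = GridSearchCV(
        MLPClassifier(max_iter=2000, random_state=0),
        param_grid={"hidden_layer_sizes": [(2,), (8,), (32,), (128,)]},
        cv=5,                         # internal 5-fold cross-validation
    )
    search.fit(X, y)
    print(search.best_params_)        # hidden-layer size with the best CV accuracy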

A Neural Network Playground

Recurrent Neural Networks

Recurrent Neural Network (RNN)

  • Another neural network architecture
  • RNNs can be used for language modeling (LM)
  • Add feedback loops where some units’ current outputs determine some future network inputs.
  • RNNs can model dynamic finite-state machines, beyond the static combinatorial circuits modeled by feed-forward networks.

Simple Recurrent Network (SRN)

  • Initially developed by Jeff Elman (“Finding structure in time,” 1990).
  • An additional input to the hidden layer is the state of the hidden layer at the previous time step.

Unrolled RNN

  • The behavior of an RNN is perhaps best viewed by “unrolling” the network over time (see the sketch below).
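A minimal NumPy sketch of a simple recurrent (Elman) step and its unrolling over a short sequence; the sizes, weights, and inputs are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    input_size, hidden_size = 4, 3
    W_xh = rng.normal(size=(hidden_size, input_size))    # input-to-hidden weights
    W_hh = rng.normal(size=(hidden_size, hidden_size))   # hidden-to-hidden (recurrent) weights
    b_h = np.zeros(hidden_size)

    def srn_step(x_t, h_prev):
        """New hidden state depends on the current input and the previous hidden state."""
        return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

    # Unrolling over time: the same weights are applied at every step
    h = np.zeros(hidden_size)
    for x_t in rng.normal(size=(5, input_size)):          # a sequence of 5 input vectors
        h = srn_step(x_t, h)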

LSTM

Vanishing gradient problem

Long Short Term Memory

  • LSTM networks add additional gating units to each memory cell:
    • Forget gate
    • Input gate
    • Output gate
  • Helps prevent the vanishing/exploding gradient problem and allows the network to retain state information over longer periods of time.

LSTM network architecture

https://colah.github.io/posts/2015-08-Understanding-LSTMs/
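For reference, the gating equations of a standard LSTM cell, written in the same notation as the GRU equations later in this lecture (this follows the formulation in the post linked above rather than text from the slides):

\[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\] \[i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\] \[\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\] \[C_t = f_t * C_{t-1} + i_t * \tilde{C}_t\] \[o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)\] \[h_t = o_t * \tanh(C_t)\]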


Bi-directional LSTM (Bi-LSTM)

  • Separate LSTMs process the sequence forward and backward, and the hidden states at each time step are concatenated to form the output.
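A sketch of the bidirectional idea, using the simple recurrent step from earlier as a stand-in for the LSTM cell (a simplification for brevity): one pass runs left-to-right, one right-to-left, and the per-step hidden states are concatenated.

    import numpy as np

    rng = np.random.default_rng(0)
    input_size, hidden_size, T = 4, 3, 5
    W_fwd = rng.normal(size=(hidden_size, input_size + hidden_size))   # forward-cell weights
    W_bwd = rng.normal(size=(hidden_size, input_size + hidden_size))   # backward-cell weights

    def step(W, x_t, h_prev):
        return np.tanh(W @ np.concatenate([x_t, h_prev]))

    xs = rng.normal(size=(T, input_size))                # an input sequence of length T

    forward, h = [], np.zeros(hidden_size)
    for x_t in xs:                                       # left-to-right pass
        h = step(W_fwd, x_t, h); forward.append(h)

    backward, h = [], np.zeros(hidden_size)
    for x_t in xs[::-1]:                                 # right-to-left pass
        h = step(W_bwd, x_t, h); backward.append(h)
    backward = backward[::-1]                            # re-align with the original order

    outputs = [np.concatenate([f, b]) for f, b in zip(forward, backward)]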

Gated Recurrent Unit (GRU)

  • An alternative RNN to the LSTM that uses fewer gates (Cho et al., 2014)
    • Combines the forget and input gates into a single “update” gate.
    • Eliminates the separate cell state vector.

\[z_t = \sigma(W_z \cdot [h_{t-1}, x_t])\] \[r_t = \sigma(W_r \cdot [h_{t-1}, x_t])\] \[\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t])\] \[h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t\]
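A direct NumPy transcription of these GRU equations; sizes and weights are illustrative, and biases are omitted to match the equations as written.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    input_size, hidden_size = 4, 3
    W_z = rng.normal(size=(hidden_size, hidden_size + input_size))
    W_r = rng.normal(size=(hidden_size, hidden_size + input_size))
    W = rng.normal(size=(hidden_size, hidden_size + input_size))

    def gru_step(x_t, h_prev):
        hx = np.concatenate([h_prev, x_t])                        # [h_{t-1}, x_t]
        z = sigmoid(W_z @ hx)                                     # update gate
        r = sigmoid(W_r @ hx)                                     # reset gate
        h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x_t]))  # candidate state
        return (1 - z) * h_prev + z * h_tilde                     # interpolate old and new

    h = gru_step(rng.normal(size=input_size), np.zeros(hidden_size))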

Attention

  • For many applications, it helps to add “attention” to RNNs.
  • Allows network to learn to attend to different parts of the input at different time steps, shifting its attention to focus on different aspects during its processing.
  • Used in image captioning to focus on different parts of an image when generating different parts of the output sentence.
  • In MT, allows focusing attention on different parts of the source sentence when generating different parts of the translation.
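A minimal sketch of attention over a sequence of encoder states, using dot-product scoring (one common choice, assumed here for illustration): a query vector, e.g. the current decoder state, is scored against each encoder state, and a softmax over the scores weights them into a context vector.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    encoder_states = rng.normal(size=(6, 8))   # one vector per source position (illustrative)
    query = rng.normal(size=8)                 # current decoder state

    scores = encoder_states @ query            # relevance of each source position
    weights = softmax(scores)                  # attention distribution over positions
    context = weights @ encoder_states         # weighted sum the model "attends" to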

Summary

  • Language Model: A system that predicts the next word

  • Deep learning can be applied for automatic feature engineering

  • Recurrent Neural Network: A family of deep learning / neural networks that:
    • Take sequential input (text) of any length; apply the same weights at each step
    • Can optionally produce output at each step

  • Recurrent Neural Network ≠ Language Model

  • RNNs can be used for many other things

  • Language modeling can be done with different models, e.g., n-grams or transformers: GPT is an LM!

Practical 6