Lecture plan

  1. Language modeling
  2. Feed-forward neural networks
  3. Recurrent neural networks

What is a Language Model?

A Language Model is a statistical model that predicts the next word or sequence of words given the preceding context. These models are fundamental in natural language processing (NLP) and are used to understand and generate human language.

Key Points:

  • Predictive Power: Language models can predict the next word based on the context of previous words.
  • Applications: Used in machine translation, speech recognition, text generation, and more.
  • Training Data: Trained on large datasets containing text, learning the structure and nuances of the language.

Types of Models:

  • n-gram Models: Predict the next word from the previous \(n-1\) words (see the bigram sketch below).
  • Neural Network Models: Use deep learning to understand and generate text, such as GPT (Generative Pre-trained Transformer).
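To make the n-gram idea concrete, below is a minimal bigram (\(n = 2\)) sketch in Python; the toy corpus and the helper predict_next are invented for illustration.

    from collections import Counter, defaultdict

    # Toy corpus (invented for illustration)
    corpus = "the cat sat on the mat . the dog sat on the rug .".split()

    # Count how often each word follows each previous word
    bigram_counts = defaultdict(Counter)
    for prev, curr in zip(corpus, corpus[1:]):
        bigram_counts[prev][curr] += 1

    def predict_next(word):
        """Maximum-likelihood distribution over the next word, given the previous word."""
        counts = bigram_counts[word]
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    print(predict_next("the"))  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}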

Deep Learning

What is Deep Learning?

A subfield of machine learning focused on learning representations of data. Exceptionally effective at learning patterns.

Deep learning algorithms attempt to learn (multiple levels of) representation by using a hierarchy of multiple layers.

Feed-forward neural networks

  • A typical multi-layer network consists of an input, hidden and output layer, each fully connected to the next, with activation feeding forward.

  • The weights determine the function computed.

Feed-forward neural networks

\[h = \sigma(W_1x + b_1)\] \[y = \sigma(W_2h + b_2)\]

Feed-forward neural networks

One forward pass
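A minimal NumPy sketch of this forward pass, following the equations \(h = \sigma(W_1x + b_1)\) and \(y = \sigma(W_2h + b_2)\) above; the layer sizes and random weights are arbitrary illustrative choices.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)

    # Arbitrary sizes: 4 inputs, 3 hidden units, 2 outputs
    W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
    W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)

    x = rng.normal(size=4)        # one input vector
    h = sigmoid(W1 @ x + b1)      # hidden layer: h = sigma(W1 x + b1)
    y = sigmoid(W2 @ h + b2)      # output layer: y = sigma(W2 h + b2)
    print(h, y)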

Training

Updating weights

objective/cost function \(J(\theta)\)

Update each element of \(\theta\):

\[\theta^{new}_j = \theta^{old}_j - \alpha \frac{\partial}{\partial \theta^{old}_j} J(\theta)\]

Matrix notation for all parameters ( \(\alpha\): learning rate):

\[\theta^{new} = \theta^{old} - \alpha \nabla_{\theta} J(\theta)\]

Recursively apply the chain rule through each node (backpropagation), as in the sketch below.
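A sketch of one such update for the two-layer network above, with the gradients obtained by applying the chain rule by hand; the squared-error objective, sigmoid activations, and layer sizes are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
    W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
    x, t = rng.normal(size=4), np.array([0.0, 1.0])  # input and target (made up)
    alpha = 0.1                                      # learning rate

    # Forward pass
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    J = 0.5 * np.sum((y - t) ** 2)                   # objective J(theta)

    # Backward pass: chain rule through each node
    delta2 = (y - t) * y * (1 - y)                   # dJ/d(output pre-activation)
    delta1 = (W2.T @ delta2) * h * (1 - h)           # dJ/d(hidden pre-activation)

    # Gradient-descent update: theta_new = theta_old - alpha * grad J(theta)
    W2 -= alpha * np.outer(delta2, h); b2 -= alpha * delta2
    W1 -= alpha * np.outer(delta1, x); b1 -= alpha * delta1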

Notes on training

  • Not guaranteed to converge to zero training error, may converge to local optima or oscillate indefinitely.
  • However, in practice, does converge to low error for many large networks on real data.
  • Many epochs (thousands) may be required, hours or days of training for large networks.
  • To avoid local-minima problems, run several trials starting with different random weights (random restarts).
    • Take results of trial with lowest training set error.
    • Build a committee of results from multiple trials (possibly weighting votes by training set accuracy).

Hidden unit representations

  • Trained hidden units can be seen as newly constructed features that make the target concept linearly separable in the transformed space.
  • On many real domains, hidden units can be interpreted as representing meaningful features such as vowel detectors or edge detectors.
  • However, the hidden layer can also become a distributed representation of the input in which each individual unit is not easily interpretable as a meaningful feature.

Overfitting

 

The learned hypothesis may fit the training data very well, even outliers (noise), but fail to generalize to new examples (test data).

How to avoid overfitting?

Overfitting prevention

  • Running too many epochs can result in over-fitting.

  • Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase validation error.
  • To avoid losing training data for validation:
    • Use internal K-fold CV on the training set to compute the average number of epochs that maximizes generalization accuracy.
    • Train final network on complete training set for this many epochs.

Regularization

Dropout
  • Randomly drop units (along with their connections) during training
  • Each unit retained with fixed probability \(p\), independent of other units
  • Hyper-parameter \(p\) to be chosen (tuned)
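A minimal sketch of dropout applied to a hidden activation vector, using the common “inverted dropout” variant that rescales by \(1/p\) at training time (an implementation choice assumed here, not taken from the slides).

    import numpy as np

    def dropout(h, p, rng, train=True):
        """Keep each unit with probability p; rescale so expected activations are unchanged."""
        if not train:
            return h                        # no dropout at test time
        mask = rng.random(h.shape) < p      # each unit kept independently with probability p
        return h * mask / p

    rng = np.random.default_rng(0)
    h = rng.normal(size=5)                  # some hidden-layer activations
    print(dropout(h, p=0.5, rng=rng))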

L2 regularization (weight decay)
  • Regularization term that penalizes big weights, added to the objective \(J_{reg}(\theta) = J(\theta) + \lambda\sum_k{\theta_k^2}\)
  • Weight decay value determines how dominant regularization is during gradient computation
  • Big weight decay coefficient → big penalty for big weights
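A small sketch of how the L2 penalty enters the objective and its gradient; the base objective \(J\) and the parameter values here are placeholders for illustration.

    import numpy as np

    lam = 0.01                               # weight decay coefficient (lambda)
    theta = np.array([0.5, -2.0, 3.0])       # illustrative parameter vector

    def J(theta):                            # placeholder data-fit objective
        return np.sum((theta - 1.0) ** 2)

    def J_reg(theta):                        # J_reg = J + lambda * sum_k theta_k^2
        return J(theta) + lam * np.sum(theta ** 2)

    # The penalty contributes 2 * lambda * theta to the gradient, so each update
    # also shrinks ("decays") every weight toward zero.
    grad_penalty = 2 * lam * theta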
Early-stopping
  • Use validation error to decide when to stop training
  • Stop when the monitored quantity has not improved for \(n\) subsequent epochs
  • \(n\) is called patience
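A minimal sketch of patience-based early stopping; the per-epoch validation errors are made up here and would in practice come from evaluating the hold-out validation set after each epoch.

    # Illustrative validation errors, one per epoch (invented for this sketch)
    val_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.54]

    patience = 3                  # n: stop after n epochs without improvement
    best_err, best_epoch, wait = float("inf"), 0, 0

    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch, wait = err, epoch, 0   # improvement: reset counter
        else:
            wait += 1                                    # no improvement this epoch
            if wait >= patience:
                print(f"stop at epoch {epoch}; best was epoch {best_epoch}")
                break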

Determining the best number of hidden units

  • Too few hidden units prevents the network from adequately fitting the data.
  • Too many hidden units can result in over-fitting.

  • Use internal cross-validation to empirically determine an optimal number of hidden units.

  • Hyperparameter tuning
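A hedged sketch of choosing the number of hidden units by internal cross-validation; scikit-learn, the candidate sizes, and the synthetic dataset are used purely for illustration and are not prescribed by the lecture.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=300, random_state=0)   # toy dataset

    search = GridSearchCV(
        MLPClassifier(max_iter=2000, random_state=0),
        param_grid={"hidden_layer_sizes": [(2,), (8,), (32,), (128,)]},
        cv=5,                         # internal 5-fold cross-validation
    )
    search.fit(X, y)
    print(search.best_params_)        # hidden-layer size with the best CV accuracy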

A Neural Network Playground

Recurrent Neural Networks

Recurrent Neural Network (RNN)

  • Another neural network architecture
  • RNNs can be used for language modeling (LM)
  • Add feedback loops where some units’ current outputs determine some future network inputs.
  • RNNs can model dynamic finite-state machines, beyond the static combinatorial circuits modeled by feed-forward networks.

Simple Recurrent Network (SRN)

  • Initially developed by Jeff Elman (“Finding structure in time,” 1990).
  • An additional input to the hidden layer is the state of the hidden layer at the previous time step.

Unrolled RNN

  • The behavior of an RNN is perhaps best viewed by “unrolling” the network over time (see the sketch below).
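A minimal NumPy sketch of a simple recurrent (Elman) step and its unrolling over a short sequence; the sizes, weights, and inputs are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    input_size, hidden_size = 4, 3
    W_xh = rng.normal(size=(hidden_size, input_size))    # input-to-hidden weights
    W_hh = rng.normal(size=(hidden_size, hidden_size))   # hidden-to-hidden (recurrent) weights
    b_h = np.zeros(hidden_size)

    def srn_step(x_t, h_prev):
        """New hidden state depends on the current input and the previous hidden state."""
        return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

    # Unrolling over time: the same weights are applied at every step
    h = np.zeros(hidden_size)
    for x_t in rng.normal(size=(5, input_size)):          # a sequence of 5 input vectors
        h = srn_step(x_t, h)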

LSTM

Vanishing gradient problem

Long Short Term Memory

  • LSTM networks add additional gating units to each memory cell:
    • Forget gate
    • Input gate
    • Output gate
  • Helps prevent the vanishing/exploding gradient problem and allows the network to retain state information over longer periods of time.

LSTM network architecture

https://colah.github.io/posts/2015-08-Understanding-LSTMs/
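For reference, the gating equations of a standard LSTM cell, written in the same notation as the GRU equations later in this lecture (this follows the formulation in the post linked above rather than text from the slides):

\[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\] \[i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\] \[\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\] \[C_t = f_t * C_{t-1} + i_t * \tilde{C}_t\] \[o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)\] \[h_t = o_t * \tanh(C_t)\]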


Bi-directional LSTM (Bi-LSTM)

  • Separate LSTMs process the sequence forward and backward, and the hidden states at each time step are concatenated to form the output.
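A sketch of the bidirectional idea, using the simple recurrent step from earlier as a stand-in for the LSTM cell (a simplification for brevity): one pass runs left-to-right, one right-to-left, and the per-step hidden states are concatenated.

    import numpy as np

    rng = np.random.default_rng(0)
    input_size, hidden_size, T = 4, 3, 5
    W_fwd = rng.normal(size=(hidden_size, input_size + hidden_size))   # forward-cell weights
    W_bwd = rng.normal(size=(hidden_size, input_size + hidden_size))   # backward-cell weights

    def step(W, x_t, h_prev):
        return np.tanh(W @ np.concatenate([x_t, h_prev]))

    xs = rng.normal(size=(T, input_size))                # an input sequence of length T

    forward, h = [], np.zeros(hidden_size)
    for x_t in xs:                                       # left-to-right pass
        h = step(W_fwd, x_t, h); forward.append(h)

    backward, h = [], np.zeros(hidden_size)
    for x_t in xs[::-1]:                                 # right-to-left pass
        h = step(W_bwd, x_t, h); backward.append(h)
    backward = backward[::-1]                            # re-align with the original order

    outputs = [np.concatenate([f, b]) for f, b in zip(forward, backward)]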

Gated Recurrent Unit (GRU)

  • An alternative RNN to the LSTM that uses fewer gates (Cho et al., 2014)
    • Combines the forget and input gates into a single “update” gate.
    • Eliminates the separate cell state vector.

\[z_t = \sigma(W_z \cdot [h_{t-1}, x_t])\] \[r_t = \sigma(W_r \cdot [h_{t-1}, x_t])\] \[\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t])\] \[h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t\]
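A direct NumPy transcription of these GRU equations; sizes and weights are illustrative, and biases are omitted to match the equations as written.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    input_size, hidden_size = 4, 3
    W_z = rng.normal(size=(hidden_size, hidden_size + input_size))
    W_r = rng.normal(size=(hidden_size, hidden_size + input_size))
    W = rng.normal(size=(hidden_size, hidden_size + input_size))

    def gru_step(x_t, h_prev):
        hx = np.concatenate([h_prev, x_t])                        # [h_{t-1}, x_t]
        z = sigmoid(W_z @ hx)                                     # update gate
        r = sigmoid(W_r @ hx)                                     # reset gate
        h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x_t]))  # candidate state
        return (1 - z) * h_prev + z * h_tilde                     # interpolate old and new

    h = gru_step(rng.normal(size=input_size), np.zeros(hidden_size))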

Attention

  • For many applications, it helps to add “attention” to RNNs.
  • Allows network to learn to attend to different parts of the input at different time steps, shifting its attention to focus on different aspects during its processing.
  • Used in image captioning to focus on different parts of an image when generating different parts of the output sentence.
  • In MT, allows focusing attention on different parts of the source sentence when generating different parts of the translation.
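A minimal sketch of attention over a sequence of encoder states, using dot-product scoring (one common choice, assumed here for illustration): a query vector, e.g. the current decoder state, is scored against each encoder state, and a softmax over the scores weights them into a context vector.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    encoder_states = rng.normal(size=(6, 8))   # one vector per source position (illustrative)
    query = rng.normal(size=8)                 # current decoder state

    scores = encoder_states @ query            # relevance of each source position
    weights = softmax(scores)                  # attention distribution over positions
    context = weights @ encoder_states         # weighted sum the model "attends" to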

Summary

  • Language Model: A system that predicts the next word

  • Deep learning can be applied for automatic feature engineering

  • Recurrent Neural Network: A family of deep learning / neural networks that:
    • Take sequential input (text) of any length; apply the same weights at each step
    • Can optionally produce output at each step

  • Recurrent Neural Network ≠ Language Model

  • RNNs can be used for many other things

  • Language modeling can be done with different models, e.g., n-grams or transformers: GPT is an LM!

Practical 6