Lecture plan

  1. Deep learning
  2. Feed-forward neural networks
  3. Recurrent neural networks

What is Deep Learning (DL)?

A subfield of machine learning concerned with learning representations of data. Exceptionally effective at learning patterns.

Deep learning algorithms attempt to learn (multiple levels of) representations by using a hierarchy of multiple layers.

Deep learning vs neural networks

  • Deep learning refers to “deep” neural networks, i.e. networks with multiple (>2) layers.

Deep learning architectures

  • Feed-forward neural networks
  • Convolutional neural networks
  • Recurrent neural networks
  • Self-organizing maps
  • Autoencoders
  • Transformers: Large Language Models (LLMs)

Feed-forward neural networks

  • A typical multi-layer network consists of an input, hidden and output layer, each fully connected to the next, with activation feeding forward.

  • The weights determine the function computed.

Feed-forward neural networks

\[h = \sigma(W_1 x + b_1)\]
\[y = \sigma(W_2 h + b_2)\]

Feed-forward neural networks

One forward pass
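
A minimal sketch of one forward pass in base R, following the two equations above; the layer sizes and the random weights are illustrative assumptions, not learned values.

sigma <- function(z) 1 / (1 + exp(-z))   # logistic activation

set.seed(1)
x  <- matrix(rnorm(3), ncol = 1)         # input vector (3 features)
W1 <- matrix(rnorm(4 * 3), nrow = 4)     # hidden-layer weights (4 x 3)
b1 <- rnorm(4)                           # hidden-layer bias
W2 <- matrix(rnorm(1 * 4), nrow = 1)     # output-layer weights (1 x 4)
b2 <- rnorm(1)                           # output-layer bias

h <- sigma(W1 %*% x + b1)                # h = sigma(W1 x + b1)
y <- sigma(W2 %*% h + b2)                # y = sigma(W2 h + b2)
y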

Hidden unit representations

  • Trained hidden units can be seen as newly constructed features that make the target concept linearly separable in the transformed space.
  • On many real domains, hidden units can be interpreted as representing meaningful features such as vowel detectors or edge detectors.
  • However, the hidden layer can also become a distributed representation of the input in which each individual unit is not easily interpretable as a meaningful feature.

Overfitting

 

A learned hypothesis may fit the training data very well, even outliers (noise), but fail to generalize to new examples (test data).

How to avoid overfitting?

Overfitting prevention

  • Training for too many epochs can result in overfitting.

  • Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase validation error.
  • To avoid losing training data for validation:
    • Use internal K-fold CV on the training set to compute the average number of epochs that maximizes generalization accuracy.
    • Train final network on complete training set for this many epochs.

Regularization

Dropout
  • Randomly drop units (along with their connections) during training
  • Each unit retained with fixed probability \(p\), independent of other units
  • Hyper-parameter \(p\) to be chosen (tuned)

L2 = weight decay
  • Regularization term that penalizes big weights, added to the objective \(J_{reg}(\theta) = J(\theta) + \lambda\sum_k{\theta_k^2}\)
  • Weight decay value determines how dominant regularization is during gradient computation
  • Big weight decay coefficient → big penalty for big weights

Early-stopping
  • Use validation error to decide when to stop training
  • Stop when monitored quantity has not improved after n subsequent epochs
  • \(n\) is called patience
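
A minimal sketch that combines the three techniques above with the keras package in R; the layer sizes, input_shape, number of epochs, and the x_train / y_train objects are assumptions for illustration.

library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = 100,
              kernel_regularizer = regularizer_l2(l = 0.01)) %>%  # L2 weight decay
  layer_dropout(rate = 0.5) %>%    # note: in Keras, rate is the fraction dropped, i.e. 1 - p
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(optimizer = "adam", loss = "binary_crossentropy", metrics = "accuracy")

# Early stopping: watch the validation error and stop after `patience`
# epochs without improvement
history <- model %>% fit(
  x_train, y_train,
  epochs = 100, batch_size = 32, validation_split = 0.2,
  callbacks = list(callback_early_stopping(monitor = "val_loss", patience = 5))
)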

Recurrent Neural Networks

Recurrent Neural Network (RNN)

  • Add feedback loops where some units’ current outputs determine some future network inputs.
  • RNNs can model dynamic finite-state machines, beyond the static combinatorial circuits modeled by feed-forward networks.

Simple Recurrent Network (SRN)

  • Initially developed by Jeff Elman (“Finding structure in time,” 1990).
  • Additional input to hidden layer is the state of the hidden layer in the previous time step.
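
A minimal sketch of an Elman-style recurrent layer with the keras package in R; the vocabulary and layer sizes are illustrative assumptions.

library(keras)

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 10000, output_dim = 32) %>%  # token embeddings
  layer_simple_rnn(units = 32) %>%   # hidden state of the previous time step feeds back in
  layer_dense(units = 1, activation = "sigmoid")

summary(model)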

Unrolled RNN

  • The behavior of an RNN is perhaps best understood by “unrolling” the network over time.

Training RNNs

  • RNNs can be trained using “backpropagation through time.”
  • Can be viewed as applying normal backprop to the unrolled network.

Long Short Term Memory (LSTM)

  • LSTM networks add gating units to each memory cell:
    • Forget gate
    • Input gate
    • Output gate
  • Mitigates the vanishing/exploding gradient problem and allows the network to retain state information over longer periods of time.
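
For reference, the standard gate equations (in the notation of the blog post linked on the next slide), where \(\odot\) denotes element-wise multiplication:

\[f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)\]
\[i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)\]
\[\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)\]
\[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\]
\[o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)\]
\[h_t = o_t \odot \tanh(C_t)\]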

LSTM network architecture

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

In R

# Use the Keras Functional API
# (maxlen, max_words, dim_size and word_embeds are assumed to be defined already)
library(keras)

input <- layer_input(shape = list(maxlen), name = "input")

model <- input %>%
  # pre-trained word embeddings (word_embeds), kept frozen during training
  layer_embedding(input_dim = max_words, output_dim = dim_size, input_length = maxlen,
                  weights = list(word_embeds), trainable = FALSE) %>%
  layer_lstm(units = 80, return_sequences = TRUE)  # LSTM layer returning the full sequence

output <- model                 %>%
  layer_global_max_pooling_1d() %>%                 # pool over the time dimension
  layer_dense(units = 1, activation = "sigmoid")    # binary classification output

model <- keras_model(input, output)

summary(model)

In R

# Instead of accuracy we can use the AUC metric from tensorflow.keras
model %>% compile(
  optimizer = "adam", 
  loss = "binary_crossentropy",
  metrics = tensorflow::tf$keras$metrics$AUC() # metrics = c('accuracy')
)

In R

history <- model %>% keras::fit(
  x_train, y_train,
  epochs = 10,
  batch_size = 32,
  validation_split = 0.2
)
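
Once fitted, the model can be scored on held-out data; a minimal sketch, assuming x_test and y_test have been prepared the same way as the training data.

# Evaluate on the test set and get predicted probabilities
model %>% evaluate(x_test, y_test, batch_size = 32)

preds <- model %>% predict(x_test)   # probabilities from the sigmoid output unit
head(preds)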

Transformers

Transformers

Contextual Word Embeddings

Transformers

Transformers

Transformer foundation models:
BERT, GPT, BART

BERT: Bidirectional Encoder Representations from Transformers

BERT: Bidirectional Encoder Representations from Transformers

Transformers

ChatGPT (5-min exercise)

  • Go to https://chat.openai.com/ and login

  • How many parameters does the GPT-3 model have?

  • How many parameters does the GPT-4 model have?

  • What is the next generation of NLP?

  • Suppose we want to build an application to help a user buy a car from textual catalogues. The user looks for any car cheaper than $10,000.00. Assume we are using the following data: txt <- c("Price of Tesla S is $8599.99.", "Audi Q4 is $7000.", "BMW X5 costs $900"). Could you give me a regular expression to do this in R? (One possible solution is sketched below.)
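
A minimal sketch of one possible answer in base R; the pattern below is just one of several regular expressions that would work for the toy data in the exercise.

txt <- c("Price of Tesla S is $8599.99.", "Audi Q4 is $7000.", "BMW X5 costs $900")

# Match a dollar sign followed by digits and an optional decimal part
m      <- regexpr("\\$[0-9]+(\\.[0-9]+)?", txt)
prices <- regmatches(txt, m)

# Drop the "$", convert to numeric, and keep cars cheaper than $10,000
price_num <- as.numeric(sub("\\$", "", prices))
txt[price_num < 10000]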

Summary

Summary

  • Deep learning
  • Feed-forward neural networks
  • Recurrent neural networks
  • State-of-the-art LLMs

Practical 8