Lecture plan

  1. Deep learning
  2. Feed-forward neural networks
  3. Recurrent neural networks

What is Deep Learning (DL)?

A subfield of machine learning concerned with learning representations of data. Exceptionally effective at learning patterns.

Deep learning algorithms attempt to learn (multiple levels of) representations by using a hierarchy of multiple layers.

Deep learning vs neural networks

  • Deep learning refers to “deep” neural networks, i.e. networks with multiple (>2) layers.

Deep learning architectures

  • Feed-forward neural networks
  • Convolutional neural networks
  • Recurrent neural networks
  • Self-organizing maps
  • Autoencoders
  • Transformers: Large Language Models (LLMs)

Feed-forward neural networks

  • A typical multi-layer network consists of an input, hidden and output layer, each fully connected to the next, with activation feeding forward.

  • The weights determine the function computed.

Feed-forward neural networks

\[h = \sigma(W_1 x + b_1)\]
\[y = \sigma(W_2 h + b_2)\]

Feed-forward neural networks

One forward pass
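
A minimal sketch of one forward pass in base R, following the two equations above; the layer sizes and the random weights are illustrative assumptions, not learned values.

sigma <- function(z) 1 / (1 + exp(-z))   # logistic activation

set.seed(1)
x  <- matrix(rnorm(3), ncol = 1)         # input vector (3 features)
W1 <- matrix(rnorm(4 * 3), nrow = 4)     # hidden-layer weights (4 x 3)
b1 <- rnorm(4)                           # hidden-layer bias
W2 <- matrix(rnorm(1 * 4), nrow = 1)     # output-layer weights (1 x 4)
b2 <- rnorm(1)                           # output-layer bias

h <- sigma(W1 %*% x + b1)                # h = sigma(W1 x + b1)
y <- sigma(W2 %*% h + b2)                # y = sigma(W2 h + b2)
y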

Hidden unit representations

  • Trained hidden units can be seen as newly constructed features that make the target concept linearly separable in the transformed space.
  • On many real domains, hidden units can be interpreted as representing meaningful features such as vowel detectors or edge detectors.
  • However, the hidden layer can also become a distributed representation of the input in which each individual unit is not easily interpretable as a meaningful feature.

Overfitting

 

A learned hypothesis may fit the training data very well, even outliers (noise), but fail to generalize to new examples (test data).

How to avoid overfitting?

Overfitting prevention

  • Training for too many epochs can result in overfitting.

  • Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase validation error.
  • To avoid losing training data for validation:
    • Use internal K-fold CV on the training set to compute the average number of epochs that maximizes generalization accuracy.
    • Train final network on complete training set for this many epochs.

Regularization

Dropout
  • Randomly drop units (along with their connections) during training
  • Each unit retained with fixed probability \(p\), independent of other units
  • Hyper-parameter \(p\) to be chosen (tuned)

L2 = weight decay
  • Regularization term that penalizes big weights, added to the objective \(J_{reg}(\theta) = J(\theta) + \lambda\sum_k{\theta_k^2}\)
  • Weight decay value determines how dominant regularization is during gradient computation
  • Big weight decay coefficient → big penalty for big weights

Early-stopping
  • Use validation error to decide when to stop training
  • Stop when monitored quantity has not improved after n subsequent epochs
  • \(n\) is called patience
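
A minimal sketch that combines the three techniques above with the keras package in R; the layer sizes, input_shape, number of epochs, and the x_train / y_train objects are assumptions for illustration.

library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = 100,
              kernel_regularizer = regularizer_l2(l = 0.01)) %>%  # L2 weight decay
  layer_dropout(rate = 0.5) %>%    # note: in Keras, rate is the fraction dropped, i.e. 1 - p
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(optimizer = "adam", loss = "binary_crossentropy", metrics = "accuracy")

# Early stopping: watch the validation error and stop after `patience`
# epochs without improvement
history <- model %>% fit(
  x_train, y_train,
  epochs = 100, batch_size = 32, validation_split = 0.2,
  callbacks = list(callback_early_stopping(monitor = "val_loss", patience = 5))
)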

Recurrent Neural Networks

Recurrent Neural Network (RNN)

  • Add feedback loops where some units’ current outputs determine some future network inputs.
  • RNNs can model dynamic finite-state machines, beyond the static combinatorial circuits modeled by feed-forward networks.

Simple Recurrent Network (SRN)

  • Initially developed by Jeff Elman (“Finding structure in time,” 1990).
  • Additional input to hidden layer is the state of the hidden layer in the previous time step.
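
A minimal sketch of an Elman-style recurrent layer with the keras package in R; the vocabulary and layer sizes are illustrative assumptions.

library(keras)

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 10000, output_dim = 32) %>%  # token embeddings
  layer_simple_rnn(units = 32) %>%   # hidden state of the previous time step feeds back in
  layer_dense(units = 1, activation = "sigmoid")

summary(model)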

Unrolled RNN

  • The behavior of an RNN is perhaps best understood by “unrolling” the network over time.

Training RNNs

  • RNNs can be trained using “backpropagation through time.”
  • Can be viewed as applying normal backprop to the unrolled network.

Long Short Term Memory (LSTM)

  • LSTM networks add gating units to each memory cell:
    • Forget gate
    • Input gate
    • Output gate
  • Mitigates the vanishing/exploding gradient problem and allows the network to retain state information over longer periods of time.
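
For reference, the standard gate equations (in the notation of the blog post linked on the next slide), where \(\odot\) denotes element-wise multiplication:

\[f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)\]
\[i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)\]
\[\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)\]
\[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\]
\[o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)\]
\[h_t = o_t \odot \tanh(C_t)\]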

LSTM network architecture

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

In R

# Use the Keras Functional API
# (maxlen, max_words, dim_size and word_embeds are assumed to be defined already)
library(keras)

input <- layer_input(shape = list(maxlen), name = "input")

model <- input %>%
  # pre-trained word embeddings (word_embeds), kept frozen during training
  layer_embedding(input_dim = max_words, output_dim = dim_size, input_length = maxlen,
                  weights = list(word_embeds), trainable = FALSE) %>%
  layer_lstm(units = 80, return_sequences = TRUE)  # LSTM layer returning the full sequence

output <- model                 %>%
  layer_global_max_pooling_1d() %>%                 # pool over the time dimension
  layer_dense(units = 1, activation = "sigmoid")    # binary classification output

model <- keras_model(input, output)

summary(model)

In R

# Instead of accuracy we can use the AUC metric from tensorflow.keras
model %>% compile(
  optimizer = "adam", 
  loss = "binary_crossentropy",
  metrics = tensorflow::tf$keras$metrics$AUC() # metrics = c('accuracy')
)

In R

history <- model %>% keras::fit(
  x_train, y_train,
  epochs = 10,
  batch_size = 32,
  validation_split = 0.2
)
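
Once fitted, the model can be scored on held-out data; a minimal sketch, assuming x_test and y_test have been prepared the same way as the training data.

# Evaluate on the test set and get predicted probabilities
model %>% evaluate(x_test, y_test, batch_size = 32)

preds <- model %>% predict(x_test)   # probabilities from the sigmoid output unit
head(preds)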

Transformers

Transformers

Contextual Word Embeddings

Transformers

Transformers

Transformer foundation models:
BERT, GPT, BART

BERT: Bidirectional Encoder Representations from Transformers

BERT: Bidirectional Encoder Representations from Transformers

Transformers

ChatGPT (5-min exercise)

  • Go to https://chat.openai.com/ and login

  • How many parameters does the GPT-3 model have?

  • How many parameters does the GPT-4 model have?

  • What is the next generation of NLP?

  • Suppose we want to build an application to help a user buy a car from textual catalogues. The user looks for any car cheaper than $10,000.00. Assume we are using the following data: txt <- c("Price of Tesla S is $8599.99.", "Audi Q4 is $7000.", "BMW X5 costs $900"). Could you give me a regular expression to do this in R? (One possible solution is sketched below.)
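
A minimal sketch of one possible answer in base R; the pattern below is just one of several regular expressions that would work for the toy data in the exercise.

txt <- c("Price of Tesla S is $8599.99.", "Audi Q4 is $7000.", "BMW X5 costs $900")

# Match a dollar sign followed by digits and an optional decimal part
m      <- regexpr("\\$[0-9]+(\\.[0-9]+)?", txt)
prices <- regmatches(txt, m)

# Drop the "$", convert to numeric, and keep cars cheaper than $10,000
price_num <- as.numeric(sub("\\$", "", prices))
txt[price_num < 10000]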

Summary

Summary

  • Deep learning
  • Feed-forward neural networks
  • Recurrent neural networks
  • State-of-the-art LLMs

Practical 8