Recap: RNN in Python

Lecture plan

  1. Convolutional Neural Networks
  2. Transformers
  3. BERT

Convolutional Neural Network (CNN)

  • Intuition: Neural network with specialized connectivity structure
    • Stacking multiple layers of feature extractors: low-level layers extract local features, and high-level layers learn global patterns.
  • There are a few distinct types of layers:
    • Convolution Layer: detecting local features through filters (discrete convolution)
    • Pooling Layer: merging similar features

Convolution layer

  • The core layer of CNNs
  • Convolutional layer consists of a set of filters
  • Each filter covers a spatially small portion of the input data
  • Each filter is convolved across the dimensions of the input data, producing a multidimensional feature map.
  • As we convolve the filter, we compute the dot product between the parameters of the filter and the input (see the sketch below).
  • Deep learning algorithm: during training, the network corrects its errors and the filters are learned, e.g., in Keras, by adjusting the weights with Stochastic Gradient Descent (SGD).
  • The key architectural characteristics of the convolutional layer are local connectivity and shared weights.
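As a concrete illustration of the dot-product computation above, here is a minimal NumPy sketch of a 1-D discrete convolution without padding (the input and filter values are made up for the example):

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])  # input signal (hypothetical values)
w = np.array([1., 0., -1.])         # one 1-D filter of size 3

# Slide the filter across the input; each output value is the dot
# product between the filter parameters and the current input window.
out = np.array([w @ x[i:i + len(w)] for i in range(len(x) - len(w) + 1)])
print(out)  # [-2. -2. -2.]: a feature map of length 5 - 3 + 1 = 3
```

(Strictly speaking, the "convolution" used in deep learning is cross-correlation: the filter is not flipped.)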

Convolution without padding

Convolution with padding
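A minimal Keras sketch of both cases (the layer sizes are arbitrary): padding="valid" lets the feature map shrink, while padding="same" zero-pads the input so the output keeps the input length.

```python
import numpy as np
from tensorflow.keras import layers

x = np.random.rand(1, 10, 1)  # batch of 1 sequence, length 10, 1 channel

no_pad = layers.Conv1D(filters=4, kernel_size=3, padding="valid")(x)
padded = layers.Conv1D(filters=4, kernel_size=3, padding="same")(x)

print(no_pad.shape)  # (1, 8, 4): length shrinks to 10 - 3 + 1 = 8
print(padded.shape)  # (1, 10, 4): zero-padding keeps length 10
```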

Pooling layer

  • Intuition: progressively reduce the spatial size of the representation in order to reduce the number of parameters and the amount of computation in the network, and hence also to control overfitting
  • Pooling partitions the input image (or document) into a set of non-overlapping rectangles (n-grams, for text) and, for each such sub-region, outputs the maximum value of the features in that region.

Pooling (downsampling)
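A minimal NumPy sketch of max pooling over non-overlapping windows (the feature values are made up):

```python
import numpy as np

feature_map = np.array([1., 5., 2., 8., 3., 3.])  # hypothetical features

# Partition into non-overlapping windows of size 2 and keep each maximum.
pooled = feature_map.reshape(-1, 2).max(axis=1)
print(pooled)  # [5. 8. 3.]: the representation is halved
```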

Convolutional neural network

For processing data with a grid-like or array topology (input shapes are sketched after the list):

  • 1-D convolution: text data, sequence data, time-series data, sensor signal data

  • 2-D convolution: image data

  • 3-D convolution: video data
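In Keras these three cases correspond to the Conv1D, Conv2D and Conv3D layers; a sketch of the expected input shapes (all sizes below are arbitrary):

```python
import numpy as np
from tensorflow.keras import layers

# 1-D: (batch, steps, channels), e.g. a sequence of 50-dim word embeddings
print(layers.Conv1D(8, 3)(np.random.rand(1, 20, 50)).shape)         # (1, 18, 8)

# 2-D: (batch, height, width, channels), e.g. a 28x28 RGB image
print(layers.Conv2D(8, 3)(np.random.rand(1, 28, 28, 3)).shape)      # (1, 26, 26, 8)

# 3-D: (batch, frames, height, width, channels), e.g. a short video clip
print(layers.Conv3D(8, 3)(np.random.rand(1, 10, 28, 28, 3)).shape)  # (1, 8, 26, 26, 8)
```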

Other layers

  • The convolution and pooling layers are typically used as a set. Multiple sets of these layers can appear in a CNN design.
  • After a few sets, the output is typically sent to one or two fully connected layers.
    • A fully connected layer is an ordinary neural network layer, as in other neural networks.
    • A typical activation function is the sigmoid function.
    • The output is typically a class (classification) or a real number (regression).

Other layers

  • The final layer of a CNN is determined by the research task.
  • Classification: Softmax Layer \[P(y=j \mid \boldsymbol{x}) = \frac{e^{\boldsymbol{w}_j \cdot \boldsymbol{x}}}{\sum_{k=1}^{K} e^{\boldsymbol{w}_k \cdot \boldsymbol{x}}}\]
    • The outputs are the probabilities of belonging to each class (see the worked example after this list).
  • Regression: Linear Layer \[f(\boldsymbol{x}) = \boldsymbol{w} \cdot \boldsymbol{x}\]
    • The output is a real number.
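The softmax formula can be checked numerically; a small sketch with made-up weights for K = 3 classes:

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])      # input features (hypothetical)
W = np.array([[0.1, 0.4, -0.2],     # one weight vector w_j per class
              [-0.3, 0.2, 0.5],
              [0.6, -0.1, 0.3]])

scores = W @ x                      # w_j . x for each class j
probs = np.exp(scores) / np.exp(scores).sum()
print(probs)        # the probabilities of belonging to each class
print(probs.sum())  # 1.0
```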

What hyperparameters do we have in a CNN model?

CNN for Text

CNN

Main CNN idea for text:

Compute vectors for n-grams and group them afterwards



Example: for the sentence “Utrecht summer school is in Utrecht”, compute vectors for:

Utrecht summer, summer school, school is, is in, in Utrecht, Utrecht summer school, summer school is, school is in, is in Utrecht, Utrecht summer school is, summer school is in, school is in Utrecht, Utrecht summer school is in, summer school is in Utrecht, Utrecht summer school is in Utrecht
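A short sketch of how those fifteen n-grams can be enumerated in plain Python:

```python
sentence = "Utrecht summer school is in Utrecht".split()

# All n-grams from bigrams up to the full six-token sentence.
ngrams = [" ".join(sentence[i:i + n])
          for n in range(2, len(sentence) + 1)
          for i in range(len(sentence) - n + 1)]
print(len(ngrams))  # 15
print(ngrams)
```

In a CNN for text, a convolutional filter of width n acts as an n-gram detector, so these n-gram vectors do not have to be enumerated explicitly.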

CNNs for sentence classification

Data sets (1)

  • MR: Movie reviews with one sentence per review. Classification involves detecting positive/negative reviews (Pang and Lee, 2005). url: https://www.cs.cornell.edu/people/pabo/movie-review-data/

  • SST-1: Stanford Sentiment Treebank—an extension of MR but with train/dev/test splits provided and fine-grained labels (very positive, positive, neutral, negative, very negative), re-labeled by Socher et al. (2013). url: https://nlp.stanford.edu/sentiment/

  • SST-2: Same as SST-1 but with neutral reviews removed and binary labels.

  • Subj: Subjectivity dataset where the task is to classify a sentence as being subjective or objective (Pang and Lee, 2004).

Data sets (2)

Datasets’ statistics

CNN variations

Similar words

Results

CNN with Keras in Python
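The original notebook for this slide is not reproduced here; below is a minimal sketch of a Kim-style CNN for sentence classification in Keras (vocabulary size, sequence length, embedding dimension and filter settings are made-up placeholders):

```python
from tensorflow.keras import layers, models

vocab_size, seq_len, embed_dim = 10_000, 50, 100  # hypothetical values

model = models.Sequential([
    layers.Input(shape=(seq_len,)),            # word-index input
    layers.Embedding(vocab_size, embed_dim),   # word embeddings
    layers.Conv1D(100, 3, activation="relu"),  # trigram-like filters
    layers.GlobalMaxPooling1D(),               # max-over-time pooling
    layers.Dense(100, activation="relu"),      # fully connected layer
    layers.Dense(2, activation="softmax"),     # e.g. positive/negative (MR, SST-2)
])
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```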

Contextual Word Embeddings & Transformers

Contextual Word Embeddings

Transformers

The Transformer Encoder-Decoder

Transformers

BERT: Bidirectional Encoder Representations from Transformers

What kinds of things does pretraining learn?

There’s increasing evidence that pretrained models learn a wide variety of things about the statistical properties of language:

Talk to Transformer: https://app.inferkit.com/demo

  • Utrecht University is located in …

Transformers

What kinds of things does pretraining learn?

There’s increasing evidence that pretrained models learn a wide variety of things about the statistical properties of language:

  • Basic arithmetic: I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, …

Transformers

What kinds of things does pretraining learn?

There’s increasing evidence that pretrained models learn a wide variety of things about the statistical properties of language:

  • Reasoning: Garry went into the kitchen to make some tea. Standing next to Garry, Carrie pondered her destiny. Carrie left the …

Transformers

ChatGPT (5-min exercise)

  • Go to https://chat.openai.com/ and login

  • How many parameters does the GPT-3 model have?

  • How many parameters does the GPT-4 model have?

  • What is the next generation NLP?

  • Build a neural network model with an LSTM layer of 100 units in Keras. As before, the first layer should be an embedding layer, then the LSTM layer, a Dense layer, and the output Dense layer for the 5 news categories. Compile the model and print its summary.

  • Can you rewrite it using the Keras functional API? (A possible solution is sketched below.)
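A possible answer for the model-building part of the exercise, first with the Sequential API and then with the functional API (vocabulary size and sequence length are hypothetical placeholders; the 5 news categories come from the exercise):

```python
from tensorflow.keras import layers, models, Model

vocab_size, seq_len = 10_000, 100  # hypothetical placeholders

# Sequential version
model = models.Sequential([
    layers.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, 100),      # embedding layer
    layers.LSTM(100),                       # LSTM layer of 100 units
    layers.Dense(100, activation="relu"),   # Dense layer
    layers.Dense(5, activation="softmax"),  # output layer: 5 news categories
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

# The same network with the Keras functional API
inputs = layers.Input(shape=(seq_len,))
h = layers.Embedding(vocab_size, 100)(inputs)
h = layers.LSTM(100)(h)
h = layers.Dense(100, activation="relu")(h)
outputs = layers.Dense(5, activation="softmax")(h)

functional_model = Model(inputs, outputs)
functional_model.compile(optimizer="adam",
                         loss="sparse_categorical_crossentropy",
                         metrics=["accuracy"])
functional_model.summary()
```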

Summary

  • Convolutional Neural Networks

  • Transformers

    • “Small” models like BERT have become general tools in a wide range of settings
    • GPT-3 has 175 billion parameters
  • These models are still not well-understood

Practical 7