Recap: RNN in Python

Lecture plan

  1. Convolutional Neural Networks
  2. Transformers
  3. BERT

Convolutional Neural Network (CNN)

  • Intuition: Neural network with specialized connectivity structure
    • Stacking multiple layers of feature extractors: low-level layers extract local features, and high-level layers learn global patterns.
  • There are a few distinct types of layers:
    • Convolution Layer: detecting local features through filters (discrete convolution)
    • Pooling Layer: merging similar features

Convolution layer

  • The core layer of CNNs
  • Convolutional layer consists of a set of filters
  • Each filter covers a spatially small portion of the input data
  • Each filter is convolved across the dimensions of the input data, producing a multidimensional feature map.
  • As we convolve the filter, we compute the dot product between the parameters of the filter and the input (see the sketch below).
  • Deep learning algorithm: during training, the network corrects its errors and the filters are learned, e.g., in Keras, by adjusting the weights with Stochastic Gradient Descent (SGD).
  • The key architectural characteristics of the convolutional layer are local connectivity and shared weights.
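As a concrete illustration of the dot-product computation above, here is a minimal NumPy sketch of a 1-D discrete convolution without padding (the input and filter values are made up for the example):

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])  # input signal (hypothetical values)
w = np.array([1., 0., -1.])         # one 1-D filter of size 3

# Slide the filter across the input; each output value is the dot
# product between the filter parameters and the current input window.
out = np.array([w @ x[i:i + len(w)] for i in range(len(x) - len(w) + 1)])
print(out)  # [-2. -2. -2.]: a feature map of length 5 - 3 + 1 = 3
```

(Strictly speaking, the "convolution" used in deep learning is cross-correlation: the filter is not flipped.)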

Convolution without padding

Convolution with padding
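A minimal Keras sketch of both cases (the layer sizes are arbitrary): padding="valid" lets the feature map shrink, while padding="same" zero-pads the input so the output keeps the input length.

```python
import numpy as np
from tensorflow.keras import layers

x = np.random.rand(1, 10, 1)  # batch of 1 sequence, length 10, 1 channel

no_pad = layers.Conv1D(filters=4, kernel_size=3, padding="valid")(x)
padded = layers.Conv1D(filters=4, kernel_size=3, padding="same")(x)

print(no_pad.shape)  # (1, 8, 4): length shrinks to 10 - 3 + 1 = 8
print(padded.shape)  # (1, 10, 4): zero-padding keeps length 10
```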

Pooling layer

  • Intuition: progressively reduce the spatial size of the representation in order to reduce the number of parameters and the amount of computation in the network, and hence also to control overfitting
  • Pooling partitions the input image (or document) into a set of non-overlapping rectangles (n-grams, for text) and, for each such sub-region, outputs the maximum value of the features in that region.

Pooling (downsampling)
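A minimal NumPy sketch of max pooling over non-overlapping windows (the feature values are made up):

```python
import numpy as np

feature_map = np.array([1., 5., 2., 8., 3., 3.])  # hypothetical features

# Partition into non-overlapping windows of size 2 and keep each maximum.
pooled = feature_map.reshape(-1, 2).max(axis=1)
print(pooled)  # [5. 8. 3.]: the representation is halved
```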

Convolutional neural network

For processing data with a grid-like or array topology (input shapes are sketched after the list):

  • 1-D convolution: text data, sequence data, time-series data, sensor signal data

  • 2-D convolution: image data

  • 3-D convolution: video data
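In Keras these three cases correspond to the Conv1D, Conv2D and Conv3D layers; a sketch of the expected input shapes (all sizes below are arbitrary):

```python
import numpy as np
from tensorflow.keras import layers

# 1-D: (batch, steps, channels), e.g. a sequence of 50-dim word embeddings
print(layers.Conv1D(8, 3)(np.random.rand(1, 20, 50)).shape)         # (1, 18, 8)

# 2-D: (batch, height, width, channels), e.g. a 28x28 RGB image
print(layers.Conv2D(8, 3)(np.random.rand(1, 28, 28, 3)).shape)      # (1, 26, 26, 8)

# 3-D: (batch, frames, height, width, channels), e.g. a short video clip
print(layers.Conv3D(8, 3)(np.random.rand(1, 10, 28, 28, 3)).shape)  # (1, 8, 26, 26, 8)
```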

Other layers

  • The convolution and pooling layers are typically used as a set. Multiple sets of these layers can appear in a CNN design.
  • After a few sets, the output is typically sent to one or two fully connected layers.
    • A fully connected layer is an ordinary neural network layer, as in other neural networks.
    • A typical activation function is the sigmoid function.
    • The output is typically a class (classification) or a real number (regression).

Other layers

  • The final layer of a CNN is determined by the research task.
  • Classification: Softmax Layer \[P(y=j \mid \boldsymbol{x}) = \frac{e^{\boldsymbol{w}_j \cdot \boldsymbol{x}}}{\sum_{k=1}^{K} e^{\boldsymbol{w}_k \cdot \boldsymbol{x}}}\]
    • The outputs are the probabilities of belonging to each class (see the worked example after this list).
  • Regression: Linear Layer \[f(\boldsymbol{x}) = \boldsymbol{w} \cdot \boldsymbol{x}\]
    • The output is a real number.
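The softmax formula can be checked numerically; a small sketch with made-up weights for K = 3 classes:

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])      # input features (hypothetical)
W = np.array([[0.1, 0.4, -0.2],     # one weight vector w_j per class
              [-0.3, 0.2, 0.5],
              [0.6, -0.1, 0.3]])

scores = W @ x                      # w_j . x for each class j
probs = np.exp(scores) / np.exp(scores).sum()
print(probs)        # the probabilities of belonging to each class
print(probs.sum())  # 1.0
```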

What hyperparameters do we have in a CNN model?

CNN for Text

CNN

Main CNN idea for text:

Compute vectors for n-grams and group them afterwards



Example: for the sentence “Utrecht summer school is in Utrecht”, compute vectors for:

Utrecht summer, summer school, school is, is in, in Utrecht, Utrecht summer school, summer school is, school is in, is in Utrecht, Utrecht summer school is, summer school is in, school is in Utrecht, Utrecht summer school is in, summer school is in Utrecht, Utrecht summer school is in Utrecht
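A short sketch of how those fifteen n-grams can be enumerated in plain Python:

```python
sentence = "Utrecht summer school is in Utrecht".split()

# All n-grams from bigrams up to the full six-token sentence.
ngrams = [" ".join(sentence[i:i + n])
          for n in range(2, len(sentence) + 1)
          for i in range(len(sentence) - n + 1)]
print(len(ngrams))  # 15
print(ngrams)
```

In a CNN for text, a convolutional filter of width n acts as an n-gram detector, so these n-gram vectors do not have to be enumerated explicitly.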

CNNs for sentence classification

Data sets (1)

  • MR: Movie reviews with one sentence per review. Classification involves detecting positive/negative reviews (Pang and Lee, 2005). url: https://www.cs.cornell.edu/people/pabo/movie-review-data/

  • SST-1: Stanford Sentiment Treebank—an extension of MR but with train/dev/test splits provided and fine-grained labels (very positive, positive, neutral, negative, very negative), re-labeled by Socher et al. (2013). url: https://nlp.stanford.edu/sentiment/

  • SST-2: Same as SST-1 but with neutral reviews removed and binary labels.

  • Subj: Subjectivity dataset where the task is to classify a sentence as being subjective or objective (Pang and Lee, 2004).

Data sets (2)

Datasets’ statistics

CNN variations

Similar words

Results

CNN with Keras in Python
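The original notebook for this slide is not reproduced here; below is a minimal sketch of a Kim-style CNN for sentence classification in Keras (vocabulary size, sequence length, embedding dimension and filter settings are made-up placeholders):

```python
from tensorflow.keras import layers, models

vocab_size, seq_len, embed_dim = 10_000, 50, 100  # hypothetical values

model = models.Sequential([
    layers.Input(shape=(seq_len,)),            # word-index input
    layers.Embedding(vocab_size, embed_dim),   # word embeddings
    layers.Conv1D(100, 3, activation="relu"),  # trigram-like filters
    layers.GlobalMaxPooling1D(),               # max-over-time pooling
    layers.Dense(100, activation="relu"),      # fully connected layer
    layers.Dense(2, activation="softmax"),     # e.g. positive/negative (MR, SST-2)
])
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```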

Contextual Word Embeddings & Transformers

Contextual Word Embeddings

Transformers

The Transformer Encoder-Decoder

Transformers

BERT: Bidirectional Encoder Representations from Transformers

What kinds of things does pretraining learn?

There’s increasing evidence that pretrained models learn a wide variety of things about the statistical properties of language:

Talk to Transformer: https://app.inferkit.com/demo

  • Utrecht University is located in …

Transformers

What kinds of things does pretraining learn?

There’s increasing evidence that pretrained models learn a wide variety of things about the statistical properties of language:

  • Basic arithmetic: I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, …

Transformers

What kinds of things does pretraining learn?

There’s increasing evidence that pretrained models learn a wide variety of things about the statistical properties of language:

  • Reasoning: Garry went into the kitchen to make some tea. Standing next to Garry, Carrie pondered her destiny. Carrie left the …

Transformers

ChatGPT (5-min exercise)

  • Go to https://chat.openai.com/ and login

  • How many parameters does the GPT-3 model have?

  • How many parameters does the GPT-4 model have?

  • What is the next generation NLP?

  • Build a neural network model with an LSTM layer of 100 units in Keras. As before, the first layer should be an embedding layer, then the LSTM layer, a Dense layer, and the output Dense layer for the 5 news categories. Compile the model and print its summary.

  • Can you rewrite it using the Keras functional API? (A possible solution is sketched below.)
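A possible answer for the model-building part of the exercise, first with the Sequential API and then with the functional API (vocabulary size and sequence length are hypothetical placeholders; the 5 news categories come from the exercise):

```python
from tensorflow.keras import layers, models, Model

vocab_size, seq_len = 10_000, 100  # hypothetical placeholders

# Sequential version
model = models.Sequential([
    layers.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, 100),      # embedding layer
    layers.LSTM(100),                       # LSTM layer of 100 units
    layers.Dense(100, activation="relu"),   # Dense layer
    layers.Dense(5, activation="softmax"),  # output layer: 5 news categories
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

# The same network with the Keras functional API
inputs = layers.Input(shape=(seq_len,))
h = layers.Embedding(vocab_size, 100)(inputs)
h = layers.LSTM(100)(h)
h = layers.Dense(100, activation="relu")(h)
outputs = layers.Dense(5, activation="softmax")(h)

functional_model = Model(inputs, outputs)
functional_model.compile(optimizer="adam",
                         loss="sparse_categorical_crossentropy",
                         metrics=["accuracy"])
functional_model.summary()
```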

Summary

  • Convolutional Neural Networks

  • Transformers

    • “Small” models like BERT have become general tools in a wide range of settings
    • GPT-3 has 175 billion parameters
  • These models are still not well-understood

Practical 7