TensorFlow and Keras

In this practical, we will show an example of loading pre-trained word vectors and fine-tuning them for sentiment classification of movie reviews. First, we need to install the TensorFlow and Keras packages for R.

The TensorFlow package provides code completion and inline help for the TensorFlow API when running within the RStudio IDE. The TensorFlow API is composed of a set of Python modules that enable constructing and executing TensorFlow graphs.

For a complete installation guide for TensorFlow, see https://tensorflow.rstudio.com/installation/.

To get started, we need to use the devtools package from CRAN. If you do not have it, install it first:

# This line is commented because I already have the package installed!
# install.packages("devtools")

The aim of devtools is to make package development easier by providing R functions that simplify and expedite common tasks.

Then, install the tensorflow R package from GitHub as follows:

# devtools::install_github("rstudio/tensorflow")

Then, use the install_tensorflow() function to install TensorFlow:

# tensorflow::install_tensorflow()

Called without any arguments, install_tensorflow() installs the latest release of TensorFlow. If you need a particular release, you can pass it via the version argument.
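
For example, a minimal sketch of pinning a specific release (the version string below is only illustrative):

# tensorflow::install_tensorflow(version = "2.9.1")  # pin a specific release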

Finally, you can confirm that the installation succeeded with:

library(tensorflow) 
tmr <- tf$constant("Text Mining with R!")
## Loaded Tensorflow version 2.9.1
print(tmr)
## tf.Tensor(b'Text Mining with R!', shape=(), dtype=string)

This will provide you with a default installation of TensorFlow suitable for getting started with the tensorflow R package. See the article on installation (https://tensorflow.rstudio.com/installation/) to learn about more advanced options, including installing a version of TensorFlow that takes advantage of Nvidia GPUs if you have the correct CUDA libraries installed.

To install the Keras package, first run either of the following lines:

# install.packages("keras")
# devtools::install_github("rstudio/keras")

Restart RStudio, then use the line below to install Keras:

# keras::install_keras()

The Keras R interface uses the TensorFlow backend engine by default. This will provide you with default CPU-based installations of Keras and TensorFlow. If you want a more customized installation, e.g. if you want to take advantage of NVIDIA GPUs, see the documentation for install_keras() and the article on installation (https://tensorflow.rstudio.com/installation/).
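
As a rough sketch (argument support has varied across versions of the keras R package, so check ?install_keras before running this), a GPU-enabled installation has historically been requested like this:

# keras::install_keras(tensorflow = "gpu")  # older keras R package versions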

Sentiment classification with pre-trained word vectors

Now we have TensorFlow and Keras ready for fine-tuning pre-trained word embeddings for sentiment classification of movie reviews.

Remember to load the following libraries:

library(keras)
library(tidyverse)
## Warning: replacing previous import 'lifecycle::last_warnings' by
## 'rlang::last_warnings' when loading 'pillar'
## Warning: replacing previous import 'lifecycle::last_warnings' by
## 'rlang::last_warnings' when loading 'tibble'
## Warning: replacing previous import 'lifecycle::last_warnings' by
## 'rlang::last_warnings' when loading 'hms'
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.2     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(text2vec)
## 
## Attaching package: 'text2vec'
## The following objects are masked from 'package:keras':
## 
##     fit, normalize

  1. For this purpose, we want to use GloVe pre-trained word vectors. In the data folder you will find these word vectors, which were trained on Wikipedia 2014 and Gigaword 5 (6B tokens, 400K vocabulary, uncased; 50d, 100d, 200d, and 300d vectors). You can also download them manually or use the code below.

# Download Glove vectors if necessary and save them in your data folder
# if (!file.exists('data/glove.6B.zip')) {
#   download.file('https://nlp.stanford.edu/data/glove.6B.zip', destfile = 'data/glove.6B.zip')
#   unzip("data/glove.6B.zip", exdir = "data")
# }

  2. Load the pre-trained word vectors from the file ‘glove.6B.300d.txt’ (if you have memory issues, load the file ‘glove.6B.50d.txt’ instead).

# load glove vectors
vectors <- data.table::fread('data/glove.6B.300d.txt', data.table = F,  encoding = 'UTF-8')
colnames(vectors) <- c('word',paste('dim',1:300,sep = '_'))
# vectors to dataframe
head(as_tibble(vectors))
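
As an optional sanity check (a minimal sketch; the words picked here are arbitrary), you can verify that the loaded vectors behave sensibly by computing pairwise cosine similarities with text2vec:

# optional sanity check: related words should have high cosine similarity
idx <- match(c("king", "queen", "movie"), vectors$word)
emb <- as.matrix(vectors[idx, -1])
rownames(emb) <- vectors$word[idx]
text2vec::sim2(emb)  # 3 x 3 matrix of pairwise cosine similarities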

  3. The IMDB movie review data set is a labeled data set available in the text2vec package. It consists of 5000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of the reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews. Load this data set and convert it to a dataframe.

# load an example dataset from text2vec
data("movie_review")
head(as_tibble(movie_review))
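
A quick optional check of the class balance shows how the binary labels are distributed:

# optional check: distribution of the binary sentiment labels
table(movie_review$sentiment)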

  4. Define the parameters of your Keras model with a maximum of 10000 words, a maxlen of 60, and a word embedding size of 300 (if you have memory problems, change the embedding dimension to 50).

max_words <- 1e4
maxlen    <- 60
dim_size  <- 300

  5. Use the text_tokenizer() function from Keras to tokenize the IMDB review data, using a maximum of 10000 words.

# tokenize the input data and then fit the created object
word_seqs <- text_tokenizer(num_words = max_words) %>%
  fit_text_tokenizer(movie_review$review)

  6. Transform each text in movie_review$review into a sequence of integer word indices, then pad the sequences.

# apply tokenizer to the text and get indices instead of words
# later pad the sequence
x_train <- texts_to_sequences(word_seqs, movie_review$review) %>%
  pad_sequences(maxlen = maxlen)
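
It is worth verifying the shape of the result; with the settings above we expect one row per review and maxlen columns:

# check the dimensions of the padded sequence matrix
dim(x_train)  # expected: 5000 60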

  7. Convert the word indices into a dataframe.

# unlist word indices
word_indices <- unlist(word_seqs$word_index)

# then place them into data.frame 
dic <- data.frame(word = names(word_indices), key = word_indices, stringsAsFactors = FALSE) %>%
  arrange(key) %>% .[1:max_words,]
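
To see what this produced, you can optionally inspect the most frequent tokens and their integer keys:

# optional: inspect the most frequent tokens and their indices
head(dic)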

  8. Join the dataframe of indices of words from the IMDB reviews with the GloVe pre-trained word vectors.

# join the words with GloVe vectors and
# if word does not exist in GloVe, then fill NA's with 0
word_embeds <- dic  %>% left_join(vectors) %>% .[,3:302] %>% replace(., is.na(.), 0) %>% as.matrix()
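
Optionally, you can check what share of the 10000 tokenizer words actually received a GloVe vector (rows left entirely at zero had no match); this check is a sketch, not part of the original practical:

# optional check: share of vocabulary words with a pre-trained vector
mean(rowSums(word_embeds != 0) > 0)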

  9. Extract the outcome variable from the sentiment column in the original dataframe and name it y_train.

# extract the output
y_train <- as.matrix(movie_review$sentiment)

  10. Use the Keras functional API to create a neural network model as below. Can you describe this model? (A parameter-count breakdown follows the summary output below.)

# Use Keras Functional API 
input <- layer_input(shape = list(maxlen), name = "input")

model <- input %>%
  layer_embedding(input_dim = max_words, output_dim = dim_size, input_length = maxlen,
                  weights = list(word_embeds), trainable = FALSE) %>%
  layer_lstm(units = 80, return_sequences = TRUE)

output <- model                 %>%
  layer_global_max_pooling_1d() %>%
  layer_dense(units = 1, activation = "sigmoid")

model <- keras_model(input, output)

summary(model)
## Model: "model"
## ________________________________________________________________________________
##  Layer (type)                  Output Shape               Param #    Trainable  
## ================================================================================
##  input (InputLayer)            [(None, 60)]               0          Y          
##  embedding (Embedding)         (None, 60, 300)            3000000    N          
##  lstm (LSTM)                   (None, 60, 80)             121920     Y          
##  global_max_pooling1d (GlobalM  (None, 80)                0          Y          
##  axPooling1D)                                                                   
##  dense (Dense)                 (None, 1)                  81         Y          
## ================================================================================
## Total params: 3,122,001
## Trainable params: 122,001
## Non-trainable params: 3,000,000
## ________________________________________________________________________________
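
To answer the question above: the embedding layer stores one 300-dimensional vector for each of the 10,000 words, giving 10,000 x 300 = 3,000,000 parameters, all frozen (trainable = FALSE) because they come from GloVe. The LSTM layer has 4 x 80 x (300 + 80 + 1) = 121,920 trainable parameters (four gates, each with input weights, recurrent weights, and a bias). Global max pooling adds no parameters, and the final dense layer has 80 weights + 1 bias = 81. Together these match the summary: 3,122,001 parameters in total, of which 122,001 are trainable.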

  11. Compile the model with the ‘adam’ optimizer and the binary_crossentropy loss. You can choose accuracy or AUC as the metric.

# instead of accuracy we can use "AUC" metrics from "tensorflow.keras"
model %>% compile(
  optimizer = "adam", 
  loss = "binary_crossentropy",
  metrics = tensorflow::tf$keras$metrics$AUC() # metrics = c('accuracy')
)

  12. Fit the model with 5 or 10 epochs (iterations), batch_size = 32, and validation_split = 0.2. Check the training performance versus the validation performance.

history <- model %>% keras::fit(
  x_train, y_train,
  epochs = 10,
  batch_size = 32,
  validation_split = 0.2
)
plot(history)
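
Once trained, the model can score unseen text. A minimal sketch (the review below is a made-up example) that reuses the fitted tokenizer and the same padding settings:

# score a new (hypothetical) review with the trained model
new_text <- "This movie was a delight from start to finish!"
new_seq  <- texts_to_sequences(word_seqs, new_text) %>%
  pad_sequences(maxlen = maxlen)
predict(model, new_seq)  # predicted probability of positive sentiment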


Summary


In this practical, we learned about:

- installing the TensorFlow and Keras packages for R
- loading pre-trained GloVe word vectors
- tokenizing and padding the IMDB movie reviews with Keras
- building an LSTM sentiment classifier on top of pre-trained embeddings with the Keras functional API
- compiling and fitting the model, and comparing training versus validation performance


End of Practical