In this practical, we will show an example of loading pre-trained word vectors and fine-tuning them for sentiment classification of movie reviews. First, we need to install the TensorFlow and Keras packages for R.
The TensorFlow package provides code completion and inline help for the TensorFlow API when running within the RStudio IDE. The TensorFlow API is composed of a set of Python modules that enable constructing and executing TensorFlow graphs.
For a complete installation guide for TensorFlow, see https://tensorflow.rstudio.com/installation/.
To get started, we need to use the devtools package from CRAN. If you do not have it, install it first:
# This line is commented because I already have the package installed!
# install.packages("devtools")
The aim of devtools is to make package development easier by providing R functions that simplify and expedite common tasks.
Then, install the tensorflow R package from GitHub as follows:
# devtools::install_github("rstudio/tensorflow")
Next, use the install_tensorflow() function to install the TensorFlow library itself:
# tensorflow::install_tensorflow()
Called without any arguments, as above, install_tensorflow() installs the latest release of TensorFlow.
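If you need a specific release instead, install_tensorflow() also accepts a version argument; for example (the version number below is just an illustration):
# tensorflow::install_tensorflow(version = "2.9")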
Finally, you can confirm that the installation succeeded with:
library(tensorflow)
tmr <- tf$constant("Text Mining with R!")
## Loaded Tensorflow version 2.9.1
print(tmr)
## tf.Tensor(b'Text Mining with R!', shape=(), dtype=string)
This will provide you with a default installation of TensorFlow suitable for getting started with the tensorflow R package. See the article on installation (https://tensorflow.rstudio.com/installation/) to learn about more advanced options, including installing a version of TensorFlow that takes advantage of Nvidia GPUs if you have the correct CUDA libraries installed.
To install the Keras package, first run either of the following lines:
# install.packages("keras")
# devtools::install_github("rstudio/keras")
Restart RStudio, then use the line below to install keras:
# keras::install_keras()
The Keras R interface uses the TensorFlow backend engine by default. This will provide you with default CPU-based installations of Keras and TensorFlow. If you want a more customized installation, e.g. if you want to take advantage of NVIDIA GPUs, see the documentation for install_keras() and the article on installation (https://tensorflow.rstudio.com/installation/).
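As a sketch of such a customized installation: on a machine with a suitable NVIDIA GPU and CUDA setup, a GPU-enabled build can be requested along the following lines; check the install_keras() documentation for the exact arguments supported by your version of the keras package:
# keras::install_keras(tensorflow = "gpu")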
Now we have TensorFlow and Keras ready for fine-tuning pre-trained word embeddings for sentiment classification of movie reviews.
Remember to load the following libraries:
library(keras)
library(tidyverse)
## Warning: replacing previous import 'lifecycle::last_warnings' by
## 'rlang::last_warnings' when loading 'pillar'
## Warning: replacing previous import 'lifecycle::last_warnings' by
## 'rlang::last_warnings' when loading 'tibble'
## Warning: replacing previous import 'lifecycle::last_warnings' by
## 'rlang::last_warnings' when loading 'hms'
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.2 v dplyr 1.0.6
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(text2vec)
##
## Attaching package: 'text2vec'
## The following objects are masked from 'package:keras':
##
## fit, normalize
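Note that text2vec masks the keras functions fit() and normalize(); this is why we will call keras::fit() with an explicit namespace when training the model below.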
# Download Glove vectors if necessary and save them in your data folder
# if (!file.exists('data/glove.6B.zip')) {
# download.file('https://nlp.stanford.edu/data/glove.6B.zip', destfile = 'data/glove.6B.zip')
# unzip("data/glove.6B.zip", exdir = "data")
# }
# load glove vectors
vectors <- data.table::fread('data/glove.6B.300d.txt', data.table = F, encoding = 'UTF-8')
colnames(vectors) <- c('word',paste('dim',1:300,sep = '_'))
# inspect the first rows of the word vectors as a tibble
head(as_tibble(vectors))
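As a quick sanity check of the loaded vectors, we can look up individual words and compare them with cosine similarity. The two helper functions below are our own illustration, not part of text2vec or keras:
# hypothetical helper: fetch the 300-dimensional vector for a given word
get_vec <- function(w) unlist(vectors[vectors$word == w, -1])
# cosine similarity between two word vectors
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
# related words should score noticeably higher than unrelated ones
cos_sim(get_vec("movie"), get_vec("film"))
cos_sim(get_vec("movie"), get_vec("banana"))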
# load an example dataset from text2vec
data("movie_review")
head(as_tibble(movie_review))
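Before modelling, it is worth checking the size and class balance of the data; movie_review ships with text2vec and has a binary sentiment column:
# number of reviews and distribution of the binary sentiment label
nrow(movie_review)
table(movie_review$sentiment)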
max_words <- 1e4
maxlen <- 60
dim_size <- 300
# create a text tokenizer and fit it on the review texts
word_seqs <- text_tokenizer(num_words = max_words) %>%
fit_text_tokenizer(movie_review$review)
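The fitted tokenizer stores a word-to-index mapping in which the most frequent words receive the smallest indices; a brief look confirms this:
# the most frequent words get the lowest integer indices
head(word_seqs$word_index, 5)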
Now we apply the fitted tokenizer to transform each review in movie_review$review into a sequence of integers, so that we obtain word indices instead of words; afterwards we pad the sequences to a common length.
# apply tokenizer to the text and get indices instead of words
# later pad the sequence
x_train <- texts_to_sequences(word_seqs, movie_review$review) %>%
pad_sequences(maxlen = maxlen)
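Each review is now a fixed-length integer vector, so x_train should be a matrix with one row per review and maxlen columns:
# one row per review, maxlen (60) columns
dim(x_train)
# reviews shorter than maxlen are padded with zeros at the start
x_train[1, ]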
# unlist word indices
word_indices <- unlist(word_seqs$word_index)
# then place them into data.frame
dic <- data.frame(word = names(word_indices), key = word_indices, stringsAsFactors = FALSE) %>%
arrange(key) %>% .[1:max_words,]
# join the words with GloVe vectors and
# if word does not exist in GloVe, then fill NA's with 0
word_embeds <- dic %>%
  left_join(vectors, by = "word") %>%
  .[, 3:302] %>%
  replace(., is.na(.), 0) %>%
  as.matrix()
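It is useful to verify the dimensions of the resulting embedding matrix and to count how many of our max_words tokenizer words were missing from GloVe (their rows were filled with zeros):
# should be max_words x dim_size, i.e. 10000 x 300
dim(word_embeds)
# words absent from GloVe have all-zero rows
sum(rowSums(word_embeds != 0) == 0)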
# extract the output
y_train <- as.matrix(movie_review$sentiment)
# Use Keras Functional API
input <- layer_input(shape = list(maxlen), name = "input")
model <- input %>%
layer_embedding(input_dim = max_words, output_dim = dim_size, input_length = maxlen,
weights = list(word_embeds), trainable = FALSE) %>%
layer_lstm(units = 80, return_sequences = TRUE)
output <- model %>%
layer_global_max_pooling_1d() %>%
layer_dense(units = 1, activation = "sigmoid")
model <- keras_model(input, output)
summary(model)
## Model: "model"
## ________________________________________________________________________________
## Layer (type) Output Shape Param # Trainable
## ================================================================================
## input (InputLayer) [(None, 60)] 0 Y
## embedding (Embedding) (None, 60, 300) 3000000 N
## lstm (LSTM) (None, 60, 80) 121920 Y
## global_max_pooling1d (GlobalM (None, 80) 0 Y
## axPooling1D)
## dense (Dense) (None, 1) 81 Y
## ================================================================================
## Total params: 3,122,001
## Trainable params: 122,001
## Non-trainable params: 3,000,000
## ________________________________________________________________________________
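The parameter counts in the summary can be verified by hand: an embedding layer has input_dim * output_dim weights, an LSTM layer has 4 * ((input_dim + units + 1) * units) weights, and a dense layer has one weight per input plus a bias:
# embedding: one 300-dimensional vector per word (frozen, hence non-trainable)
10000 * 300               # 3,000,000
# LSTM: four gates, each with input, recurrent and bias weights
4 * ((300 + 80 + 1) * 80) # 121,920
# dense: one weight per LSTM unit plus a bias
80 + 1                    # 81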
# instead of accuracy we can use the AUC metric from tensorflow.keras
model %>% compile(
optimizer = "adam",
loss = "binary_crossentropy",
metrics = tensorflow::tf$keras$metrics$AUC() # metrics = c('accuracy')
)
history <- model %>% keras::fit(
x_train, y_train,
epochs = 10,
batch_size = 32,
validation_split = 0.2
)
plot(history)
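Once trained, the model can score new reviews by pushing them through the same preprocessing pipeline; the two example sentences below are our own, purely for illustration:
# score new reviews: tokenize with the fitted tokenizer, pad, then predict
new_reviews <- c("A wonderful, moving film with great performances.",
                 "Dull, predictable and far too long.")
x_new <- texts_to_sequences(word_seqs, new_reviews) %>%
  pad_sequences(maxlen = maxlen)
# probabilities close to 1 indicate positive sentiment
predict(model, x_new)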
In this practical, we learned about:
- installing the TensorFlow and Keras packages for R,
- loading pre-trained GloVe word vectors,
- tokenizing and padding text with Keras,
- initializing a frozen embedding layer with pre-trained word vectors,
- training an LSTM model for sentiment classification of movie reviews.
End of Practical