In this practical, we are going to apply word embedding approaches. These are the packages used:
library(text2vec)
library(tidyverse)
library(tidytext)
A key idea in working with text data is representing words as numeric quantities. There are a number of ways to go about this, some of which we have already used. The additional method we explore today is word embeddings. Word embedding techniques such as word2vec and GloVe learn dense word vectors: word2vec trains a shallow neural network to predict context words, while GloVe fits word vectors to global co-occurrence statistics. With these vector representations of words, we can measure how similar words are to each other, and even do arithmetic with their meanings. Here are two famous examples:
king - man + woman = queen
Paris - France + Germany = Berlin
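These analogies work because similarity between word vectors is measured geometrically, usually with cosine similarity: the cosine of the angle between two vectors, close to 1 for vectors pointing in the same direction and near 0 for unrelated ones. Here is a toy illustration with made-up 3-dimensional vectors (the values are invented for this example; they are not real embeddings):
# made-up 3-dimensional "word vectors", purely for illustration
king  <- c(0.9, 0.8, 0.1)
queen <- c(0.8, 0.9, 0.2)
apple <- c(0.1, 0.2, 0.9)
# cosine similarity: dot product divided by the product of the norms
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(king, queen) # ~0.99: similar direction, "similar meaning"
cosine(king, apple) # ~0.30: different direction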
The first part of the practical leverages the data provided in the harrypotter package, which supplies the full text of the seven novels in the Harry Potter series. You need to install the package from GitHub using this line of code:
devtools::install_github("bradleyboehmke/harrypotter")
This requires the devtools package, which you can install following https://www.r-project.org/nosvn/pandoc/devtools.html. The aim of devtools is to make your life easier by providing R functions that simplify many common development tasks.
Note that there is a different harrypotter package on CRAN. That is the package you would get if you installed with the regular install.packages() function, but it is not the one we want.
You can then load the harrypotter package with the following:
#devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter)
hp_books <- c("philosophers_stone", "chamber_of_secrets",
              "prisoner_of_azkaban", "goblet_of_fire",
              "order_of_the_phoenix", "half_blood_prince",
              "deathly_hallows")
hp_words <- list(
  philosophers_stone,
  chamber_of_secrets,
  prisoner_of_azkaban,
  goblet_of_fire,
  order_of_the_phoenix,
  half_blood_prince,
  deathly_hallows
) %>%
  # name each list element after its book
  set_names(hp_books) %>%
  # convert each book to a data frame and merge into a single data frame
  map_df(as_tibble, .id = "book") %>%
  # convert book to a factor, ordered by publication
  mutate(book = factor(book, levels = hp_books)) %>%
  # remove empty chapters
  filter(!is.na(value)) %>%
  # create a chapter id column (each row of a book is one chapter)
  group_by(book) %>%
  mutate(chapter = row_number(book))
head(hp_words)
# tokenize the data frame
hp_words <- as.data.frame(hp_words) %>%
  unnest_tokens(word, value)
head(hp_words)
hp_words <- hp_words %>%
  anti_join(stop_words) # remove stop words (joins on the "word" column)
head(hp_words)
hp_words_ls <- list(hp_words$word) # create list of tokenized text to pass to itoken
it <- itoken(hp_words_ls, progressbar = FALSE) # create index-tokens
hp_vocab <- create_vocabulary(it) # create vocabulary of unique terms
hp_vocab <- prune_vocabulary(hp_vocab, term_count_min = 5) # remove the words with frequency less than 5
hp_vocab
# maps words to indices
vectorizer <- vocab_vectorizer(hp_vocab)
# use window of 5 for context words
hp_tcm <- create_tcm(it, vectorizer, skip_grams_window = 5)
dim(hp_tcm) # inspect dimensions of the TCM
## [1] 8628 8628
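Each entry of the TCM records how often two terms occur near each other within the 5-word window (by default, text2vec down-weights co-occurrences by their distance within the window). As a quick sanity check, you can look up a pair of terms by name; a small illustrative snippet (the TCM is stored triangularly, so we add both orders of the pair):
# weighted co-occurrence of "harry" and "ron" within the 5-word window;
# the TCM is triangular, so the count may sit in either order of the pair
hp_tcm["harry", "ron"] + hp_tcm["ron", "harry"]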
Note that you will likely get (somewhat) different results than the ones shown in this practical. This is because the training of GloVe relies on random processes (random initialization and parallel stochastic gradient descent). If you want to obtain completely reproducible results across different runs, you have to set the number of threads to 1 and set a random seed (see the comments in the relevant code chunk below). See https://github.com/dselivanov/text2vec/issues/251 for more explanation.
#for reproducible results, do:
#RcppParallel::setThreadOptions(1)
#set.seed(123)
glove <- GlobalVectors$new(rank = 50, x_max = 10) # create the model
hp_wv_main <- glove$fit_transform(hp_tcm, n_iter = 20, convergence_tol = 0.001) # fit the model to the TCM
## INFO [11:44:59.300] epoch 1, loss 0.1207
## INFO [11:44:59.538] epoch 2, loss 0.0785
## INFO [11:44:59.751] epoch 3, loss 0.0681
## INFO [11:44:59.945] epoch 4, loss 0.0619
## INFO [11:45:00.193] epoch 5, loss 0.0576
## INFO [11:45:00.431] epoch 6, loss 0.0543
## INFO [11:45:00.629] epoch 7, loss 0.0518
## INFO [11:45:00.836] epoch 8, loss 0.0498
## INFO [11:45:01.041] epoch 9, loss 0.0481
## INFO [11:45:01.253] epoch 10, loss 0.0467
## INFO [11:45:01.463] epoch 11, loss 0.0456
## INFO [11:45:01.672] epoch 12, loss 0.0445
## INFO [11:45:01.889] epoch 13, loss 0.0436
## INFO [11:45:02.091] epoch 14, loss 0.0429
## INFO [11:45:02.325] epoch 15, loss 0.0422
## INFO [11:45:02.538] epoch 16, loss 0.0415
## INFO [11:45:02.750] epoch 17, loss 0.0410
## INFO [11:45:02.965] epoch 18, loss 0.0405
## INFO [11:45:03.178] epoch 19, loss 0.0400
## INFO [11:45:03.392] epoch 20, loss 0.0396
dim(hp_wv_main) # check the dimensions
## [1] 8628 50
hp_wv_context <- glove$components # extract components
dim(hp_wv_context)
## [1] 50 8628
# Either word-vectors matrices could work, but the developers of the technique suggest the sum/mean may work better
hp_word_vectors <- hp_wv_main + t(hp_wv_context) # use the transpose of hp_wv_context to sum
dim(hp_word_vectors)
## [1] 8628 50
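The code above sums the two matrices; if you prefer the average, the equivalent is a one-line variation on the chunk above. Either way the cosine similarities below come out identical, since cosine similarity is unaffected by scaling a vector by a positive constant:
# average instead of sum; dividing every vector by 2 does not change
# the angles between them, so cosine similarities are unchanged
hp_word_vectors_avg <- (hp_wv_main + t(hp_wv_context)) / 2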
harry <- hp_word_vectors["harry", , drop = FALSE] # extract the word vector (row) for "harry"
cos_sim_harry <- sim2(x = hp_word_vectors, y = harry, method = "cosine", norm = "l2") # cosine similarity between "harry" and every word
head(sort(cos_sim_harry[,1], decreasing = TRUE), 10) # see the top 10 most similar words
## harry ron hermione moment time looked left happened
## 1.0000000 0.8854192 0.8784956 0.8305743 0.8244821 0.7876678 0.7752234 0.7434338
## quickly told
## 0.7352562 0.7095795
death <- hp_word_vectors["death", , drop = FALSE]
cos_sim_death <- sim2(x = hp_word_vectors, y = death, method = "cosine", norm = "l2")
head(sort(cos_sim_death[,1], decreasing = TRUE), 10)
## death eaters eater voldemort escaped managed killed fight
## 1.0000000 0.9396884 0.8842255 0.6746799 0.6158328 0.5658215 0.5457962 0.5436722
## sirius boy
## 0.5394298 0.5356705
love <- hp_word_vectors["love", , drop = FALSE]
cos_sim_love <- sim2(x = hp_word_vectors, y = love, method = "cosine", norm = "l2")
head(sort(cos_sim_love[,1], decreasing = TRUE), 10)
## love potion easy polyjuice truth forget bring dying
## 1.0000000 0.6076891 0.5821664 0.5168813 0.5108280 0.5060170 0.5046486 0.4884529
## finished telling
## 0.4832778 0.4796893
test <- hp_word_vectors["harry", , drop = FALSE] -
  hp_word_vectors["death", , drop = FALSE] +
  hp_word_vectors["love", , drop = FALSE]
cos_sim_test <- sim2(x = hp_word_vectors, y = test, method = "cosine", norm = "l2")
head(sort(cos_sim_test[,1], decreasing = TRUE), 10)
## hermione ron harry yeah finished told potion love
## 0.7214882 0.7040296 0.6585081 0.5949982 0.5672823 0.5526488 0.5464267 0.5296126
## thinking looked
## 0.5210628 0.5159595
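Notice that the input words themselves ("harry", "love") still rank high, which is common with analogy queries. A frequently used trick is to exclude the input words before ranking the candidates; a minimal sketch:
# rank the analogy candidates with the input words excluded
sims <- sort(cos_sim_test[, 1], decreasing = TRUE)
head(sims[!names(sims) %in% c("harry", "death", "love")], 10)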
The second part of the practical uses a larger corpus: the text8 data, the first 100 MB of a cleaned English Wikipedia dump. The following code downloads and unzips it into the data folder if it is not already there:
text8_file <- "data/text8"
if (!file.exists(text8_file)) {
  download.file("http://mattmahoney.net/dc/text8.zip", "data/text8.zip")
  unzip("data/text8.zip", files = "text8", exdir = "data")
}
wiki <- readLines(text8_file, n = 1, warn = FALSE) # the whole corpus is a single line of text
# same pipeline as for the Harry Potter data: tokenize, index,
# build a pruned vocabulary, and create the co-occurrence matrix
tokens <- space_tokenizer(wiki)
it <- itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5L)
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
#for reproducible results, do:
#RcppParallel::setThreadOptions(1)
#set.seed(123)
glove <- GlobalVectors$new(rank = 50, x_max = 10)
wv_main <- glove$fit_transform(tcm, n_iter = 20, convergence_tol = 0.001)
## INFO [11:45:46.459] epoch 1, loss 0.1746
## INFO [11:45:49.331] epoch 2, loss 0.1224
## INFO [11:45:52.204] epoch 3, loss 0.1082
## INFO [11:45:55.073] epoch 4, loss 0.1003
## INFO [11:45:57.964] epoch 5, loss 0.0952
## INFO [11:46:00.835] epoch 6, loss 0.0916
## INFO [11:46:03.775] epoch 7, loss 0.0889
## INFO [11:46:06.648] epoch 8, loss 0.0867
## INFO [11:46:09.532] epoch 9, loss 0.0850
## INFO [11:46:12.417] epoch 10, loss 0.0835
## INFO [11:46:15.276] epoch 11, loss 0.0823
## INFO [11:46:18.143] epoch 12, loss 0.0812
## INFO [11:46:20.994] epoch 13, loss 0.0803
## INFO [11:46:23.825] epoch 14, loss 0.0795
## INFO [11:46:26.676] epoch 15, loss 0.0788
## INFO [11:46:29.472] epoch 16, loss 0.0782
## INFO [11:46:32.246] epoch 17, loss 0.0776
## INFO [11:46:35.042] epoch 18, loss 0.0771
## INFO [11:46:37.812] epoch 19, loss 0.0766
## INFO [11:46:40.821] epoch 20, loss 0.0762
wv_context <- glove$components
word_vectors <- wv_main + t(wv_context)
We can now revisit the two famous examples:
king - man + woman = queen
Paris - France + Germany = Berlin
Again, note that you will likely get (somewhat) different results than the ones shown in this practical.
queen <- word_vectors["king", , drop = FALSE] -
  word_vectors["man", , drop = FALSE] +
  word_vectors["woman", , drop = FALSE]
queen_cos_sim <- sim2(x = word_vectors, y = queen, method = "cosine", norm = "l2")
head(sort(queen_cos_sim[,1], decreasing = TRUE), 50)
## king son henry alexander vii david
## 0.8868562 0.7432301 0.6873943 0.6550140 0.6545870 0.6531328
## father queen eldest charles iii philip
## 0.6332052 0.6329943 0.6315020 0.6284968 0.6231105 0.6221137
## whom kings viii portugal brother deposed
## 0.6215800 0.6150539 0.6141876 0.6089481 0.6078201 0.6069310
## emperor daughter prince wife married george
## 0.6023002 0.5993665 0.5965415 0.5959660 0.5955107 0.5934687
## scotland crowned judah james harrison charlemagne
## 0.5933798 0.5901516 0.5895918 0.5887537 0.5851587 0.5843726
## england vi reign xiv edward throne
## 0.5837143 0.5829379 0.5798846 0.5793959 0.5769802 0.5719025
## louis jr hungary bohemia austria met
## 0.5718839 0.5709790 0.5705086 0.5679182 0.5672911 0.5610626
## saul ii franks pope kingdom born
## 0.5605357 0.5593670 0.5591232 0.5543771 0.5505122 0.5467646
## succeeded fled
## 0.5458880 0.5458224
#check the ranking of the word queen among all the words using:
#match('queen', names(sort(queen_cos_sim[,1], decreasing = TRUE)))
berlin <- word_vectors["paris", , drop = FALSE] -
  word_vectors["france", , drop = FALSE] +
  word_vectors["germany", , drop = FALSE]
berlin_cos_sim <- sim2(x = word_vectors, y = berlin, method = "cosine", norm = "l2")
head(sort(berlin_cos_sim[,1], decreasing = TRUE), 5)
## paris near berlin munich at
## 0.7122870 0.6979520 0.6781777 0.6660660 0.6520105
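Here berlin comes third, behind the input word paris and the word near. As with the queen example, you can check its exact rank with match(), and also rank the candidates with the input words excluded; a small sketch reusing the idiom from above:
# rank of "berlin" among all candidates
match('berlin', names(sort(berlin_cos_sim[,1], decreasing = TRUE)))
# ranking with the input words excluded
berlin_sims <- sort(berlin_cos_sim[,1], decreasing = TRUE)
head(berlin_sims[!names(berlin_sims) %in% c("paris", "france", "germany")], 5)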
In this practical, we learned about:
- representing words as numeric vectors (word embeddings);
- building a vocabulary and a term co-occurrence matrix with text2vec;
- training a GloVe model and combining its main and context word vectors;
- using cosine similarity to find similar words and to solve word analogies.
End of practical