In this practical, we are going to apply word embedding approaches. These are the packages used:
library(text2vec)
library(tidyverse)
library(tidytext)
A key idea in working with text data is representing words as numeric quantities. There are a number of ways to go about this, some of which we have already used. The additional method we explore today is word embeddings. Word embedding techniques such as word2vec and GloVe learn dense word vectors: word2vec trains a shallow neural network to predict context words, while GloVe fits word vectors to global co-occurrence statistics. With these vector representations of words, we can measure how similar words are to each other, and even do arithmetic with their meanings. Here are two famous examples:
king - man + woman = queen
Paris - France + Germany = Berlin
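These analogies work because similarity between word vectors is measured geometrically, usually with cosine similarity: the cosine of the angle between two vectors, close to 1 for vectors pointing in the same direction and near 0 for unrelated ones. Here is a toy illustration with made-up 3-dimensional vectors (the values are invented for this example; they are not real embeddings):
# made-up 3-dimensional "word vectors", purely for illustration
king  <- c(0.9, 0.8, 0.1)
queen <- c(0.8, 0.9, 0.2)
apple <- c(0.1, 0.2, 0.9)
# cosine similarity: dot product divided by the product of the norms
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(king, queen) # ~0.99: similar direction, "similar meaning"
cosine(king, apple) # ~0.30: different direction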
The first part of the practical leverages the data provided in the harrypotter package, which supplies the full text of the seven novels in the Harry Potter series. You need to install the package from GitHub using this line of code:
devtools::install_github("bradleyboehmke/harrypotter")
This requires the devtools package, which you can install following https://www.r-project.org/nosvn/pandoc/devtools.html. The aim of devtools is to make your life easier by providing R functions that simplify many common development tasks.
Note that there is a different harrypotter package on CRAN. That is the package you would get if you installed with the regular install.packages() function, but it is not the one we want.
You can then load the harrypotter package with the following:
#devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter)
hp_books <- c("philosophers_stone", "chamber_of_secrets",
              "prisoner_of_azkaban", "goblet_of_fire",
              "order_of_the_phoenix", "half_blood_prince",
              "deathly_hallows")
hp_words <- list(
  philosophers_stone,
  chamber_of_secrets,
  prisoner_of_azkaban,
  goblet_of_fire,
  order_of_the_phoenix,
  half_blood_prince,
  deathly_hallows
) %>%
  # name each list element after its book
  set_names(hp_books) %>%
  # convert each book to a data frame and merge into a single data frame
  map_df(as_tibble, .id = "book") %>%
  # convert book to a factor, ordered by publication
  mutate(book = factor(book, levels = hp_books)) %>%
  # remove empty chapters
  filter(!is.na(value)) %>%
  # create a chapter id column (each row of a book is one chapter)
  group_by(book) %>%
  mutate(chapter = row_number(book))
head(hp_words)
# tokenize the data frame
hp_words <- as.data.frame(hp_words) %>%
  unnest_tokens(word, value)
head(hp_words)
hp_words <- hp_words %>%
  anti_join(stop_words) # remove stop words (joins on the "word" column)
head(hp_words)
hp_words_ls <- list(hp_words$word) # create list of tokenized text to pass to itoken
it <- itoken(hp_words_ls, progressbar = FALSE) # create index-tokens
hp_vocab <- create_vocabulary(it) # create vocabulary of unique terms
hp_vocab <- prune_vocabulary(hp_vocab, term_count_min = 5) # remove the words with frequency less than 5
hp_vocab
# maps words to indices
vectorizer <- vocab_vectorizer(hp_vocab)
# use window of 5 for context words
hp_tcm <- create_tcm(it, vectorizer, skip_grams_window = 5)
dim(hp_tcm) # inspect dimensions of the TCM
## [1] 8628 8628
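Each entry of the TCM records how often two terms occur near each other within the 5-word window (by default, text2vec down-weights co-occurrences by their distance within the window). As a quick sanity check, you can look up a pair of terms by name; a small illustrative snippet (the TCM is stored triangularly, so we add both orders of the pair):
# weighted co-occurrence of "harry" and "ron" within the 5-word window;
# the TCM is triangular, so the count may sit in either order of the pair
hp_tcm["harry", "ron"] + hp_tcm["ron", "harry"]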
Note that you will likely get (somewhat) different results than the ones shown in this practical. This is because the training of GloVe relies on random processes (random initialization and parallel stochastic gradient descent). If you want to obtain completely reproducible results across different runs, you have to set the number of threads to 1 and set a random seed (see the comments in the relevant code chunk below). See https://github.com/dselivanov/text2vec/issues/251 for more explanation.
#for reproducible results, do:
#RcppParallel::setThreadOptions(1)
#set.seed(123)
glove <- GlobalVectors$new(rank = 50, x_max = 10) # create the model
hp_wv_main <- glove$fit_transform(hp_tcm, n_iter = 20, convergence_tol = 0.001) # fit the model to the TCM
## INFO [11:44:59.300] epoch 1, loss 0.1207
## INFO [11:44:59.538] epoch 2, loss 0.0785
## INFO [11:44:59.751] epoch 3, loss 0.0681
## INFO [11:44:59.945] epoch 4, loss 0.0619
## INFO [11:45:00.193] epoch 5, loss 0.0576
## INFO [11:45:00.431] epoch 6, loss 0.0543
## INFO [11:45:00.629] epoch 7, loss 0.0518
## INFO [11:45:00.836] epoch 8, loss 0.0498
## INFO [11:45:01.041] epoch 9, loss 0.0481
## INFO [11:45:01.253] epoch 10, loss 0.0467
## INFO [11:45:01.463] epoch 11, loss 0.0456
## INFO [11:45:01.672] epoch 12, loss 0.0445
## INFO [11:45:01.889] epoch 13, loss 0.0436
## INFO [11:45:02.091] epoch 14, loss 0.0429
## INFO [11:45:02.325] epoch 15, loss 0.0422
## INFO [11:45:02.538] epoch 16, loss 0.0415
## INFO [11:45:02.750] epoch 17, loss 0.0410
## INFO [11:45:02.965] epoch 18, loss 0.0405
## INFO [11:45:03.178] epoch 19, loss 0.0400
## INFO [11:45:03.392] epoch 20, loss 0.0396
dim(hp_wv_main) # check the dimensions
## [1] 8628 50
hp_wv_context <- glove$components # extract components
dim(hp_wv_context)
## [1] 50 8628
# Either word-vectors matrices could work, but the developers of the technique suggest the sum/mean may work better
hp_word_vectors <- hp_wv_main + t(hp_wv_context) # use the transpose of hp_wv_context to sum
dim(hp_word_vectors)
## [1] 8628 50
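The code above sums the two matrices; if you prefer the average, the equivalent is a one-line variation on the chunk above. Either way the cosine similarities below come out identical, since cosine similarity is unaffected by scaling a vector by a positive constant:
# average instead of sum; dividing every vector by 2 does not change
# the angles between them, so cosine similarities are unchanged
hp_word_vectors_avg <- (hp_wv_main + t(hp_wv_context)) / 2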
harry <- hp_word_vectors["harry", , drop = FALSE] # extract the word vector (row) for "harry"
cos_sim_harry <- sim2(x = hp_word_vectors, y = harry, method = "cosine", norm = "l2") # cosine similarity between "harry" and every word
head(sort(cos_sim_harry[,1], decreasing = TRUE), 10) # see the top 10 most similar words
## harry ron hermione moment time looked left happened
## 1.0000000 0.8854192 0.8784956 0.8305743 0.8244821 0.7876678 0.7752234 0.7434338
## quickly told
## 0.7352562 0.7095795
death <- hp_word_vectors["death", , drop = FALSE]
cos_sim_death <- sim2(x = hp_word_vectors, y = death, method = "cosine", norm = "l2")
head(sort(cos_sim_death[,1], decreasing = TRUE), 10)
## death eaters eater voldemort escaped managed killed fight
## 1.0000000 0.9396884 0.8842255 0.6746799 0.6158328 0.5658215 0.5457962 0.5436722
## sirius boy
## 0.5394298 0.5356705
love <- hp_word_vectors["love", , drop = FALSE]
cos_sim_love <- sim2(x = hp_word_vectors, y = love, method = "cosine", norm = "l2")
head(sort(cos_sim_love[,1], decreasing = TRUE), 10)
## love potion easy polyjuice truth forget bring dying
## 1.0000000 0.6076891 0.5821664 0.5168813 0.5108280 0.5060170 0.5046486 0.4884529
## finished telling
## 0.4832778 0.4796893
test <- hp_word_vectors["harry", , drop = FALSE] -
  hp_word_vectors["death", , drop = FALSE] +
  hp_word_vectors["love", , drop = FALSE]
cos_sim_test <- sim2(x = hp_word_vectors, y = test, method = "cosine", norm = "l2")
head(sort(cos_sim_test[,1], decreasing = TRUE), 10)
## hermione ron harry yeah finished told potion love
## 0.7214882 0.7040296 0.6585081 0.5949982 0.5672823 0.5526488 0.5464267 0.5296126
## thinking looked
## 0.5210628 0.5159595
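Notice that the input words themselves ("harry", "love") still rank high, which is common with analogy queries. A frequently used trick is to exclude the input words before ranking the candidates; a minimal sketch:
# rank the analogy candidates with the input words excluded
sims <- sort(cos_sim_test[, 1], decreasing = TRUE)
head(sims[!names(sims) %in% c("harry", "death", "love")], 10)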
The second part of the practical uses a larger corpus: the text8 data, the first 100 MB of a cleaned English Wikipedia dump. The following code downloads and unzips it into the data folder if it is not already there:
text8_file <- "data/text8"
if (!file.exists(text8_file)) {
  download.file("http://mattmahoney.net/dc/text8.zip", "data/text8.zip")
  unzip("data/text8.zip", files = "text8", exdir = "data")
}
wiki <- readLines(text8_file, n = 1, warn = FALSE) # the whole corpus is a single line of text
# same pipeline as for the Harry Potter data: tokenize, index,
# build a pruned vocabulary, and create the co-occurrence matrix
tokens <- space_tokenizer(wiki)
it <- itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5L)
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
#for reproducible results, do:
#RcppParallel::setThreadOptions(1)
#set.seed(123)
glove <- GlobalVectors$new(rank = 50, x_max = 10)
wv_main <- glove$fit_transform(tcm, n_iter = 20, convergence_tol = 0.001)
## INFO [11:45:46.459] epoch 1, loss 0.1746
## INFO [11:45:49.331] epoch 2, loss 0.1224
## INFO [11:45:52.204] epoch 3, loss 0.1082
## INFO [11:45:55.073] epoch 4, loss 0.1003
## INFO [11:45:57.964] epoch 5, loss 0.0952
## INFO [11:46:00.835] epoch 6, loss 0.0916
## INFO [11:46:03.775] epoch 7, loss 0.0889
## INFO [11:46:06.648] epoch 8, loss 0.0867
## INFO [11:46:09.532] epoch 9, loss 0.0850
## INFO [11:46:12.417] epoch 10, loss 0.0835
## INFO [11:46:15.276] epoch 11, loss 0.0823
## INFO [11:46:18.143] epoch 12, loss 0.0812
## INFO [11:46:20.994] epoch 13, loss 0.0803
## INFO [11:46:23.825] epoch 14, loss 0.0795
## INFO [11:46:26.676] epoch 15, loss 0.0788
## INFO [11:46:29.472] epoch 16, loss 0.0782
## INFO [11:46:32.246] epoch 17, loss 0.0776
## INFO [11:46:35.042] epoch 18, loss 0.0771
## INFO [11:46:37.812] epoch 19, loss 0.0766
## INFO [11:46:40.821] epoch 20, loss 0.0762
wv_context <- glove$components
word_vectors <- wv_main + t(wv_context)
We can now revisit the two famous examples:
king - man + woman = queen
Paris - France + Germany = Berlin
Again, note that you will likely get (somewhat) different results than the ones shown in this practical.
queen <- word_vectors["king", , drop = FALSE] -
  word_vectors["man", , drop = FALSE] +
  word_vectors["woman", , drop = FALSE]
queen_cos_sim <- sim2(x = word_vectors, y = queen, method = "cosine", norm = "l2")
head(sort(queen_cos_sim[,1], decreasing = TRUE), 50)
## king son henry alexander vii david
## 0.8868562 0.7432301 0.6873943 0.6550140 0.6545870 0.6531328
## father queen eldest charles iii philip
## 0.6332052 0.6329943 0.6315020 0.6284968 0.6231105 0.6221137
## whom kings viii portugal brother deposed
## 0.6215800 0.6150539 0.6141876 0.6089481 0.6078201 0.6069310
## emperor daughter prince wife married george
## 0.6023002 0.5993665 0.5965415 0.5959660 0.5955107 0.5934687
## scotland crowned judah james harrison charlemagne
## 0.5933798 0.5901516 0.5895918 0.5887537 0.5851587 0.5843726
## england vi reign xiv edward throne
## 0.5837143 0.5829379 0.5798846 0.5793959 0.5769802 0.5719025
## louis jr hungary bohemia austria met
## 0.5718839 0.5709790 0.5705086 0.5679182 0.5672911 0.5610626
## saul ii franks pope kingdom born
## 0.5605357 0.5593670 0.5591232 0.5543771 0.5505122 0.5467646
## succeeded fled
## 0.5458880 0.5458224
#check the ranking of the word queen among all the words using:
#match('queen', names(sort(queen_cos_sim[,1], decreasing = TRUE)))
berlin <- word_vectors["paris", , drop = FALSE] -
  word_vectors["france", , drop = FALSE] +
  word_vectors["germany", , drop = FALSE]
berlin_cos_sim <- sim2(x = word_vectors, y = berlin, method = "cosine", norm = "l2")
head(sort(berlin_cos_sim[,1], decreasing = TRUE), 5)
## paris near berlin munich at
## 0.7122870 0.6979520 0.6781777 0.6660660 0.6520105
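Here berlin comes third, behind the input word paris and the word near. As with the queen example, you can check its exact rank with match(), and also rank the candidates with the input words excluded; a small sketch reusing the idiom from above:
# rank of "berlin" among all candidates
match('berlin', names(sort(berlin_cos_sim[,1], decreasing = TRUE)))
# ranking with the input words excluded
berlin_sims <- sort(berlin_cos_sim[,1], decreasing = TRUE)
head(berlin_sims[!names(berlin_sims) %in% c("paris", "france", "germany")], 5)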
In this practical, we learned about:
- representing words as numeric vectors (word embeddings);
- building a vocabulary and a term co-occurrence matrix with text2vec;
- training a GloVe model and combining its main and context word vectors;
- using cosine similarity to find similar words and to solve word analogies.
End of practical