Welcome to the second practical of the course “Introduction to Text Mining with R”.
In this practical, we will work with text data, cleaning tools and try some visualizations.
As always we start with the packages we are going to use. Be sure to run these lines in your session to load the proper packages before you continue. If there are packages that you have not yet installed, first install them with `install.packages()`.
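For example, you could install every package used in this practical in one go (you only need to do this once, which is why the line is commented out):
# install.packages(c("tidyverse", "SnowballC", "tokenizers", "tidytext", "wordcloud"))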
library(tidyverse) # for data manipulation
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.4.2 ✔ purrr 1.0.1
## ✔ tibble 3.2.1 ✔ dplyr 1.1.2
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.4 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr) # for data manipulation
library(ggplot2) # for visualization
library(SnowballC) # for stemming
library(tokenizers) # for tokenisation
library(tidytext) # for counting words
library(wordcloud) # to create pretty word clouds
## Loading required package: RColorBrewer
Before starting the exercises, you need to set your working directory to your Practicals folder. To this end, you can either create a project in RStudio and move the Rmd and data files to the project folder, or you can use the following line instead of creating a new project:
Do not forget to adjust your working directory to the right folder.
# setwd("path/to/your/data")
In this practical, we will be working with a dataset of book reviews from Goodreads. You can download the data from https://ayoubbagheri.nl/r_tm/#monday.
The data consists of two files:
- goodreads_english.csv contains only reviews in English
- goodreads_mixed.csv contains reviews in various languages
For now, we will be working with the English reviews. Use the read.csv function to import the goodreads_english.csv and goodreads_mixed.csv files:
goodreads_english <- read.csv("goodreads_english.csv", stringsAsFactors=FALSE)
goodreads_mixed <- read.csv("goodreads_mixed.csv", stringsAsFactors=FALSE)
We can inspect the first few lines to see what the data looks like.
head(goodreads_english)
How many rows do we have?
nrow(goodreads_english)
## [1] 10000
We will go through a few common steps in text cleaning, where we make basic replacements in a text.
Let’s take one review as an example and see what we can do.
example_text = goodreads_english[1, 'text']
example_text
## [1] "I read this book once back in 1986 while I was walking through the forests where this takes place. I then read it again in 2008. The magic of the book is in the setting - colonial days in a forest on the southern tip of Africa where the elephants hide."
tolower(example_text)
## [1] "i read this book once back in 1986 while i was walking through the forests where this takes place. i then read it again in 2008. the magic of the book is in the setting - colonial days in a forest on the southern tip of africa where the elephants hide."
The string below includes some double spaces, as well as a few newlines.
some_text <- paste("I loved this book!", "It was  great.", "Looking forward  to the sequel.", sep="\n")
cat(some_text)
## I loved this book!
## It was  great.
## Looking forward  to the sequel.
What’s wrong with this? When we have a “clean” string, we can easily split the text into words by splitting on every space " ":
strsplit("I loved this book!", " ")
## [[1]]
## [1] "I" "loved" "this" "book!"
But with a messy string, this gets tricky:
strsplit(some_text, " ")
## [[1]]
## [1] "I" "loved" "this" "book!\nIt"
## [5] "was" "" "great.\nLooking" "forward"
## [9] "" "to" "the" "sequel."
Let’s try to clean that up. The function gsub is the most basic way to make replacements in a string.
gsub("bad", "good", "it was a bad book")
## [1] "it was a good book"
To remove something, we can just replace it with nothing:
gsub("not", "", "I did not like the book")
## [1] "I did like the book"
Exercise 1: try to clean up some_text: it should have no double spaces or newlines. To see if it works, try using strsplit on your result.
oneline <- gsub("\n", " ", some_text) # replace newlines with spaces
cleaned <- gsub("  ", " ", oneline) # replace double spaces with single spaces
cleaned
## [1] "I loved this book! It was great. Looking forward to the sequel."
strsplit(cleaned, " ")
## [[1]]
## [1] "I" "loved" "this" "book!" "It" "was" "great."
## [8] "Looking" "forward" "to" "the" "sequel."
Removing numbers or punctuation is also a kind of replacement. To replace all the numbers, we could write some code like so:
replaced <- example_text
replaced <- gsub("0", "", replaced)
replaced <- gsub("1", "", replaced)
replaced <- gsub("2", "", replaced)
# etc...
But this is way too much typing. Luckily, gsub also takes regular expressions, which we saw in the previous practical. If you have trouble understanding the expressions below, don’t worry too much about it. For now, the important thing is that regular expressions can be used to describe patterns like “all the numbers” or “periods, commas and hyphens”.
gsub("[0-9]", "", example_text)
## [1] "I read this book once back in while I was walking through the forests where this takes place. I then read it again in . The magic of the book is in the setting - colonial days in a forest on the southern tip of Africa where the elephants hide."
gsub("[\\.,\\-]", "", example_text)
## [1] "I read this book once back in 1986 while I was walking through the forests where this takes place I then read it again in 2008 The magic of the book is in the setting colonial days in a forest on the southern tip of Africa where the elephants hide"
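As an aside, regular expression character classes can be combined, so several replacements can be collapsed into fewer calls. A minimal sketch using the POSIX classes [:punct:] and [:digit:], which match punctuation and digits respectively:
compact <- tolower(example_text)
compact <- gsub("[[:punct:][:digit:]]", "", compact) # drop punctuation and digits in one pass
compact <- gsub("[ \n]+", " ", compact) # collapse any run of spaces/newlines into one space
compact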
Exercise 2: combine all the steps we’ve used so far on example_text.
lowercase <- tolower(example_text)
no_numbers <- gsub("[0-9]", "", lowercase) # remove digits
no_punct <- gsub("[\\.,\\-]", "", no_numbers) # remove periods, commas and hyphens
oneline <- gsub("\n", " ", no_punct) # replace newlines with spaces
clean <- gsub("  ", " ", oneline) # replace double spaces with single spaces
clean
## [1] "i read this book once back in while i was walking through the forests where this takes place i then read it again in the magic of the book is in the setting colonial days in a forest on the southern tip of africa where the elephants hide"
As you can see, there are quite a few steps involved! However, these steps are very common, and as we will see, there are packages that make this easy for us.
We already saw strsplit as a way to split a text into words, but that function is quite basic. From now on, we will use the tokenizers package to split our text into words. Let’s try running it with its default options:
example_words <- tokenize_words(example_text)
example_words
## [[1]]
## [1] "i" "read" "this" "book" "once" "back"
## [7] "in" "1986" "while" "i" "was" "walking"
## [13] "through" "the" "forests" "where" "this" "takes"
## [19] "place" "i" "then" "read" "it" "again"
## [25] "in" "2008" "the" "magic" "of" "the"
## [31] "book" "is" "in" "the" "setting" "colonial"
## [37] "days" "in" "a" "forest" "on" "the"
## [43] "southern" "tip" "of" "africa" "where" "the"
## [49] "elephants" "hide"
As you can see, tokenize_words does a bit more than just tokenisation. For example, all the text is in lowercase now. Try running help(tokenize_words) to see all the options that the function offers, and adjust example_words so it does not include "1986" and "2008".
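A possible solution, using the strip_numeric option of tokenize_words:
example_words <- tokenize_words(example_text, strip_numeric=TRUE) # drop purely numeric tokens such as "1986"
example_words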
The SnowballC package allows us to stem words:
wordStem(c("text", "texts", "texting"))
## [1] "text" "text" "text"
The tokenize_words function also has a handy variant, tokenize_word_stems, which uses the SnowballC stemmer:
tokenize_word_stems(example_text)
## [[1]]
## [1] "i" "read" "this" "book" "onc" "back"
## [7] "in" "1986" "while" "i" "was" "walk"
## [13] "through" "the" "forest" "where" "this" "take"
## [19] "place" "i" "then" "read" "it" "again"
## [25] "in" "2008" "the" "magic" "of" "the"
## [31] "book" "is" "in" "the" "set" "coloni"
## [37] "day" "in" "a" "forest" "on" "the"
## [43] "southern" "tip" "of" "africa" "where" "the"
## [49] "eleph" "hide"
tokenize_word_stems uses the English stemmer by default. We can get an overview of all supported languages with:
getStemLanguages()
## [1] "arabic" "basque" "catalan" "danish" "dutch"
## [6] "english" "finnish" "french" "german" "greek"
## [11] "hindi" "hungarian" "indonesian" "irish" "italian"
## [16] "lithuanian" "nepali" "norwegian" "porter" "portuguese"
## [21] "romanian" "russian" "spanish" "swedish" "tamil"
## [26] "turkish"
Exercise 3: Use tokenize_word_stems on some Dutch text. Use help to see how you can adjust the language of the stemmer.
example_dutch <- goodreads_mixed[14, "text"]
#help("tokenize_word_stems")
tokenize_word_stems(example_dutch, "dutch")
## [[1]]
## [1] "hoewel" "hij" "al"
## [4] "op" "veertienjar" "leeftijd"
## [7] "zijn" "eerst" "boek"
## [10] "schref" "begon" "de"
## [13] "zuid" "afrikan" "deon"
## [16] "meyer" "pas" "na"
## [19] "zijn" "dertigst" "serieus"
## [22] "over" "het" "schrijv"
## [25] "van" "boek" "na"
## [28] "te" "denk" "niet"
## [31] "vel" "later" "was"
## [34] "zijn" "eerst" "boek"
## [37] "een" "feit" "de"
## [40] "thriller" "wie" "met"
## [43] "vur" "spel" "dat"
## [46] "allen" "in" "het"
## [49] "afrikan" "is" "uitgekom"
## [52] "zijn" "later" "werk"
## [55] "werd" "allemal" "wel"
## [58] "vertaald" "mar" "echt"
## [61] "bekend" "werd" "hij"
## [64] "met" "zijn" "serie"
## [67] "waarin" "inspecteur" "bennie"
## [70] "griessel" "het" "hoofdpersonag"
## [73] "is" "de" "serie"
## [76] "begon" "met" "duivelspiek"
## [79] "en" "kreg" "een"
## [82] "vervolg" "met" "13"
## [85] "uur" "dat" "dor"
## [88] "vn" "in" "2012"
## [91] "tot" "thriller" "van"
## [94] "het" "jar" "is"
## [97] "uitgeroep" "de" "amerikan"
## [100] "vriendinn" "erin" "russel"
## [103] "en" "rachel" "anderson"
## [106] "reiz" "met" "een"
## [109] "groep" "dor" "afrika"
## [112] "in" "kaapstad" "wordt"
## [115] "een" "van" "hen"
## [118] "vermoord" "en" "de"
## [121] "ander" "achtervolgd" "inspecteur"
## [124] "bennie" "griessel" "krijgt"
## [127] "de" "leiding" "over"
## [130] "het" "onderzoek" "nar"
## [133] "de" "moord" "mar"
## [136] "moet" "er" "ook"
## [139] "vor" "zorg" "dat"
## [142] "het" "gevlucht" "meisj"
## [145] "dat" "inmiddel" "verdwen"
## [148] "is" "levend" "wordt"
## [151] "teruggevond" "ergen" "ander"
## [154] "in" "de" "stad"
## [157] "wordt" "in" "zijn"
## [160] "woning" "het" "licham"
## [163] "van" "een" "man"
## [166] "gevond" "en" "de"
## [169] "verdenk" "valt" "al"
## [172] "snel" "op" "zijn"
## [175] "alcoholistisch" "vrouw" "ook"
## [178] "dez" "zak" "zal"
## [181] "dor" "griessel" "gecoordineerd"
## [184] "moet" "word" "wat"
## [187] "hebb" "dez" "zak"
## [190] "gemen" "en" "word"
## [193] "ze" "op" "tijd"
## [196] "opgelost" "zoal" "de"
## [199] "titel" "van" "het"
## [202] "boek" "al" "suggereert"
## [205] "speelt" "het" "verhal"
## [208] "zich" "in" "dertien"
## [211] "uur" "af" "die"
## [214] "kort" "tijdspann" "zorgt"
## [217] "er" "wel" "vor"
## [220] "dat" "het" "een"
## [223] "enorm" "tempo" "heeft"
## [226] "en" "het" "verhal"
## [229] "bij" "wijz" "van"
## [232] "sprek" "voorbij" "is"
## [235] "vor" "je" "er"
## [238] "erg" "in" "hebt"
## [241] "omdat" "meyer" "zich"
## [244] "dor" "dez" "opzet"
## [247] "gen" "inleid" "kon"
## [250] "permitter" "ben" "je"
## [253] "als" "lezer" "al"
## [256] "vanaf" "de" "eerst"
## [259] "bladzijd" "bij" "het"
## [262] "verhal" "betrok" "je"
## [265] "zit" "er" "meten"
## [268] "volop" "in" "en"
## [271] "pas" "als" "het"
## [274] "boek" "is" "dichtgeslag"
## [277] "kun" "je" "op"
## [280] "adem" "kom" "daarbij"
## [283] "kenmerkt" "de" "plot"
## [286] "zich" "ook" "nog"
## [289] "een" "dor" "diver"
## [292] "onverwacht" "en" "dus"
## [295] "verrass" "ontwikkel" "het"
## [298] "zijn" "stuk" "vor"
## [301] "stuk" "ingredient" "13"
## [304] "uur" "van" "een"
## [307] "flink" "dosis" "spanning"
## [310] "hebb" "voorzien" "het"
## [313] "verhal" "bestat" "uit"
## [316] "twee" "verhaallijn" "die"
## [319] "van" "de" "moord"
## [322] "op" "het" "ene"
## [325] "en" "de" "vlucht"
## [328] "en" "verdwijn" "van"
## [331] "het" "ander" "meisj"
## [334] "en" "ook" "die"
## [337] "van" "de" "moord"
## [340] "op" "de" "man"
## [343] "die" "een" "bekend"
## [346] "platenbas" "blijkt" "te"
## [349] "zijn" "gedur" "de"
## [352] "plot" "wijst" "niet"
## [355] "erop" "dat" "beid"
## [358] "verhal" "met" "elkar"
## [361] "te" "mak" "hebb"
## [364] "dat" "hebb" "ze"
## [367] "in" "feit" "ook"
## [370] "niet" "mar" "in"
## [373] "de" "ontknop" "blijk"
## [376] "ze" "toch" "een"
## [379] "raakvlak" "te" "hebb"
## [382] "iet" "dat" "de"
## [385] "lezer" "lang" "tijd"
## [388] "niet" "ziet" "aankom"
## [391] "de" "onderzoek" "in"
## [394] "die" "afzonder" "verhal"
## [397] "word" "verricht" "dor"
## [400] "een" "aantal" "nieuwbak"
## [403] "inspecteur" "griessel" "is"
## [406] "hun" "mentor" "interessant"
## [409] "personages" "die" "als"
## [412] "ze" "ook" "nog"
## [415] "in" "de" "rest"
## [418] "van" "de" "serie"
## [421] "vor" "gan" "kom"
## [424] "allen" "mar" "kunn"
## [427] "groei" "meyer" "zou"
## [430] "meyer" "niet" "zijn"
## [433] "als" "hij" "ook"
## [436] "in" "13" "uur"
## [439] "gen" "maatschapp" "problem"
## [442] "nar" "vor" "brengt"
## [445] "dit" "problem" "is"
## [448] "niet" "prominent" "aanwez"
## [451] "mar" "het" "sluimert"
## [454] "dor" "het" "verhal"
## [457] "hen" "want" "hij"
## [460] "maakt" "hel" "erg"
## [463] "duidelijk" "dat" "het"
## [466] "land" "nog" "sted"
## [469] "te" "kamp" "heeft"
## [472] "met" "de" "gevolg"
## [475] "van" "de" "apart"
## [478] "ook" "de" "nog"
## [481] "sted" "aanwez" "corruptie"
## [484] "van" "politiefunctionariss" "maakt"
## [487] "del" "uit" "van"
## [490] "het" "verhal" "daarin"
## [493] "heeft" "het" "land"
## [496] "nog" "sted" "gen"
## [499] "positiev" "verander" "ondergan"
## [502] "dez" "problematiek" "maakt"
## [505] "het" "verhal" "bijzonder"
## [508] "realistisch" "dat" "geldt"
## [511] "overigen" "ook" "vor"
## [514] "de" "beeldend" "en"
## [517] "levend" "beschrijv" "die"
## [520] "de" "auteur" "hanteert"
## [523] "toch" "kan" "er"
## [526] "een" "klein" "wat"
## [529] "kritisch" "kantteken" "word"
## [532] "geplaatst" "een" "van"
## [535] "de" "nieuw" "inspecteur"
## [538] "mbali" "kaleni" "is"
## [541] "nogal" "for" "en"
## [544] "dar" "legt" "meyer"
## [547] "som" "onnod" "de"
## [550] "nadruk" "op" "dat"
## [553] "weegt" "echter" "niet"
## [556] "op" "teg" "het"
## [559] "verhal" "op" "zich"
## [562] "want" "dat" "stat"
## [565] "als" "een" "huis"
## [568] "zonder" "enig" "vorm"
## [571] "van" "twijfel" "kan"
## [574] "word" "gesteld" "dat"
## [577] "13" "uur" "een"
## [580] "thriller" "van" "format"
## [583] "is"
We have only worked on a single review so far, but we want to work on our entire dataset. We will use tidytext to work with our data. Here we reshape our goodreads_english dataframe: each review is split over multiple rows, so we have only one word per row.
goodreads_by_word <- goodreads_english %>%
unnest_tokens(word, text)
head(goodreads_by_word)
Now we can use count to get the most common elements in the word column:
goodreads_counts <- goodreads_by_word %>%
count(word, sort=TRUE)
goodreads_counts
Let’s try visualising our results in a wordcloud.
goodreads_counts %>%
with(wordcloud(word, n, max.words=100))
How many words are there in this dataset anyway? In text mining, we differentiate between types and tokens. The number of types tells us how many different words occur, and the number of tokens tells us how many words occur in total. For example, “to be or not to be” has 4 types (to, be, or, and not), and 6 tokens (to, be, or, not, to, and be).
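To make this concrete, here is a quick sketch of counting types and tokens for that example sentence:
toy <- tokenize_words("to be or not to be")[[1]]
length(unique(toy)) # number of types: 4
length(toy) # number of tokens: 6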
Exercise 4: How many types are in goodreads_counts? And how many tokens?
# number of types
nrow(goodreads_counts)
## [1] 30839
# number of tokens
sum(goodreads_counts['n'])
## [1] 1326809
Exercise 5: The unnest_tokens function that we used above takes the same arguments as tokenize_words: you can tweak whether you want to turn everything into lowercase and whether to clean up punctuation or numbers. Create a new table where you adjust one of these options, and report on how it affects the number of types and tokens.
goodreads_count_no_lowercasing <- goodreads_english %>%
unnest_tokens(word, text, to_lower=FALSE) %>%
count(word)
print('Number of types:')
## [1] "Number of types:"
print(nrow(goodreads_count_no_lowercasing))
## [1] 36137
print('Number of tokens:')
## [1] "Number of tokens:"
print(sum(goodreads_count_no_lowercasing['n']))
## [1] 1326809
# Skipping lowercasing does not affect the number of tokens, but it increases the number of types: "Book" and "book" now count as separate types.
The wordcloud we made in the previous section is nice, but it contains a lot of generic words like “the”, “and”, etc. Let’s try removing stopwords.
We will use get_stopwords from the tidytext package to get a list of stopwords in English.
stopwords_english <- get_stopwords()
stopwords_english
Now, we want to exclude these from our table of words. We can use anti_join, which means that we cross-reference with another table and remove everything that is also in the other table. Then, we use count again to get a new frequency table.
goodreads_without_stopwords <- goodreads_by_word %>%
anti_join(stopwords_english, by='word') %>%
count(word, sort=TRUE)
head(goodreads_without_stopwords)
This already looks a lot more informative: the most frequent words include book, read, story, and characters. They tell us something about what this dataset is about. We can make another wordcloud that does not include the stopwords:
goodreads_without_stopwords %>%
with(wordcloud(word, n, max.words=100))
Exercise 6: We can use filter to select the reviews of a specific book, like so. Make a word cloud for the reviews of Nineteen eighty-four and Gone girl and compare them side-by-side. Do they look very different?
goodreads_english %>%
filter(book_title == "Nineteen eighty-four")
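A possible solution, assuming the titles appear in book_title exactly as "Nineteen eighty-four" and "Gone girl" (check unique(goodreads_english$book_title) if unsure):
par(mfrow=c(1, 2)) # draw the two clouds side by side
goodreads_english %>%
  filter(book_title == "Nineteen eighty-four") %>%
  unnest_tokens(word, text) %>%
  anti_join(stopwords_english, by="word") %>%
  count(word, sort=TRUE) %>%
  with(wordcloud(word, n, max.words=100))
goodreads_english %>%
  filter(book_title == "Gone girl") %>%
  unnest_tokens(word, text) %>%
  anti_join(stopwords_english, by="word") %>%
  count(word, sort=TRUE) %>%
  with(wordcloud(word, n, max.words=100))
par(mfrow=c(1, 1)) # reset the plotting layout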
tidytext provides a table parts_of_speech which lists potential parts of speech for English words. We can use this for a very basic way of assigning a POS to each word, by looking up each word in the table. Here we do just that: left_join means that for each row in goodreads_by_word, we look up the corresponding row in parts_of_speech and combine the data.
goodreads_pos <- goodreads_by_word %>%
left_join(parts_of_speech, by="word", multiple="first")
head(goodreads_pos)
Be aware that this is a very crude way of assigning POS tags. We usually want to take the rest of the sentence into account, so we can differentiate between two sentences like “I read the book” (where “read” is a verb) and “this book is a great read” (where “read” is a noun).
You can see this going wrong in the table above: “read” is classified as a noun instead of a verb. However, we can accept some inaccuracy for our purposes here.
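We can look up “read” in the lookup table directly; the table may list several possible parts of speech for one word, and a plain lookup has no way of knowing which one is meant in a given review:
parts_of_speech %>%
  filter(word == "read")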
Exercise 7: Try using filter to get a table containing only adjectives. Then filter on the rating so you can compare the adjectives used in the highest ratings (5) with those used in the lowest ratings (1). Make a wordcloud for both.
goodreads_adjectives <- goodreads_pos %>%
filter(pos == "Adjective")
goodreads_adjectives %>%
filter(rating == 5) %>%
count(word, sort=TRUE) %>%
with(wordcloud(word, n, max.words=100))
goodreads_adjectives %>%
filter(rating == 1) %>%
count(word, sort=TRUE) %>%
with(wordcloud(word, n, max.words=100))
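As an alternative to two separate clouds, the wordcloud package also provides comparison.cloud, which plots both groups in a single cloud. A sketch: it expects a matrix with one column per group and the words as row names, which we can build with pivot_wider (from tidyr) and column_to_rownames (from tibble), both loaded with the tidyverse:
adjectives_1_vs_5 <- goodreads_adjectives %>%
  filter(rating %in% c(1, 5)) %>%
  count(word, rating) %>%
  pivot_wider(names_from=rating, values_from=n, values_fill=0) %>% # one column per rating
  column_to_rownames("word") %>% # comparison.cloud wants words as row names
  as.matrix()
comparison.cloud(adjectives_1_vs_5, max.words=100)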
goodreads_mixed contains reviews in a mix of languages. We can see how many reviews we have for each:
goodreads_mixed %>%
count(language, sort=TRUE)
Exercise 8: Filter the Spanish reviews (or pick another language, if you like). Go through the preprocessing steps we have covered so far and visualise the results. Think about which steps would need to be adjusted based on the language, and adjust them accordingly.
# filter reviews
goodreads_spanish <- goodreads_mixed %>%
filter(language == "Spanish")
head(goodreads_spanish)
# tokenise
goodreads_spanish_words <- goodreads_spanish %>%
unnest_tokens(word, text)
head(goodreads_spanish_words)
# remove stopwords
goodreads_spanish_words_clean <- goodreads_spanish_words %>%
anti_join(get_stopwords(language="es"), by="word")
goodreads_spanish_words_clean %>%
count(word) %>%
with(wordcloud(word, n, max.words=100))
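If you also want to stem the Spanish words, note that Spanish is among the languages listed by getStemLanguages(), so we can apply wordStem to the word column. A sketch:
goodreads_spanish_words_clean %>%
  mutate(stem = wordStem(word, language="spanish")) %>% # stem each word with the Spanish Snowball stemmer
  count(stem, sort=TRUE) %>%
  with(wordcloud(stem, n, max.words=100))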