Introduction

Welcome to the second practical of the course “Introduction to Text Mining with R”.

In this practical, we will work with text data, apply some cleaning tools, and try out a few visualizations.

As always, we start with the packages we are going to use. Be sure to run these lines in your session to load the packages before you continue. If there are packages that you have not yet installed, first install them with install.packages().

library(tidyverse) # for data manipulation
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.4.2     ✔ purrr   1.0.1
## ✔ tibble  3.2.1     ✔ dplyr   1.1.2
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.4     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(dplyr)     # for data manipulation
library(ggplot2)   # for visualization
library(SnowballC) # for stemming
library(tokenizers) # for tokenisation
library(tidytext)  # for counting words
library(wordcloud) # to create pretty word clouds
## Loading required package: RColorBrewer
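
If you are not sure which packages are still missing, a small helper can check this for you. The sketch below is one possible approach and not part of the original practical: the function name missing_packages is made up for illustration, and the package list is simply copied from the library() calls above.

```r
# Hypothetical helper: which of the needed packages are not installed yet?
missing_packages <- function(needed,
                             installed = rownames(installed.packages())) {
  setdiff(needed, installed)
}

# Package names copied from the library() calls in this practical
needed <- c("tidyverse", "SnowballC", "tokenizers", "tidytext", "wordcloud")

# Uncomment to install whatever is missing:
# install.packages(missing_packages(needed))
```

The install line is left commented out so that nothing is installed by accident when you run the whole script.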

Before starting the exercises, you need to set your working directory to your Practicals folder. You can either create a project in RStudio and move the Rmd and data files to the project folder, or use the following line instead of creating a new project. Do not forget to adjust the path to the right folder.

# setwd("path/to/your/data")

Reading data: CSV files

In this practical, we will be working with a dataset of book reviews from Goodreads. You can download the data from https://ayoubbagheri.nl/r_tm/#monday.

The data consists of two files:

  • goodreads_english.csv contains only reviews in English
  • goodreads_mixed.csv contains reviews in various languages

For now, we will be working with the English reviews.

  1. Use the read.csv function to import the goodreads_english.csv and goodreads_mixed.csv files.
goodreads_english <- read.csv("goodreads_english.csv", stringsAsFactors=FALSE)
goodreads_mixed <- read.csv("goodreads_mixed.csv", stringsAsFactors=FALSE)

We can inspect the first few lines to see what the data looks like.

head(goodreads_english)

How many rows do we have?

nrow(goodreads_english)
## [1] 10000

Cleaning

We will go through a few common steps in text cleaning, where we make basic replacements in a text.

Let’s take one review as an example and see what we can do.

example_text <- goodreads_english[1, 'text']

example_text
## [1] "I read this book once back in 1986 while I was walking through the forests where this takes place. I then read it again in 2008. The magic of the book is in the setting - colonial days in a forest on the southern tip of Africa where the elephants hide."

All words in lowercase

tolower(example_text)
## [1] "i read this book once back in 1986 while i was walking through the forests where this takes place. i then read it again in 2008. the magic of the book is in the setting - colonial days in a forest on the southern tip of africa where the elephants hide."

Clean up whitespace

The string below includes a double space, as well as a few newlines.

some_text <- paste("I loved this book!", "It was  great.", "Looking forward  to the sequel.", sep="\n")
cat(some_text)
## I loved this book!
## It was  great.
## Looking forward  to the sequel.

What’s wrong with this? When we have a “clean” string, we can easily split the text into words by splitting on every space " ":

strsplit("I loved this book!", " ")
## [[1]]
## [1] "I"     "loved" "this"  "book!"

But with a messy string, this gets tricky:

strsplit(some_text, " ")
## [[1]]
##  [1] "I"               "loved"           "this"            "book!\nIt"      
##  [5] "was"             ""                "great.\nLooking" "forward"        
##  [9] ""                "to"              "the"             "sequel."

Let’s try to clean that up. The function gsub is the most basic way to make replacements in a string.

gsub("bad", "good", "it was a bad book")
## [1] "it was a good book"

To remove something, we can just replace it with nothing:

gsub("not", "", "I did not like the book")
## [1] "I did  like the book"

Exercise 1: Try to clean up some_text: it should have no double spaces or newlines. To see if it works, try using strsplit on your result.

oneline <- gsub("\n", " ", some_text)
cleaned <- gsub("  ", " ", oneline)

cleaned
## [1] "I loved this book! It was great. Looking forward to the sequel."
strsplit(cleaned, " ")
## [[1]]
##  [1] "I"       "loved"   "this"    "book!"   "It"      "was"     "great." 
##  [8] "Looking" "forward" "to"      "the"     "sequel."
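
Replacing exactly two spaces works for some_text, but it leaves a double space behind whenever there are three or more spaces in a row. As a possible alternative (a sketch, not the solution intended by the exercise), we can match any run of whitespace with a regular expression. The string messy below is a made-up example.

```r
# Made-up example with a triple space, a double newline and a trailing space
messy <- "I loved   this book!\n\nIt was  great. "

# [[:space:]]+ matches one or more whitespace characters (spaces, tabs, newlines)
squashed <- gsub("[[:space:]]+", " ", messy)

# trimws() removes whitespace at the start and end of the string
tidy <- trimws(squashed)

tidy
strsplit(tidy, " ")
```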

Remove numbers and punctuation

Removing numbers or punctuation is also a kind of replacement. To replace all the numbers, we could write some code like so:

replaced <- example_text
replaced <- gsub("0", "", replaced)
replaced <- gsub("1", "", replaced)
replaced <- gsub("2", "", replaced)
# etc...

But this is way too much typing. Luckily, gsub also takes regular expressions, which we saw in the previous practical. If you have trouble understanding the expressions below, don’t worry too much about it. For now, the important thing is that regular expressions can be used to describe patterns like “all the numbers” or “periods, commas and hyphens”.

gsub("[0-9]", "", example_text)
## [1] "I read this book once back in  while I was walking through the forests where this takes place. I then read it again in . The magic of the book is in the setting - colonial days in a forest on the southern tip of Africa where the elephants hide."
gsub("[\\.,\\-]", "", example_text)
## [1] "I read this book once back in 1986 while I was walking through the forests where this takes place I then read it again in 2008 The magic of the book is in the setting  colonial days in a forest on the southern tip of Africa where the elephants hide"
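
Instead of listing characters by hand, regular expressions also offer predefined character classes. The sketch below uses [[:digit:]] and [[:punct:]] on a made-up sentence; note that [[:punct:]] removes all punctuation, which is more than the three characters we removed above.

```r
# Made-up example sentence
sample_review <- "Read it in 1986, then again in 2008 - loved it!"

# [[:digit:]] matches any digit, just like [0-9]
no_digits <- gsub("[[:digit:]]", "", sample_review)

# [[:punct:]] matches any punctuation character, including "!" and "-"
no_punct <- gsub("[[:punct:]]", "", sample_review)

no_digits
no_punct
```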

Putting it together

Exercise 2: combine all the steps we’ve used so far on example_text

lowercase <- tolower(example_text)
no_numbers <- gsub("[0-9]", "", lowercase)
no_punct <- gsub("[\\.,\\-]", "", no_numbers)
oneline <- gsub("\n", " ", no_punct)
clean <- gsub("  ", " ", oneline)

clean
## [1] "i read this book once back in while i was walking through the forests where this takes place i then read it again in the magic of the book is in the setting colonial days in a forest on the southern tip of africa where the elephants hide"

As you can see, there are quite a few steps involved! However, these steps are very common, and as we will see, there are packages that make this easy for us.
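
If we want to reuse these steps, we could wrap them in a function. This is only a sketch: the name clean_text is made up, and it collapses runs of whitespace with [[:space:]]+ rather than replacing newlines and double spaces separately.

```r
# Hypothetical helper that bundles the cleaning steps from this section
clean_text <- function(x) {
  x <- tolower(x)                    # all words in lowercase
  x <- gsub("[0-9]", "", x)          # remove numbers
  x <- gsub("[.,-]", "", x)          # remove periods, commas and hyphens
  x <- gsub("[[:space:]]+", " ", x)  # collapse newlines and repeated spaces
  trimws(x)                          # drop leading and trailing whitespace
}

clean_text("Back in 1986, I loved  it.\nStill do - in 2008.")
```

Because tolower, gsub and trimws all work on character vectors, the same function can be applied to a whole column of reviews at once.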

Tokenising

We already saw strsplit as a way to split a text into words, but that function is quite basic. From now on, we will use the tokenizers package to split our text into words. Let’s try running it with its default options:

example_words <- tokenize_words(example_text)

example_words
## [[1]]
##  [1] "i"         "read"      "this"      "book"      "once"      "back"     
##  [7] "in"        "1986"      "while"     "i"         "was"       "walking"  
## [13] "through"   "the"       "forests"   "where"     "this"      "takes"    
## [19] "place"     "i"         "then"      "read"      "it"        "again"    
## [25] "in"        "2008"      "the"       "magic"     "of"        "the"      
## [31] "book"      "is"        "in"        "the"       "setting"   "colonial" 
## [37] "days"      "in"        "a"         "forest"    "on"        "the"      
## [43] "southern"  "tip"       "of"        "africa"    "where"     "the"      
## [49] "elephants" "hide"

As you can see, tokenize_words does a bit more than just tokenisation. For example, all the text is in lowercase now. Try running help(tokenize_words) to see all the options that the function offers, and adjust example_words so it does not include "1986" and "2008".
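
One possible way to do this is the strip_numeric argument of tokenize_words. The sketch below uses a made-up sentence instead of example_text so that it runs on its own.

```r
library(tokenizers)

# Made-up sentence containing two numbers
sentence <- "I read it in 1986 and again in 2008."

# strip_numeric = TRUE drops tokens that consist of numbers
tokens <- tokenize_words(sentence, strip_numeric = TRUE)

tokens
```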

Stemming

The SnowballC package allows us to stem words:

wordStem(c("text", "texts", "texting"))
## [1] "text" "text" "text"

The tokenize_words function also has a handy variant, tokenize_word_stems, which uses the SnowballC stemmer:

tokenize_word_stems(example_text)
## [[1]]
##  [1] "i"        "read"     "this"     "book"     "onc"      "back"    
##  [7] "in"       "1986"     "while"    "i"        "was"      "walk"    
## [13] "through"  "the"      "forest"   "where"    "this"     "take"    
## [19] "place"    "i"        "then"     "read"     "it"       "again"   
## [25] "in"       "2008"     "the"      "magic"    "of"       "the"     
## [31] "book"     "is"       "in"       "the"      "set"      "coloni"  
## [37] "day"      "in"       "a"        "forest"   "on"       "the"     
## [43] "southern" "tip"      "of"       "africa"   "where"    "the"     
## [49] "eleph"    "hide"

Stemming in other languages

tokenize_word_stems uses an English stemmer by default, but stemmers are available for other languages too. We can get an overview of the supported languages with:

getStemLanguages()
##  [1] "arabic"     "basque"     "catalan"    "danish"     "dutch"     
##  [6] "english"    "finnish"    "french"     "german"     "greek"     
## [11] "hindi"      "hungarian"  "indonesian" "irish"      "italian"   
## [16] "lithuanian" "nepali"     "norwegian"  "porter"     "portuguese"
## [21] "romanian"   "russian"    "spanish"    "swedish"    "tamil"     
## [26] "turkish"
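
To get a feeling for why the language matters, we can stem the same words with two different stemmers. This is a small sketch using wordStem directly; the three Dutch words are made-up examples, and no expected output is shown here, so run it to see how the results differ.

```r
library(SnowballC)

# A few Dutch words, chosen for illustration
dutch_words <- c("boeken", "schrijven", "verhalen")

# Stem the same words with the English and the Dutch stemmer
stems_en <- wordStem(dutch_words, language = "english")
stems_nl <- wordStem(dutch_words, language = "dutch")

data.frame(word = dutch_words, english = stems_en, dutch = stems_nl)
```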

Exercise 3: Use tokenize_word_stems on some Dutch text. Use help to see how you can adjust the language of the stemmer.

example_dutch <- goodreads_mixed[14, "text"]
#help("tokenize_word_stems")
tokenize_word_stems(example_dutch, "dutch")
## [[1]]
##   [1] "hoewel"               "hij"                  "al"                  
##   [4] "op"                   "veertienjar"          "leeftijd"            
##   [7] "zijn"                 "eerst"                "boek"                
##  [10] "schref"               "begon"                "de"                  
##  [13] "zuid"                 "afrikan"              "deon"                
##  [16] "meyer"                "pas"                  "na"                  
##  [19] "zijn"                 "dertigst"             "serieus"             
##  [22] "over"                 "het"                  "schrijv"             
##  [25] "van"                  "boek"                 "na"                  
##  [28] "te"                   "denk"                 "niet"                
##  [31] "vel"                  "later"                "was"                 
##  [34] "zijn"                 "eerst"                "boek"                
##  [37] "een"                  "feit"                 "de"                  
##  [40] "thriller"             "wie"                  "met"                 
##  [43] "vur"                  "spel"                 "dat"                 
##  [46] "allen"                "in"                   "het"                 
##  [49] "afrikan"              "is"                   "uitgekom"            
##  [52] "zijn"                 "later"                "werk"                
##  [55] "werd"                 "allemal"              "wel"                 
##  [58] "vertaald"             "mar"                  "echt"                
##  [61] "bekend"               "werd"                 "hij"                 
##  [64] "met"                  "zijn"                 "serie"               
##  [67] "waarin"               "inspecteur"           "bennie"              
##  [70] "griessel"             "het"                  "hoofdpersonag"       
##  [73] "is"                   "de"                   "serie"               
##  [76] "begon"                "met"                  "duivelspiek"         
##  [79] "en"                   "kreg"                 "een"                 
##  [82] "vervolg"              "met"                  "13"                  
##  [85] "uur"                  "dat"                  "dor"                 
##  [88] "vn"                   "in"                   "2012"                
##  [91] "tot"                  "thriller"             "van"                 
##  [94] "het"                  "jar"                  "is"                  
##  [97] "uitgeroep"            "de"                   "amerikan"            
## [100] "vriendinn"            "erin"                 "russel"              
## [103] "en"                   "rachel"               "anderson"            
## [106] "reiz"                 "met"                  "een"                 
## [109] "groep"                "dor"                  "afrika"              
## [112] "in"                   "kaapstad"             "wordt"               
## [115] "een"                  "van"                  "hen"                 
## [118] "vermoord"             "en"                   "de"                  
## [121] "ander"                "achtervolgd"          "inspecteur"          
## [124] "bennie"               "griessel"             "krijgt"              
## [127] "de"                   "leiding"              "over"                
## [130] "het"                  "onderzoek"            "nar"                 
## [133] "de"                   "moord"                "mar"                 
## [136] "moet"                 "er"                   "ook"                 
## [139] "vor"                  "zorg"                 "dat"                 
## [142] "het"                  "gevlucht"             "meisj"               
## [145] "dat"                  "inmiddel"             "verdwen"             
## [148] "is"                   "levend"               "wordt"               
## [151] "teruggevond"          "ergen"                "ander"               
## [154] "in"                   "de"                   "stad"                
## [157] "wordt"                "in"                   "zijn"                
## [160] "woning"               "het"                  "licham"              
## [163] "van"                  "een"                  "man"                 
## [166] "gevond"               "en"                   "de"                  
## [169] "verdenk"              "valt"                 "al"                  
## [172] "snel"                 "op"                   "zijn"                
## [175] "alcoholistisch"       "vrouw"                "ook"                 
## [178] "dez"                  "zak"                  "zal"                 
## [181] "dor"                  "griessel"             "gecoordineerd"       
## [184] "moet"                 "word"                 "wat"                 
## [187] "hebb"                 "dez"                  "zak"                 
## [190] "gemen"                "en"                   "word"                
## [193] "ze"                   "op"                   "tijd"                
## [196] "opgelost"             "zoal"                 "de"                  
## [199] "titel"                "van"                  "het"                 
## [202] "boek"                 "al"                   "suggereert"          
## [205] "speelt"               "het"                  "verhal"              
## [208] "zich"                 "in"                   "dertien"             
## [211] "uur"                  "af"                   "die"                 
## [214] "kort"                 "tijdspann"            "zorgt"               
## [217] "er"                   "wel"                  "vor"                 
## [220] "dat"                  "het"                  "een"                 
## [223] "enorm"                "tempo"                "heeft"               
## [226] "en"                   "het"                  "verhal"              
## [229] "bij"                  "wijz"                 "van"                 
## [232] "sprek"                "voorbij"              "is"                  
## [235] "vor"                  "je"                   "er"                  
## [238] "erg"                  "in"                   "hebt"                
## [241] "omdat"                "meyer"                "zich"                
## [244] "dor"                  "dez"                  "opzet"               
## [247] "gen"                  "inleid"               "kon"                 
## [250] "permitter"            "ben"                  "je"                  
## [253] "als"                  "lezer"                "al"                  
## [256] "vanaf"                "de"                   "eerst"               
## [259] "bladzijd"             "bij"                  "het"                 
## [262] "verhal"               "betrok"               "je"                  
## [265] "zit"                  "er"                   "meten"               
## [268] "volop"                "in"                   "en"                  
## [271] "pas"                  "als"                  "het"                 
## [274] "boek"                 "is"                   "dichtgeslag"         
## [277] "kun"                  "je"                   "op"                  
## [280] "adem"                 "kom"                  "daarbij"             
## [283] "kenmerkt"             "de"                   "plot"                
## [286] "zich"                 "ook"                  "nog"                 
## [289] "een"                  "dor"                  "diver"               
## [292] "onverwacht"           "en"                   "dus"                 
## [295] "verrass"              "ontwikkel"            "het"                 
## [298] "zijn"                 "stuk"                 "vor"                 
## [301] "stuk"                 "ingredient"           "13"                  
## [304] "uur"                  "van"                  "een"                 
## [307] "flink"                "dosis"                "spanning"            
## [310] "hebb"                 "voorzien"             "het"                 
## [313] "verhal"               "bestat"               "uit"                 
## [316] "twee"                 "verhaallijn"          "die"                 
## [319] "van"                  "de"                   "moord"               
## [322] "op"                   "het"                  "ene"                 
## [325] "en"                   "de"                   "vlucht"              
## [328] "en"                   "verdwijn"             "van"                 
## [331] "het"                  "ander"                "meisj"               
## [334] "en"                   "ook"                  "die"                 
## [337] "van"                  "de"                   "moord"               
## [340] "op"                   "de"                   "man"                 
## [343] "die"                  "een"                  "bekend"              
## [346] "platenbas"            "blijkt"               "te"                  
## [349] "zijn"                 "gedur"                "de"                  
## [352] "plot"                 "wijst"                "niet"                
## [355] "erop"                 "dat"                  "beid"                
## [358] "verhal"               "met"                  "elkar"               
## [361] "te"                   "mak"                  "hebb"                
## [364] "dat"                  "hebb"                 "ze"                  
## [367] "in"                   "feit"                 "ook"                 
## [370] "niet"                 "mar"                  "in"                  
## [373] "de"                   "ontknop"              "blijk"               
## [376] "ze"                   "toch"                 "een"                 
## [379] "raakvlak"             "te"                   "hebb"                
## [382] "iet"                  "dat"                  "de"                  
## [385] "lezer"                "lang"                 "tijd"                
## [388] "niet"                 "ziet"                 "aankom"              
## [391] "de"                   "onderzoek"            "in"                  
## [394] "die"                  "afzonder"             "verhal"              
## [397] "word"                 "verricht"             "dor"                 
## [400] "een"                  "aantal"               "nieuwbak"            
## [403] "inspecteur"           "griessel"             "is"                  
## [406] "hun"                  "mentor"               "interessant"         
## [409] "personages"           "die"                  "als"                 
## [412] "ze"                   "ook"                  "nog"                 
## [415] "in"                   "de"                   "rest"                
## [418] "van"                  "de"                   "serie"               
## [421] "vor"                  "gan"                  "kom"                 
## [424] "allen"                "mar"                  "kunn"                
## [427] "groei"                "meyer"                "zou"                 
## [430] "meyer"                "niet"                 "zijn"                
## [433] "als"                  "hij"                  "ook"                 
## [436] "in"                   "13"                   "uur"                 
## [439] "gen"                  "maatschapp"           "problem"             
## [442] "nar"                  "vor"                  "brengt"              
## [445] "dit"                  "problem"              "is"                  
## [448] "niet"                 "prominent"            "aanwez"              
## [451] "mar"                  "het"                  "sluimert"            
## [454] "dor"                  "het"                  "verhal"              
## [457] "hen"                  "want"                 "hij"                 
## [460] "maakt"                "hel"                  "erg"                 
## [463] "duidelijk"            "dat"                  "het"                 
## [466] "land"                 "nog"                  "sted"                
## [469] "te"                   "kamp"                 "heeft"               
## [472] "met"                  "de"                   "gevolg"              
## [475] "van"                  "de"                   "apart"               
## [478] "ook"                  "de"                   "nog"                 
## [481] "sted"                 "aanwez"               "corruptie"           
## [484] "van"                  "politiefunctionariss" "maakt"               
## [487] "del"                  "uit"                  "van"                 
## [490] "het"                  "verhal"               "daarin"              
## [493] "heeft"                "het"                  "land"                
## [496] "nog"                  "sted"                 "gen"                 
## [499] "positiev"             "verander"             "ondergan"            
## [502] "dez"                  "problematiek"         "maakt"               
## [505] "het"                  "verhal"               "bijzonder"           
## [508] "realistisch"          "dat"                  "geldt"               
## [511] "overigen"             "ook"                  "vor"                 
## [514] "de"                   "beeldend"             "en"                  
## [517] "levend"               "beschrijv"            "die"                 
## [520] "de"                   "auteur"               "hanteert"            
## [523] "toch"                 "kan"                  "er"                  
## [526] "een"                  "klein"                "wat"                 
## [529] "kritisch"             "kantteken"            "word"                
## [532] "geplaatst"            "een"                  "van"                 
## [535] "de"                   "nieuw"                "inspecteur"          
## [538] "mbali"                "kaleni"               "is"                  
## [541] "nogal"                "for"                  "en"                  
## [544] "dar"                  "legt"                 "meyer"               
## [547] "som"                  "onnod"                "de"                  
## [550] "nadruk"               "op"                   "dat"                 
## [553] "weegt"                "echter"               "niet"                
## [556] "op"                   "teg"                  "het"                 
## [559] "verhal"               "op"                   "zich"                
## [562] "want"                 "dat"                  "stat"                
## [565] "als"                  "een"                  "huis"                
## [568] "zonder"               "enig"                 "vorm"                
## [571] "van"                  "twijfel"              "kan"                 
## [574] "word"                 "gesteld"              "dat"                 
## [577] "13"                   "uur"                  "een"                 
## [580] "thriller"             "van"                  "format"              
## [583] "is"

Making a bag of words

We have only worked on a single review so far, but we want to work on our entire dataset. We will use tidytext to work with our data. Here we reshape our goodreads_english dataframe: each review is split over multiple rows, so we only have one word per row.

goodreads_by_word <- goodreads_english %>%
  unnest_tokens(word, text)

head(goodreads_by_word)
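
To see what this reshaping does, it can help to try it on a tiny table first. The two reviews below are made up for illustration; the other columns (here only rating) are repeated for every word of the review.

```r
library(dplyr)
library(tidytext)

# Two made-up reviews
reviews <- tibble(rating = c(5, 1),
                  text = c("Loved it!", "Not for me."))

# One row per word: the text column is replaced by a word column
by_word <- reviews %>%
  unnest_tokens(word, text)

by_word
```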

Now we can use count to get the most common elements in the word column:

goodreads_counts <- goodreads_by_word %>%
  count(word, sort=TRUE)

goodreads_counts

Let’s try visualising our results in a wordcloud.

goodreads_counts %>%
  with(wordcloud(word, n, max.words=100))
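
Note that wordcloud places the words at random, so the picture looks different every time you run it. If you want a reproducible plot, one option is to set the random seed first. The sketch below uses a made-up frequency table; the seed value 42 is arbitrary.

```r
library(wordcloud)

# Made-up frequency table
toy_counts <- data.frame(word = c("book", "story", "characters"),
                         n = c(30, 12, 7))

# Fix the random seed so the layout is the same on every run
set.seed(42)

# random.order = FALSE plots the most frequent words first
with(toy_counts, wordcloud(word, n, min.freq = 1, random.order = FALSE))
```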

How many words are there in this dataset anyway? In text mining, we differentiate between types and tokens. The number of types tells us how many different words occur, and the number of tokens tells us how many words occur in total. For example, “to be or not to be” has 4 types (to, be, or, and not), and 6 tokens (to, be, or, not, to, and be).
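
We can check this example in R. The sketch below uses only base R and splits on spaces, which is good enough for this short phrase.

```r
# Split the phrase into words
words <- strsplit("to be or not to be", " ")[[1]]

# Tokens: every word counts, including repetitions
n_tokens <- length(words)          # 6

# Types: every distinct word counts once
n_types <- length(unique(words))   # 4

c(types = n_types, tokens = n_tokens)
```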

Exercise 4: How many types are in goodreads_counts? And how many tokens?

# number of types
nrow(goodreads_counts)
## [1] 30839
#number of tokens
sum(goodreads_counts['n'])
## [1] 1326809

Exercise 5: The unnest_tokens function that we used above takes the same arguments as tokenize_words: you can tweak whether you want to turn everything into lowercase and whether to clean up punctuation or numbers. Create a new table where you adjust one of these options, and report on how it affects the number of types and tokens.

goodreads_count_no_lowercasing <- goodreads_english %>%
  unnest_tokens(word, text, to_lower=FALSE) %>%
  count(word)
 
print('Number of types:')
## [1] "Number of types:"
print(nrow(goodreads_count_no_lowercasing))
## [1] 36137
print('Number of tokens:')
## [1] "Number of tokens:"
print(sum(goodreads_count_no_lowercasing['n']))
## [1] 1326809
# we can see that lowercasing does not affect the number of tokens, but it did lower the number of types

Remove stopwords

The wordcloud we made in the previous section is nice, but it contains a lot of generic words like “the”, “and”, etc. Let’s try removing stopwords.

We will use get_stopwords from the tidytext package to get a list of stopwords in English.

stopwords_english <- get_stopwords()
stopwords_english

Now, we want to exclude these from our table of words. We can use anti_join, which means that we cross-reference with another table, and remove everything that is also in the other table. Then, we use count again to get a new frequency table.

goodreads_without_stopwords <- goodreads_by_word %>%
  anti_join(stopwords_english, by='word') %>%
  count(word, sort=TRUE)

head(goodreads_without_stopwords)
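
If anti_join is new to you, here is a minimal sketch on two made-up tables. Every row of the first table whose word also occurs in the second table is dropped.

```r
library(dplyr)

# Made-up word table and a made-up list of words to exclude
words <- tibble(word = c("the", "book", "was", "great", "the"))
stops <- tibble(word = c("the", "was"))

# Keep only the rows of words that have no match in stops
kept <- words %>%
  anti_join(stops, by = "word")

kept
```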

This already looks a lot more informative: the most frequent words include book, read, story, and characters, which tell us something about what this dataset is about. We can make another wordcloud that does not include the stopwords:

goodreads_without_stopwords %>%
  with(wordcloud(word, n, max.words=100))

Exercise 6: We can use filter to select the reviews of a specific book, as in the code below. Make a word cloud for the reviews of Nineteen eighty-four and Gone girl and compare them side-by-side. Do they look very different?

goodreads_english %>%
  filter(book_title == "Nineteen eighty-four")

POS tagging

tidytext provides a table parts_of_speech which lists potential parts of speech for English words. We can use this for a very basic way of assigning a POS to each word, by looking up each word in the table. Here we do just that: left_join means that for each row in goodreads_by_word, we look up the corresponding row in parts_of_speech and combine the data.

goodreads_pos <- goodreads_by_word %>%
  left_join(parts_of_speech, by="word", multiple="first")

head(goodreads_pos)

Be aware that this is a very crude way of assigning POS tags. We usually want to take the rest of the sentence into account, so we can differentiate between two sentences like:

  • I want to read this book.
  • This book is a great read.

You can see this going wrong in the table above: “read” is classified as a noun instead of a verb. However, we can accept some inaccuracy for our purposes here.
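
To see where this comes from, we can look up a single word in parts_of_speech. The sketch below lists all the tags that the table offers for “read”, and then repeats the join for just that word; with multiple="first", only the first matching row is kept.

```r
library(dplyr)
library(tidytext)

# All candidate tags that parts_of_speech lists for "read"
read_tags <- parts_of_speech %>%
  filter(word == "read")

read_tags

# The same lookup as a join: multiple = "first" keeps only the first match
first_tag <- tibble(word = "read") %>%
  left_join(parts_of_speech, by = "word", multiple = "first")

first_tag
```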

Exercise 7: Try using filter to get a table containing only adjectives. Then filter on the rating so you can compare the adjectives used in the highest ratings (5) with those used in the lowest ratings (1). Make a wordcloud for both.

goodreads_adjectives <- goodreads_pos %>%
  filter(pos == "Adjective")
goodreads_adjectives %>%
  filter(rating == 5) %>%
  count(word, sort=TRUE) %>%
  with(wordcloud(word, n, max.words=100))

goodreads_adjectives %>%
  filter(rating == 1) %>%
  count(word, sort=TRUE) %>%
  with(wordcloud(word, n, max.words=100))

Mixed-language data

goodreads_mixed contains reviews in a mix of languages. We can see how many reviews we have for each:

goodreads_mixed %>%
  count(language, sort=TRUE)

Exercise 8: Filter the Spanish reviews (or pick another language, if you like). Go through the preprocessing steps we have covered so far and visualise the results. Think about which steps would need to be adjusted based on the language, and adjust them accordingly.

# filter reviews
goodreads_spanish <- goodreads_mixed %>%
  filter(language == "Spanish")

head(goodreads_spanish)
# tokenise
goodreads_spanish_words <- goodreads_spanish  %>%
  unnest_tokens(word, text)

head(goodreads_spanish_words)
# remove stopwords
goodreads_spanish_words_clean <- goodreads_spanish_words %>%
  anti_join(get_stopwords(language="es"), by="word")
goodreads_spanish_words_clean %>%
  count(word) %>%
  with(wordcloud(word, n, max.words=100))
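
Stemming is another step that depends on the language. As a possible extension (a sketch, not part of the original solution), we can combine the Spanish stopword list with the Spanish stemmer from SnowballC. The four words below are made up for illustration, and get_stopwords needs the stopwords package to be installed.

```r
library(dplyr)
library(tidytext)
library(SnowballC)

# A few made-up Spanish words, one per row
spanish_words <- tibble(word = c("los", "libros", "son", "maravillosos"))

# Remove Spanish stopwords, then stem with the Spanish stemmer
spanish_stems <- spanish_words %>%
  anti_join(get_stopwords(language = "es"), by = "word") %>%
  mutate(stem = wordStem(word, language = "spanish"))

spanish_stems
```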