Introduction

Welcome to the first practical of the course “Introduction to Text Mining with R”.

In the practicals we will get hands-on experience with the materials in the lectures by programming in R and completing assignments. In this practical, we will work with different formats of text data, regular expressions and text visualizations.

The practicals always start with the packages we are going to use. Be sure to run these lines in your session to load their functions before you continue. If there are packages that you have not yet installed, first install them with install.packages().

library(plyr)      # for ddply; load before dplyr so it does not mask dplyr functions
library(dplyr)     # for data manipulation
library(magrittr)  # for pipes
library(tidyverse) # for tidy data and pipes
library(ggplot2)   # for visualization
#library(xlsx)      
#library(readxl)    
library(openxlsx)  # for working with excel files
library(qdap)      # provides parsing tools for preparing transcript data; may need https://www.java.com/en/download/manual.jsp
library(wordcloud) # to create pretty word clouds
library(stringr)   # for working with regular expressions
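
If you are unsure which of these packages are already installed, a check along the following lines may help. This is only a sketch: the package vector simply repeats the names loaded above, missing_pkgs is a name chosen for illustration, and the install call is left commented out so that nothing is installed by accident.

```r
# packages used in this practical (names taken from the library() calls above)
pkgs <- c("plyr", "dplyr", "magrittr", "tidyverse", "ggplot2",
          "openxlsx", "qdap", "wordcloud", "stringr")

# keep only the packages that are not found in the local library
missing_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
missing_pkgs

# uncomment to install whatever is missing
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```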

Before starting the exercises, you need to set your working directory to your Practicals folder (or create a project folder). To this end, you can either create a project in RStudio and move the Rmd and data files to the project folder, or you can use the following line instead of creating a new project:

# setwd("drivename:/folder1/folder2/r_tm/monday/Practicals/")
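
Whichever option you choose, it can be worth checking that R actually sees the data files before you continue. The sketch below assumes the three file names used later in this practical; file_status is a name chosen for illustration.

```r
# show the current working directory
getwd()

# the data files used in this practical
data_files <- c("blog-gender-dataset.xlsx", "IMDB Dataset.csv", "computer.txt")

# TRUE for every file that is found in the working directory
file_status <- file.exists(data_files)
names(file_status) <- data_files
file_status
```

If one of the entries is FALSE, the working directory probably does not point to the folder that holds the data.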

Reading text data: Excel files


  1. Use the read.xlsx function from the openxlsx package to read the blog-gender-dataset.xlsx file. Alternatively, you can use the read.xlsx function from the xlsx package or the read_excel function from the readxl package.

The dataset is already located in the practical folder; it is also available online.

# read the first sheet and keep only the blog text and the gender label
blog_gender_data <- read.xlsx("blog-gender-dataset.xlsx", 1) %>% 
    select(blog, gender)

  1. Randomly sample 1000 of the blogs into a new variable and name it blog_gender_1000. Hint: use the sample_n function. Convert the object to a tibble.

To learn more about the application of a function or feature in R, you can use one of the two following help commands:

# help(your_function_name)
# ?your_function_name
set.seed(123)
blog_gender_1000 <- sample_n(blog_gender_data, 1000)
#blog_gender_1000 <- slice_sample(blog_gender_data, n = 1000) #same as the line above
blog_gender_1000 <- as_tibble(blog_gender_1000)
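
The call to set.seed is what makes the random sample reproducible. A minimal sketch on made-up data (toy, s1 and s2 are illustrative names, not part of the practical) shows that the same seed gives the same sample:

```r
library(dplyr)

# a small made-up dataset standing in for the blog data
toy <- tibble(id = 1:20, text = paste("blog", 1:20))

set.seed(123)
s1 <- slice_sample(toy, n = 5)

set.seed(123)                      # same seed, so the same rows are drawn
s2 <- slice_sample(toy, n = 5)

identical(s1, s2)
```

Without the second set.seed call, the two samples would in general differ.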

  1. Use the head, tail, and View functions to check the dataset. What are the differences between these functions?

head(blog_gender_1000)    
tail(blog_gender_1000)
# View(blog_gender_1000)
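
As a quick illustration on made-up data (toy is a hypothetical data frame), head and tail print the first and last rows to the console, and both take an n argument to control how many rows are shown. View, by contrast, opens the whole object in a separate viewer.

```r
# a small made-up data frame
toy <- data.frame(id = 1:10, letter = letters[1:10])

first_rows <- head(toy, n = 3)   # rows 1 to 3
last_rows  <- tail(toy, n = 3)   # rows 8 to 10

first_rows
last_rows
# View(toy)                      # opens a spreadsheet-like viewer in RStudio
```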

  1. wordcloud is a function from the wordcloud package that plots word clouds based on word frequencies in the given dataset. Use this function to plot the 50 most frequent words, each occurring at least 10 times. Also use the exposition pipe operator %$%.

The %$% pipe exposes the columns of a dataset, so that we can refer to them directly by name.

set.seed(123)
blog_gender_1000 %$% wordcloud(blog, min.freq = 10, max.words = 50, random.order = FALSE,
                               colors = brewer.pal(8, "Dark2"))
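
To see what the exposition pipe does without plotting anything, consider this small sketch on made-up data (toy, r_pipe and r_dollar are illustrative names). Both calls compute the same correlation; the first one refers to the columns directly.

```r
library(magrittr)

# a small made-up data frame with two perfectly correlated columns
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 6, 8))

r_pipe   <- toy %$% cor(x, y)     # columns exposed by %$%
r_dollar <- cor(toy$x, toy$y)     # the same call written with $

all.equal(r_pipe, r_dollar)
```

The ordinary pipe %>% would not work here, because it passes the whole data frame as the first argument instead of exposing its columns.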


Reading text data: CSV files


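  1. Use the read_csv function from the readr package (loaded with the tidyverse) to read the IMDB Dataset.csv file. Convert the object into a tibble. Base R's read.csv works as well, but it returns a plain data frame.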

The IMDB dataset contains 50K movie reviews for natural language processing or text analytics. You can also access the dataset from the Kaggle website.

imdb_data <- read_csv("IMDB Dataset.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   review = col_character(),
##   sentiment = col_character()
## )
imdb_data <- as_tibble(imdb_data)
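
Review texts often contain commas, which are also the field separator in a CSV file. The following sketch, which uses a temporary file and made-up reviews (all names are illustrative), suggests why this is not a problem as long as the text fields are quoted:

```r
# write a tiny made-up dataset to a temporary CSV file
tmp_csv <- tempfile(fileext = ".csv")
toy <- data.frame(review    = c("Loved it, twice", "Too long"),
                  sentiment = c("positive", "negative"))
write.csv(toy, tmp_csv, row.names = FALSE)   # character fields are quoted

# read it back; the comma inside the first review stays part of the text
toy_read <- read.csv(tmp_csv, stringsAsFactors = FALSE)
str(toy_read)

unlink(tmp_csv)                              # clean up the temporary file
```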

  1. Use the head and tail functions to take a look at the dataset.

head(imdb_data)
tail(imdb_data)

  1. Randomly sample 500 of the reviews into a new variable and name it imdb_500. Make sure that your sample contains an equal number of positive and negative reviews (250 each).

set.seed(123)
imdb_500 <- ddply(imdb_data, "sentiment", .fun = function(x) {sample_n(x, size = 250)})
#imdb_500 <- ddply(imdb_data, "sentiment", .fun = function(x) {slice_sample(x, n = 250)})
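
The same kind of balanced sample can probably also be drawn with dplyr alone, by grouping before sampling. The sketch below uses made-up data and a smaller sample size; toy_reviews and toy_sample are illustrative names, and this is an alternative to the ddply call rather than the solution used in this practical.

```r
library(dplyr)

# made-up reviews: 10 positive and 10 negative
toy_reviews <- tibble(review    = paste("review", 1:20),
                      sentiment = rep(c("positive", "negative"), each = 10))

set.seed(123)
toy_sample <- toy_reviews %>%
  group_by(sentiment) %>%      # sample within each sentiment class
  slice_sample(n = 3) %>%      # 3 rows per class
  ungroup()

table(toy_sample$sentiment)
```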

  1. Plot the word clouds of positive and negative reviews in your sample separately and compare the visualizations.

set.seed(123)
imdb_500 %>% 
  filter(sentiment == "positive") %$% 
  wordcloud(review, min.freq = 20, 
                       max.words = 50, random.order = FALSE, 
                       colors = brewer.pal(8, "Dark2"))

set.seed(123)
imdb_500 %>% 
  filter(sentiment == "negative") %$% 
  wordcloud(review, min.freq = 20, 
                       max.words = 50, random.order = FALSE, 
                       colors = brewer.pal(8, "Dark2"))


  1. Use the freq_terms function from the qdap package to find the top 30 terms in the positive reviews and in the negative reviews in imdb_500.

frequent_terms_pos <- imdb_500 %>% 
  filter(sentiment == "positive") %$% 
  freq_terms(review, top = 30)

frequent_terms_neg <- imdb_500 %>% 
  filter(sentiment == "negative") %$% 
  freq_terms(review, top = 30)
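
To get a feeling for what a term frequency count involves, here is a rough base R sketch on two made-up sentences. It is not how freq_terms is implemented, and its tokenization is a simplifying assumption (lowercase everything, split on anything that is not a letter or an apostrophe), so the counts on real data may differ from those of qdap.

```r
# two made-up reviews
toy_reviews <- c("A great movie with a great cast", "Not a great plot")

# lowercase, split into words, and drop empty strings
words <- unlist(strsplit(tolower(toy_reviews), "[^a-z']+"))
words <- words[words != ""]

# count the words and sort from most to least frequent
word_counts <- sort(table(words), decreasing = TRUE)
head(word_counts, 3)
```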

  1. Using ggplot, plot a bar chart for each of the frequent terms dataframes.


ggplot(frequent_terms_pos, aes(x = reorder(WORD, FREQ), y = FREQ)) + 
  geom_bar(stat = "identity") + 
  coord_flip()                + 
  xlab("Word in Corpus")      + 
  ylab("Count")               + 
  theme_minimal()


ggplot(frequent_terms_neg, aes(x = reorder(WORD, FREQ), y = FREQ)) + 
  geom_bar(stat = "identity") + 
  coord_flip()                + 
  xlab("Word in Corpus")      + 
  ylab("Count")               + 
  theme_minimal()

The ggplot2 package offers far greater flexibility in data visualization than the standard plotting devices in R. It has its own language, in which a graph is built up from layers that you add with additional commands. To keep these layers clearly visible, it is advisable to follow the plotting language conventions, for example by putting each layer on its own line. All ggplot2 documentation can be found at http://docs.ggplot2.org/current/
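
The idea of layers can be sketched on a small made-up data frame (toy_terms is a hypothetical stand-in for the frequent terms dataframes, with the same column names). The plot is stored in an object and extended step by step; geom_col is used here as a shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

# made-up term frequencies
toy_terms <- data.frame(WORD = c("film", "movie", "story"),
                        FREQ = c(12, 20, 7))

# start with the data, the aesthetics, and one geometry layer
p <- ggplot(toy_terms, aes(x = reorder(WORD, FREQ), y = FREQ)) +
  geom_col()

# extend the stored plot with further commands
p <- p +
  coord_flip() +
  labs(x = "Word in Corpus", y = "Count") +
  theme_minimal()

p
```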


Reading text data: TXT files


  1. Use the readLines function to read data from the ‘computer.txt’ file.

computer.txt is an annotated dataset for aspect-based sentiment analysis. It is also available online.

computer_data <- readLines("computer.txt")

  1. Convert the data to a dataframe and name it computer_531.

computer_531 <- data.frame(computer_data)
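
readLines returns a character vector with one element per line of the file, and data.frame turns that vector into a single column. A small sketch with a temporary file (all names are illustrative) shows both steps:

```r
# write three made-up lines to a temporary text file
tmp_file <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp_file)

# one vector element per line
toy_lines <- readLines(tmp_file)
length(toy_lines)

# one row per line, in a single character column called text
toy_df <- data.frame(text = toy_lines, stringsAsFactors = FALSE)
str(toy_df)

unlink(tmp_file)   # clean up the temporary file
```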

As you may have noticed, we have worked only with tibbles and dataframes so far. In addition to these, we will also use the VCorpus data type.


  1. Use the head and tail functions to take a look at the dataset.

head(computer_531)
tail(computer_531)

Regular expressions


  1. Remove all objects from the RStudio environment except computer_531, imdb_500, and blog_gender_1000.

# using regular expressions
rm(list = ls()[grep("data", ls())])
rm(list = ls()[grep("^fre", ls())])
gc() # "garbage collector" to free up memory of removed variables
##           used (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 1783038 95.3    3670955 196.1  3670955 196.1
## Vcells 3295401 25.2   16423800 125.4 16423767 125.4

# or
# rm(blog_gender_data, computer_data, frequent_terms_pos, frequent_terms_neg, imdb_data)

# or remove everything except the objects you want to keep
# rm(list=setdiff(ls(), c("computer_531", "imdb_500", "blog_gender_1000")))
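
If you would like to try out such patterns without touching your own workspace, you could experiment in a separate environment first. In the sketch below, demo_env and the objects in it are made up for illustration; the pattern combines the two regular expressions used above.

```r
# a throwaway environment with three dummy objects
demo_env <- new.env()
assign("imdb_data", 1, envir = demo_env)
assign("imdb_500", 2, envir = demo_env)
assign("frequent_terms_pos", 3, envir = demo_env)

# names that contain "data" or start with "fre"
to_remove <- ls(envir = demo_env, pattern = "data|^fre")
to_remove

# remove them and check what is left
rm(list = to_remove, envir = demo_env)
ls(envir = demo_env)
```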

  1. Use regular expressions to find the number of reviews in imdb_500 that contain the words “Action” or “action”, “Comedy” or “comedy”, and “Drama” or “drama” (each genre separately).

reviews_act <- grep("[Aa]ction", imdb_500$review)
reviews_com <- grep("[Cc]omedy", imdb_500$review)
reviews_dra <- grep("[Dd]rama", imdb_500$review)

genres    <- c("Action", "Comedy", "Drama")
nr_genres <- c(length(reviews_act), length(reviews_com), length(reviews_dra))

tibble(genres, nr_genres)
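
A few details of these patterns are easy to overlook, so here is a small sketch on made-up reviews (all object names are illustrative). A character class such as [Cc] only covers the first letter, ignore.case = TRUE covers every letter, and without word boundaries a pattern also matches inside longer words.

```r
# made-up reviews
toy_reviews <- c("Action packed from start to finish",
                 "A slow drama with little action",
                 "A COMEDY classic",
                 "Great comedy")

n_action     <- sum(grepl("[Aa]ction", toy_reviews))   # reviews 1 and 2
n_comedy     <- sum(grepl("[Cc]omedy", toy_reviews))   # "COMEDY" is not matched
n_comedy_any <- sum(grepl("comedy", toy_reviews, ignore.case = TRUE))

# "reaction" contains "action"; \\b restricts the match to whole words
in_word    <- grepl("[Aa]ction", "a strong reaction")
whole_word <- grepl("\\b[Aa]ction\\b", "a strong reaction", perl = TRUE)

c(n_action, n_comedy, n_comedy_any)
c(in_word, whole_word)
```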

  1. Use regular expressions to find the reviews in imdb_500 that contain the word “Comedy”. Hint: set the value argument of the grep function to TRUE.

grep("Comedy", imdb_500$review, value = TRUE)
## [1] "*****WARNING, MAY CONTAIN SPOILERS WHICH WILL BE MORE ENTERTAINING THAN THIS TRIPE.**** <br /><br />Heres some good advise to anyone living in the U.K. Whenever Channel 5 has an old 80's comedy on late at night, read a book instead. I am currently in the process of recovering from a seizure, due to reading some of the comments on this film on here. I am actually shocked at the fact that someone actually said this film was realistic! All I can say is thank god the Cold War never escalated or else we might as well have given the Commie's our borders... I found this film dire in the utmost pretence, maybe it is just my British perception of what makes a film funny, who knows? But in all aspects, this film is not just awful, its teeth grindingly terrible.<br /><br />I've never been a fan of Bill Murray, and its rubbish like this that justify my feelings towards him. Don't get me wrong, I loved Ghostbuster's, which was made only three years after this film. But this just sums Bill Murray up really. I can safely say that I haven't wasted my time so blatantly like this since seeing the first running of Operation Delta Force over here, though these two films have more in common than you would think. For 1 thing, they both have terrible action sequences from beginning to end, and 2nd. They are both riddled with cheesy Cliché's, throughout.<br /><br />Heres one thing, these guys are supposed to be in the \"U.S Army\". Yet they are allowed to wallow around their Camp, Willy nilly, seducing female Military Police Officers, and subsequently shagging them silly in the Generals Quarters. Talk about Random! This film is just terrible for this I'm afraid. Now don't get me wrong, I'm no feminist sympathiser, but the fact that these two women actually fall hands over heels in love with the two characters shortly after arresting them, letting them go free... Twice, is just insulting to the female race. 
The fact that one tatty haired, fat lipped bum (Winger) and his hapless sidekick Ramis can simply sweet talk themselves into into the MP's underwear, to which they fall madly in love with the two of them is nothing short of ludicrous.<br /><br />Then there is the training scenes, where you get to meet the Squad \"Phycho\" who unconvincingly threatens to kill anyone who touches him or his stuff, followed by the overweight bloke (played by the late and great John Candy) who claims he joined the army to \"avoid paying $400 for anger management classes\". Leading to loud mouthed Murray paying tribute to the \"Giant Toe,\" (WTF?) 'Drill Seargent' who honestly couldn't organise a pi$$ up in a brewery, let alone his band of recruits. All this scene serves to do is to prelude loads of fight scenes, with people saying \"way to go ass hole'!\" all the time, etc etc.<br /><br />The scenes then carry on showing the rag tag bunch making utter tits of themselves on the Assault Course, leading to a scene where one of them shoots wildly into the air at some passing birds with an assault rifle, peppering a watch tower with bullets. (Just like that. Yep, told you this film was random...He miraculously escapes undisciplined as well...) Eventually Leading up the the passing out parade, where the hapless squad make a magic turn around within the space of two hours. (Bugger me, Miracle!) Thanks to some wise words from Murray, to which they then direct a massively none military like dance routine in front of a Geriatric 'General' in front of the rest of the squads. All of this to the immense pleasure of their two Girlfriends on the stand, who really should've been arresting them... Everyone laughs it off though. This bit is nothing short of amazing though. 
He then chooses them to guard a new Multi-Million Dollar Prototype Armoured Vehicle in Italy (which turns out to be just a mobile home painted green with loads of gadgets on the inside), claiming \"This is exactly what this Army needs!\" righto...<br /><br />Then there is the dire finale, where Murray and Ramis decide to steal this top-secret prototype Military Vehicle to pick up their newly acquired and somewhat Hyperactive MP Girlfriends in Germany. To which the Hapless Captain (John Larroquette) then finds out and leads the Squad of fresh recruits on a retrieval Mission for this vehicle. To which they then take a \"wrong turn\" en-route and end up in Soviet Held Czechoslovakia, where they are captured. (Like we didn't see that coming...) Thus begins a rescue attempt by Ramis and Murray + Birds in hand, to which is where a big fight, loads of shooting from the hip and blowing tanks up. With them coming back as National Heroes, humiliating the Russians by calling them \"pussies,\" etc etc. The end. Thats right. No Courts Martial, nothing. They only just stole a prototype Military vehicle, drove it into a Warsaw Pact country and almost caused an International incident which could've sparked WW3!<br /><br />This film is honestly more fun that being diagnosed with a terminal illness. I know its meant to be a Comedy, it got all the right actors for it, but where in the hell is it? Have Channel 5 cuts those bits out? The only redeeming feature in this film is the repetitive use of naked women taking showers, and female Mud Wrestling. (like I said, Random) Not that it helps to divert from the fact that this is an utterly crap film, of course. This film should realistically be aimed at immature 9 year old's, sadly, we have to watch it instead. 1 star out of 10 - Total Tripe. My advice, do something a little more useful with your time. Like Castrating yourself..."
## [2] "4 realz son my game iz mad tite yo I cant wait 2 get on dis show and roll up in da club n do it real 905 style wit mad models n bottles, son!<br /><br />No, I'm just kidding. This is a sad show, created by, and for the enjoyment of, sad men. Men who are so neutered by modern existence that they channel their frustration into the clubs, where they eke out fleeting self-validation preying on chicks in hopes of getting their little wieners touched to try and dull the sting of loneliness and make them feel, even if just for one night, as though their seat on the Board of the Sausage Party of Toronto is a little less permanent. <br /><br />I read some comments on here saying that this show represents Canadian TV's finally stepping up to stand on a par with American TV or somethingorother. Well, that's not aiming short at all. It's like, Yes! Pat yourself on the back, Canada -- you've finally cracked the elusive formula for such groundbreaking American content as \"Studs\", \"Change of Heart\", \"Elimidate\" and \"The Fifth Wheel\". See, the real brainchild here is tacking \"...meets Candid Camera\" onto the pitch. Genius. And there's nothing that straddles that thin line between fratboy camaraderie and latent homosexuality like a group of grown men taping each other on hidden camera, admiring each other's \"game\" up in the club. The man-love on display here is so palpable they should really consider rechristening it \"Keys to the Steam Bath\".<br /><br />On a side note, how interesting that the folks who gave this show such glowing reviews seem to have registered an IMDb account for the express purpose of doing so (I guess I'm guilty of employing the same means to do the opposite here.) My personal favorite is the one enthusiastic reviewer that claims to hail from the \"United States\" who gushes that \"Now it's clear that the talent in Canada has the ability to produce American quality television.\" <br /><br />Smooth. 
<br /><br />But why even bother manufacturing online buzz? You can't really get cancelled, after all -- you're on the Comedy Network in Canada, baby! The viewing public will go on ignoring your show for years to come. In all likelihood you'll be just fine, coasting comfortably along that proverbial plain of mediocrity with the majority of the Comedy Network's original programming."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
## [3] "Let me start by saying I don't recall laughing once during this comedy. From the opening scene, our protagonist Solo (Giovanni Ribisi) shows himself to be a self-absorbed, feeble, and neurotic loser completely unable to cope with the smallest responsibilities such as balancing a checkbook, keeping his word, or forming a coherent thought. I guess we're supposed to be drawn to his fragile vulnerability and cheer him on through the process of clawing his way out of a deep depression. I guess we're supposed to sympathize as he stumbles through a series of misadventures seemingly triggered by his purchase of a dog, but in reality brought on by his own contemptible nature. I didn't get the slightest hint at any point that Solo ever possessed any redeeming character, which became disturbingly apparent when he failed to feed his dog for a few days. No spark of humanity or glimmer of conscience gave me hope that he would ever realize his life is so utterly miserable because he's a self- absorbed, self-pitying lowlife. I didn't develop any connection with this character. He didn't seem to care, and so neither did I. I actually wanted him to get his kneecaps busted at one point.<br /><br />The dog was not a character in the film. It was simply a prop to be used, neglected, scorned, abused, coveted and disposed of on a whim. So be warned. Even though \"dog\" is in the title, this film is not a romantic comedy for dog lovers.<br /><br />Scott Caan's role is amusing and believable as the oversexed best friend/cad. Don Cheadle is sincere and magnetic - I always want to see more of him on screen. Mena Suvari was delightfully repellent. Lynn Collins role of a \"stripper with a heart\" was well acted, but the character was simultaneously absurd and clichéd, not to mention there was zero chemistry between her and Ribisi.<br /><br />Romantic? Hardly. Comedy? If you say so."                                                                                                          
## [4] "You got to go and dig those holes. Holes only leaves troble, which makes a movie so good. Disney has done it again.Shia LaBeouf should be nominated for Best Actor for his performance as Stanley Yelnats. He has alredy won the Daytime Emmy for Best Actor in a Comedy Series (Even Stevens). Holes is one of the best movies in 2003."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
## [5] "After the success of the second instalment, Richard Curtis and Ben Elton decided that Blackadder should have a third appearance. This time instead of Tudor times or Elizabethan times, Edmund Blackadder (BAFTA nominated Rowan Atkinson) is living in the time of the French Revolution. Accompanied by the now stupid but lovable Baldrick (Tony Robinson) Blackadder is the \"faithful\" butler to George, the Prince Regent of Wales (Hugh Laurie). Throughout this third series to the wonderfully written sitcom Blackadder tries everything he can to get rich and powerful. He tries electing a lord for a rotten borough, tries to sell a book, tries to win a bet about The Scarlet Pimpernel, tries to be a highway man and finally poses as the Prince. This is a very good instalment to the popular comedy. Includes appearances from Robbie Coltrane, Tim McInnerny, Miranda Richardson and Stephen Fry. It won the BAFTA for Best Comedy Series, and it was nominated for Best Design and Best Make Up. Rowan Atkinson was number 18 on The 50 Greatest British Actors, he was number 24 on The Comedians' Comedian, and he was number 8 on Britain's Favourite Comedian, Edmund Blackadder was number 3 on The 100 Greatest TV Characters, and he was number 3 on The World's Greatest Comedy Characters, and Blackadder (all four series) was number 2 on Britain's Best Sitcom. Outstanding!"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
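
The effect of the value argument is easier to see on a short made-up vector (toy_reviews, idx and hits are illustrative names). By default grep returns the positions of the matching elements; with value = TRUE it returns the elements themselves.

```r
# made-up reviews
toy_reviews <- c("Best Comedy Series", "a dark comedy", "pure drama")

idx  <- grep("Comedy", toy_reviews)                 # positions of the matches
hits <- grep("Comedy", toy_reviews, value = TRUE)   # the matching texts

idx
hits
```

Only the first review is returned, because the pattern is case sensitive and the second review spells the word in lowercase.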
## [6] "A very funny movie. Michael Douglas' \"do\" is worth watching this flick for if for no other reason. I'd like to see him do more of these low life roles. He was terrific, as were all the performers.<br /><br />The film struck me right off as an American Roshomon, only funnier and easier to watch because it was in American and didn't need no stinkin subtitles!<br /><br />In a funny movie with a laugh every minute or so, two of the best were with John Goodman (not someone I am crazy about) - 1. He is telling the priest about Jewel doing something he liked and says \"I had to wipe the smile off my face.\" The visual shows he is not smiling and clearly is a guy who never smiles, but probably doesn't know it. 2. The scene at the end between Goodman, all suited up for Jewel in his cop uniform, and grappling with the be-leathered Reiser hunched over a table... and the two of them then protesting that they are not gay to another character who happens on the scene - this alone deserved a special Comedy Academy Award."

# Alternatively, subset the vector with the indices that grep() returns instead of setting value = TRUE

The grep() and grepl() functions have some limitations. They tell you which strings in a character vector match a certain pattern, but they do not tell you where exactly the match occurs or, for more complicated regular expressions, what the matched text is.
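As a small illustration of what these two functions do and do not return, here is a sketch on a toy vector. The vector x and its contents are made up for this example and are not part of the practical data.

```r
# Toy character vector (hypothetical example data)
x <- c("I love R", "no capitals here", "NASA launched it")

# grep() returns the positions of the elements that contain a match
idx <- grep("\\b[A-Z]+\\b", x)
idx

# grepl() returns one TRUE/FALSE value per element
hit <- grepl("\\b[A-Z]+\\b", x)
hit

# Setting value = TRUE and indexing with the returned positions
# give the same strings
same <- identical(grep("\\b[A-Z]+\\b", x, value = TRUE), x[idx])
same
```

Neither result says which word matched or where it sits inside the string, which is what regexpr() adds.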


  1. Use the regexpr function to check if there is a fully capitalized word in each of the first 20 blogs in the blog_gender_1000 data.

r <- regexpr('\\b[A-Z]+\\b', blog_gender_1000[1:20,]$blog) #\\b marks word boundary
r
##  [1]  -1  10   6  52 164  10  23 274 149 183   5  -1   9  74 514 314  11   1   1
## [20] 180
## attr(,"match.length")
##  [1] -1  1  1  1  1  1  1  1  1  1  1 -1  2  1  1  2  1  1  1  3
regmatches(blog_gender_1000[1:20,]$blog, r)
##  [1] "I"   "D"   "I"   "I"   "I"   "I"   "U"   "I"   "T"   "I"   "UD"  "I"  
## [13] "I"   "TI"  "I"   "I"   "I"   "UTI"

The regexpr() function gives you (a) the index in each string where the first match begins and (b) the length of that match. A value of -1 means that the string contains no match.
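To see how the start index and the match length fit together, the sketch below rebuilds the matched text by hand with substr(). The vector x is a made-up example; in practice regmatches() does this step for you.

```r
# Toy character vector (hypothetical example data)
x <- c("I love R", "no capitals here", "we use NASA data")

m <- regexpr("\\b[A-Z]+\\b", x)

start <- as.integer(m)                       # start of the first match, -1 if none
len   <- as.integer(attr(m, "match.length")) # length of that match, -1 if none

# Cut the match out of each string; strings without a match become NA
found <- ifelse(start == -1, NA_character_, substr(x, start, start + len - 1))
found
```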

The str_extract function from stringr serves the same purpose. Unlike regmatches, it returns NA for strings without a match, so the result stays aligned with the input:

str_extract(blog_gender_1000[1:20,]$blog, '\\b[A-Z]+\\b')
##  [1] NA    "I"   "D"   "I"   "I"   "I"   "I"   "U"   "I"   "T"   "I"   NA   
## [13] "UD"  "I"   "I"   "TI"  "I"   "I"   "I"   "UTI"
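The two outputs above differ in length because regmatches() silently drops strings without a match. If you want a base R result that lines up with the input, one possible approach is sketched below on a made-up vector x.

```r
library(stringr)

# Toy character vector (hypothetical example data)
x <- c("no capitals here", "I love R", "we use NASA data")

m <- regexpr("\\b[A-Z]+\\b", x)

# regmatches() keeps only the strings that matched
base_hits <- regmatches(x, m)

# str_extract() keeps one element per input string
stringr_hits <- str_extract(x, "\\b[A-Z]+\\b")

# Base R version aligned with the input: start from NA and fill in the matches
aligned <- rep(NA_character_, length(x))
aligned[as.integer(m) != -1] <- base_hits
aligned
```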

  1. regexpr and str_extract only return the first match in each string (reading left to right). In contrast, gregexpr and str_match_all return all of the matches in a given string if there is more than one. Use either of these functions to find all fully capitalized words in each of the first 10 blogs in the blog_gender_1000 data.

r <- gregexpr('\\b[A-Z]+\\b', blog_gender_1000[1:10,]$blog)
regmatches(blog_gender_1000[1:10,]$blog, r)
## [[1]]
## character(0)
## 
## [[2]]
## [1] "I"   "BSC" "I"   "I"   "I"   "I"   "I"   "I"  
## 
## [[3]]
## [1] "D"    "I"    "MUST" "I"    "D"    "I"   
## 
## [[4]]
##  [1] "I"         "I"         "A"         "A"         "A"         "DON"      
##  [7] "T"         "WRITE"     "HEADLINES" "IN"        "ALL"       "CAPS"     
## [13] "BREAKING"  "I"         "I"        
## 
## [[5]]
##  [1] "I"      "I"      "A"      "B"      "A"      "B"      "I"      "I"     
##  [9] "I"      "WILL"   "YOU"    "I"      "I"      "I"      "I"      "I"     
## [17] "I"      "I"      "I"      "I"      "I"      "I"      "I"      "I"     
## [25] "I"      "I"      "I"      "I"      "I"      "I"      "I"      "HONEST"
## 
## [[6]]
## [1] "I" "I" "I"
## 
## [[7]]
##  [1] "I"    "M"    "NOT"  "AT"   "WORK" "I"    "I"    "I"    "I"    "I"   
## [11] "I"    "WHY"  "I"    "I"    "I"   
## 
## [[8]]
## [1] "U"   "MBA" "MBA" "P"  
## 
## [[9]]
##   [1] "I"           "I"           "I"           "I"           "I"          
##   [6] "I"           "I"           "I"           "I"           "A"          
##  [11] "GROUP"       "OF"          "DIVERSE"     "CREATIVE"    "INSPIRING"  
##  [16] "WOMEN"       "ME"          "I"           "I"           "I"          
##  [21] "I"           "I"           "I"           "I"           "I"          
##  [26] "I"           "I"           "TRUE"        "I"           "I"          
##  [31] "I"           "I"           "I"           "I"           "I"          
##  [36] "I"           "I"           "O"           "MOM"         "ME"         
##  [41] "I"           "I"           "I"           "WAIT"        "FOR"        
##  [46] "IT"          "THE"         "FATAL"       "BLUE"        "SCREEN"     
##  [51] "ERROR"       "I"           "WTF"         "I"           "AND"        
##  [56] "SYSTEM"      "HAS"         "EXPERIENCED" "A"           "FATAL"      
##  [61] "DISC"        "ERROR"       "I"           "IT"          "I"          
##  [66] "MORE"        "I"           "I"           "I"           "I"          
##  [71] "BAD"         "I"           "O"           "I"           "I"          
##  [76] "I"           "I"           "BORED"       "I"           "OUT"        
##  [81] "BB"          "I"           "I"           "I"           "LOVING"     
##  [86] "I"           "I"           "A"           "I"           "I"          
##  [91] "I"           "EVERYTHANG"  "I"           "I"           "I"          
##  [96] "EXCEPT"      "I"           "I"           "YES"         "I"          
## [101] "I"           "I"           "GOLD"        "I"           "GOOOOOOD"   
## [106] "LONG"        "I"           "LONG"        "I"           "I"          
## [111] "S"           "O"           "S"           "I"           "NO"         
## [116] "I"           "I"           "I"           "I"           "I"          
## [121] "I"           "I"           "I"           "DVD"         "TV"         
## [126] "NOTHING"     "I"           "I"           "I"           "TWO"        
## [131] "I"           "I"           "LITTLE"      "MOST"        "A"          
## [136] "KNOWING"     "BB"          "I"           "I"           "FINISHED"   
## [141] "I"           "I"           "OFF"         "I"          
## 
## [[10]]
##  [1] "T"   "US"  "US"  "OK"  "I"   "I"   "I"   "I"   "I"   "I"   "I"   "USB"
## [13] "I"   "I"   "I"   "I"   "I"   "I"   "I"   "AMD" "I"   "TED"

# or
#str_match_all(blog_gender_1000[1:10,]$blog, '\\b[A-Z]+\\b')
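The output above is dominated by the pronoun "I". The sketch below, on a made-up vector x, shows how to count the matches per string and how a quantifier such as {2,} restricts the pattern to words of at least two capital letters. It uses str_extract_all, which returns the matches as a list of character vectors; treating single-letter words as noise is an assumption of this example, not a requirement of the exercise.

```r
library(stringr)

# Toy character vector (hypothetical example data)
x <- c("I think NASA and ESA agree", "nothing here")

# All matches per string, base R and stringr
all_base    <- regmatches(x, gregexpr("\\b[A-Z]+\\b", x))
all_stringr <- str_extract_all(x, "\\b[A-Z]+\\b")

# Number of fully capitalized words per string
n_caps <- lengths(all_base)
n_caps

# Require at least two capital letters to skip "I", "A" and similar
long_caps <- str_extract_all(x, "\\b[A-Z]{2,}\\b")
long_caps
```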

  1. Now we want to process the computer_531 data and separate the aspects and sentiments for each record. First, use a regular expression to extract the characters from the beginning of each line up to the ##. Apply this only to the first 20 reviews in the data. Use the str_extract or gsub function.


str_extract(computer_531[1:20,], "[^#]*")
##  [1] ""                                  "inexpensive[+1][a] "              
##  [3] "monitor[-1] "                      "screen[-1], picture quality[-1] " 
##  [5] "monitor[-1], picture quality[-1] " "screen[-1] "                      
##  [7] "Display[-1] "                      "monitor[-1] "                     
##  [9] ""                                  ""                                 
## [11] ""                                  ""                                 
## [13] ""                                  "monitor[-1], colors[-1] "         
## [15] ""                                  "size[+1] "                        
## [17] "computer[+1], quality[+1] "        ""                                 
## [19] "keyboard[+1], color[+1] "          "speed[+1], memory[+1] "

# gsub("[^##]*", "\\1", computer_531[1:20,])

gsub("(.*?)(##.*)", "\\1", computer_531[1:20,]) # keep group 1, the text before the first ##
##  [1] ""                                  "inexpensive[+1][a] "              
##  [3] "monitor[-1] "                      "screen[-1], picture quality[-1] " 
##  [5] "monitor[-1], picture quality[-1] " "screen[-1] "                      
##  [7] "Display[-1] "                      "monitor[-1] "                     
##  [9] ""                                  ""                                 
## [11] ""                                  ""                                 
## [13] ""                                  "monitor[-1], colors[-1] "         
## [15] ""                                  "size[+1] "                        
## [17] "computer[+1], quality[+1] "        ""                                 
## [19] "keyboard[+1], color[+1] "          "speed[+1], memory[+1] "
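Both approaches can be tried on a couple of toy lines first. The vector lines below is made up to mimic the "aspects ## review" layout seen in the output above; the last call also shows one possible way to keep only the review text.

```r
library(stringr)

# Toy lines in the same layout as the data (hypothetical example data)
lines <- c("monitor[-1] ##the monitor flickers", "##nothing tagged here")

# Everything before the first '#': [^#]* matches any run of non-'#' characters
before_stringr <- str_extract(lines, "[^#]*")

# Same result in base R: delete the ## and everything after it
before_base <- sub("##.*", "", lines)

# The other half: delete everything up to and including the first ##
review_text <- sub("^.*?##", "", lines)

before_stringr
review_text
```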

  1. Add two new columns to computer_531 and name them aspect_sentiment and review. Fill the aspect_sentiment column for the dataframe with the command you found in the previous question. In the review column, only keep the review text for each line in the dataframe.

computer_531$aspect_sentiment <- gsub("(.*?)(##.*)", "\\1", computer_531$computer_data) # keep the text before the first ##
head(computer_531)

## keep only reviews:

computer_531$review <- sub('^.*?##', '', computer_531$computer_data) # drop everything up to and including the first ##

The difference between the two functions is that sub replaces only the first occurrence of the pattern in each string, whereas gsub replaces all occurrences (that is, it replaces globally).
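A one-line example makes the difference visible. The string x is made up in the style of the data:

```r
# Toy string with two sentiment tags (hypothetical example data)
x <- "screen[-1], picture quality[-1] ##the screen is dim"

# sub() removes only the first [-1]
first_only <- sub("\\[-1\\]", "", x)
first_only

# gsub() removes every [-1]
all_tags <- gsub("\\[-1\\]", "", x)
all_tags
```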


  1. Now remove the sentiment scores from the aspects. Store the aspects without their scores in a new column of computer_531 and name it aspects.

computer_531$aspects <- gsub("\\[.+?\\]", "", computer_531$aspect_sentiment) #the \\ is used to escape [ and ]
head(computer_531)

# computer_531 <- computer_531 %>% 
#  mutate(aspects = gsub("\\[.+?\\]", "", aspect_sentiment))


# only for +1 and -1
# gsub("\\[(.+?)1\\]", "", "a yes[+1][a]")
# gsub("\\[(\\+|-)1\\]", "", "a yes[+1][a]")

Using unlist(strsplit(string, ",")) gives you back a vector of separated aspects.
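For instance, on a made-up aspects string (the value below is only an illustration), the split leaves the surrounding spaces in place, so trimws() is a useful follow-up:

```r
# Toy value in the style of the aspects column (hypothetical example data)
aspects <- "screen, picture quality "

# Split on the comma; the pieces still carry leading/trailing spaces
parts <- unlist(strsplit(aspects, ","))
parts

# Remove the surrounding whitespace from each piece
clean_parts <- trimws(parts)
clean_parts
```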


  1. Create a new column sentiment with the values positive, negative and neutral. Set the value to neutral if there is no aspect in the corresponding column or if the sum of the scores is equal to zero. Save the resulting object as a .csv file and name it computer_531.csv.

computer_531 <- computer_531 %>%
  mutate(sentiment_score = 0)

for(i in 1:nrow(computer_531)){ 
  # Extract only the signed scores inside square brackets, e.g. [+1] or [-1];
  # in [+-]? the ? matches the preceding element zero or one time
  score_list <- str_extract_all(computer_531[i,]$aspect_sentiment, "(?<=\\[)[+-]?\\d+(?=\\])")
  if(length(score_list[[1]]) != 0){ # requiring the brackets skips other numbers in the text, such as 17 and 19
    computer_531[i,]$sentiment_score <- sum(as.numeric(score_list[[1]]))
  }
}
computer_531 <- computer_531 %>%
  mutate(sentiment = "neutral")

for(i in 1:nrow(computer_531)){ 
  if (as.numeric(computer_531[i,]$sentiment_score) < 0) {
      computer_531[i,]$sentiment <- "negative"
    }
    else if(as.numeric(computer_531[i,]$sentiment_score) > 0) {
      computer_531[i,]$sentiment <- "positive"
    }
}

head(computer_531)

write_csv(computer_531, "computer_531.csv")
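The loops can also be written without explicit iteration. The following is a minimal base R sketch on a made-up vector in the format of the aspect_sentiment column. It assumes that every score appears as a signed number inside square brackets; the object names are chosen for this example only.

```r
# Toy values in the format of aspect_sentiment (hypothetical example data)
aspect_sentiment <- c("",
                      "monitor[-1], colors[-1] ",
                      "size[+1] ",
                      "17 inch screen[+1], price[-1] ")

# Signed numbers inside [ ]: lookarounds need perl = TRUE in base R
hits   <- gregexpr("(?<=\\[)[+-]?\\d+(?=\\])", aspect_sentiment, perl = TRUE)
scores <- regmatches(aspect_sentiment, hits)

# Sum the scores per row; an empty set of scores sums to 0
sentiment_score <- vapply(scores, function(s) sum(as.numeric(s)), numeric(1))

# Map the sign of the sum to a label
sentiment <- ifelse(sentiment_score > 0, "positive",
             ifelse(sentiment_score < 0, "negative", "neutral"))

data.frame(aspect_sentiment, sentiment_score, sentiment)
```

Note that the 17 in the last toy value is ignored because it is not inside brackets, and that a row whose scores cancel out ends up neutral.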

Summary


  • Text mining packages

  • Different text file formats

The primary R functions for dealing with regular expressions are:

  • grep(), grepl(): Search for matches of a regular expression/pattern in a character vector

  • regexpr(), gregexpr(): Search a character vector for regular expression matches and return the indices where the match begins; useful in conjunction with regmatches()

  • sub(), gsub(): Search a character vector for regular expression matches and replace that match with another string

  • The stringr package provides a series of functions implementing much of the regular expression functionality in R but with a more consistent and rationalized interface.
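As a quick reference, the sketch below pairs some base R functions with stringr counterparts on a made-up vector x. Note the argument order: base R takes the pattern first, stringr takes the string first.

```r
library(stringr)

# Toy character vector (hypothetical example data)
x <- c("size[+1] ", "no tag")

# grepl() ~ str_detect(): does each string contain a match?
same_detect <- identical(grepl("\\[", x), str_detect(x, "\\["))

# grep() ~ str_which(): which strings contain a match?
same_which <- identical(grep("\\[", x), str_which(x, "\\["))

# sub() ~ str_replace(): replace the first match
same_first <- identical(sub("\\[.+?\\]", "", x), str_replace(x, "\\[.+?\\]", ""))

# gsub() ~ str_replace_all(): replace all matches
same_all <- identical(gsub("\\[.+?\\]", "", x), str_replace_all(x, "\\[.+?\\]", ""))

c(same_detect, same_which, same_first, same_all)
```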


End of Practical