Introduction

Welcome to the fourth practical of the course “Introduction to Text Mining with R”. In this practical, we will analyze sentiment in a text data set, first using dictionaries, and then we will classify sentiment in text reviews, comparing a dictionary-based method with a decision tree model.

In this practical we are going to use the following packages:

library(tm)
library(tidytext)
library(dplyr)
library(ggplot2)
library(caret)
library(rpart)
library(rpart.plot)

Read data


  1. In today’s practical, we want to use Taylor Swift song lyrics data from all her albums. Read the “taylor_swift_lyrics.csv” dataset from the data folder.

taylor_swift_lyrics <- read.csv("data/taylor_swift_lyrics.csv")
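A quick structure check is worthwhile here; the rest of the practical assumes the data contains Album, Title, and Lyrics columns:

# sanity check: we will use the Album, Title, and Lyrics columns below
str(taylor_swift_lyrics)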

  2. First we must preprocess the corpus. Create a document-term matrix from the Lyrics column of the taylor_swift_lyrics data frame. Complete the following preprocessing steps:
  • convert to lowercase
  • remove stop words
  • remove numbers
  • remove punctuation.

docs <- Corpus(VectorSource(taylor_swift_lyrics$Lyrics))

dtm <- DocumentTermMatrix(docs,
           control = list(tolower = TRUE,
                          removeNumbers = TRUE,
                          removePunctuation = TRUE,
                          stopwords = TRUE
                         ))

  3. Inspect the dtm object and convert it to a dataframe.

# inspect(dtm)  # prints a summary and a sample of entries; run it to explore
dtm <- as.data.frame(as.matrix(dtm))

Sentiment dictionaries


  4. We’re going to use sentiment dictionaries from the tidytext package. Using the get_sentiments function, load the “bing” and “afinn” dictionaries and store them in two objects called bing_sentiments and afinn_sentiments. You might need to install the “textdata” package.

The tidytext package contains 4 general-purpose lexicons in the sentiments dataset:

  • AFINN - English words rated for valence between -5 and +5
  • bing - words labelled as positive or negative
  • nrc - English words and their associations with 8 emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and 2 sentiments (negative and positive); binary
  • loughran - sentiment words for accounting and finance by category (Negative, Positive, Uncertainty, Litigious, Strong Modal, Weak Modal, Constraining)


# also "nrc", "loughran"

bing_sentiments  <- get_sentiments("bing")
afinn_sentiments <- get_sentiments("afinn")

head(bing_sentiments)
head(afinn_sentiments)
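The “nrc” and “loughran” lexicons load the same way, in case you want to explore them (the textdata package prompts to download each lexicon on first use):

# a quick look at the nrc and loughran lexicons (optional)
nrc_sentiments      <- get_sentiments("nrc")
loughran_sentiments <- get_sentiments("loughran")
head(nrc_sentiments)
table(loughran_sentiments$sentiment)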

  5. The afinn_sentiments dictionary rates words for valence between -5 and +5, while bing_sentiments labels words as positive or negative. Add a column to bing_sentiments called score. This column should hold a “1” for positive words and a “-1” for negative words.

bing_sentiments$score <- ifelse(bing_sentiments$sentiment=="positive", 1, -1)

Sentiment score for each lyric


  6. Create a dataframe that holds all the words in the dtm object along with their sentiment score.

# get all the words in our dtm and put it in a dataframe
words <- data.frame(word = colnames(dtm))
head(words)

# get their sentiment scores
# (merge sorts the result by word; the dtm columns are alphabetical as well, so the orders line up)
words <- merge(words, bing_sentiments, all.x = T)
head(words)

# replace NAs with 0s
words$score[is.na(words$score)] <- 0
head(words)

  7. To calculate a score for each lyric, multiply your dtm object (as a matrix) by the vector of word scores.


# calculate documents scores with matrix algebra! 
scores <- as.matrix(dtm) %*% words$score
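To see what this multiplication computes, here is a toy example with made-up words and counts: each document score is the sum of its word counts weighted by the word scores.

# toy example: 2 documents x 3 terms, with scores happy = +1, sad = -1, love = +1
toy_dtm <- matrix(c(2, 0, 1,
                    0, 3, 1),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("doc1", "doc2"), c("happy", "sad", "love")))
toy_dtm %*% c(1, -1, 1)
# doc1: 2*1 + 0*(-1) + 1*1 =  3
# doc2: 0*1 + 3*(-1) + 1*1 = -2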

  8. Add the calculated scores for each lyric to the original taylor_swift_lyrics dataframe.

taylor_swift_lyrics$sentiment_bing <- scores
head(taylor_swift_lyrics)

  9. Plot the bing sentiment scores for each lyric.

taylor_swift_lyrics[1:60,] %>% ggplot() +
  geom_col(aes(Title, sentiment_bing), fill = "lightgreen", alpha=.75) +
  xlab("") + 
  ylab("Sentiment Score") +
  ggtitle("Sentiment Analysis of Taylor Swift Lyrics") +
  coord_flip() +
  theme_minimal() +
  # theme() comes after theme_minimal(), which would otherwise reset these settings
  theme(legend.position = "none", 
        plot.title = element_text(hjust = 0.5),
        panel.grid.major = element_blank(),
        axis.text.y = element_text(size = 6))


taylor_swift_lyrics[61:132,] %>% ggplot() +
  geom_col(aes(Title, sentiment_bing), fill = "lightgreen", alpha=.75) +
  xlab("") + 
  ylab("Sentiment Score") +
  ggtitle("Sentiment Analysis of Taylor Swift Lyrics") +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none", 
        plot.title = element_text(hjust = 0.5),
        panel.grid.major = element_blank(),
        axis.text.y = element_text(size = 6))


  10. Using the code you wrote above, below we made a function that takes 1) a vector of texts, and 2) a sentiment dictionary (i.e. a dataframe with words and scores), and returns a vector of sentiment scores, one per text. Use this function to repeat your analysis with the afinn sentiment dictionary.

sentiment_score <- function(texts, sentiment_dict){
  # preprocess texts
  docs <- Corpus(VectorSource(texts))
  dtm <- DocumentTermMatrix(docs,
           control = list(stopwords = T,
                          tolower = TRUE,
                          removeNumbers = TRUE,
                          removePunctuation = TRUE))
  dtm <- as.data.frame(as.matrix(dtm))
  
  # get all the words in dtm and put it in a dataframe
  words <- data.frame(word = colnames(dtm))

  # get their sentiment scores
  words <- merge(words, sentiment_dict, all.x = T)

  # replace NAs with 0s
  words$score[is.na(words$score)] <- 0
  
  # calculate documents scores with matrix algebra!
  scores <- as.matrix(dtm) %*% words$score
  
  return(scores)
}
# afinn's score column is named "value"; rename it so the function can find it
colnames(afinn_sentiments)[2] <- "score"
taylor_swift_lyrics$sentiment_afinn <- sentiment_score(taylor_swift_lyrics$Lyrics, afinn_sentiments)

taylor_swift_lyrics[1:60,] %>% ggplot() +
  geom_col(aes(Title, sentiment_afinn), fill = "orange", alpha=.75) +
  xlab("") + 
  ylab("Sentiment Score") +
  ggtitle("Sentiment Analysis of Taylor Swift Lyrics") +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none", 
        plot.title = element_text(hjust = 0.5),
        panel.grid.major = element_blank(),
        axis.text.y = element_text(size = 6))


taylor_swift_lyrics[61:132,] %>% ggplot() +
  geom_col(aes(Title, sentiment_afinn), fill = "orange", alpha=.75) +
  xlab("") + 
  ylab("Sentiment Score") +
  ggtitle("Sentiment Analysis of Taylor Swift Lyrics") +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none", 
        plot.title = element_text(hjust = 0.5),
        panel.grid.major = element_blank(),
        axis.text.y = element_text(size = 6))


  11. Compare the bing and afinn dictionaries by finding out which Taylor Swift album is the most positive and which is the least positive. You can also plot the sentiments for albums.

# concatenate to make albums
albums <- taylor_swift_lyrics %>% group_by(Album) %>%
  summarise(lyrics = paste0(Lyrics, collapse = ";"))

# add sentiments
albums$sentiment_bing <- sentiment_score(albums$lyrics, bing_sentiments)

# add sentiments
albums$sentiment_afinn <- sentiment_score(albums$lyrics, afinn_sentiments)
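With both scores in place, the question can be answered directly:

# most and least positive album according to each dictionary
albums$Album[which.max(albums$sentiment_bing)]
albums$Album[which.min(albums$sentiment_bing)]
albums$Album[which.max(albums$sentiment_afinn)]
albums$Album[which.min(albums$sentiment_afinn)]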

albums %>% ggplot() +
  geom_col(aes(Album, sentiment_bing), fill = "#edc948", alpha=.75) +
  xlab("") + 
  ylab("Sentiment Score") +
  ggtitle("Sentiment Analysis of Taylor Swift Albums using the Bing dictionary") +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none", 
        plot.title = element_text(hjust = 0.5),
        panel.grid.major = element_blank())


albums %>% ggplot() +
  geom_col(aes(Album, sentiment_afinn), fill = "lightblue", alpha=.75) +
  xlab("") + 
  ylab("Sentiment Score") +
  ggtitle("Sentiment Analysis of Taylor Swift Albums using the Afinn dictionary") +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none", 
        plot.title = element_text(hjust = 0.5),
        panel.grid.major = element_blank())


Sentiment analysis of product reviews

In this part of the practical, we will do some sentiment analysis on computer product reviews. For this purpose, we will use our processed data set from the first practical: the computer_531 dataset.


  12. Load the final computer_531 dataframe from the first practical, where we created it from the ‘computer.txt’ file.


computer_531 <- read.csv("computer_531.csv")

  13. Apply the sentiment_score function to the reviews in the dataframe with both the bing and afinn dictionaries.

computer_531$sentiment_bing  <- sentiment_score(computer_531$review, bing_sentiments)
computer_531$sentiment_afinn <- sentiment_score(computer_531$review, afinn_sentiments)

head(computer_531[,c("review", "sentiment", "sentiment_bing", "sentiment_afinn")], n = 10)

  14. Create a confusion matrix, and calculate accuracy, precision, recall, and F1 measures for the output of the bing dictionary.

# convert scores into negative, neutral, and positive classes for the bing output
computer_531$sentiment_bing1 <- as.vector(
  ifelse(computer_531$sentiment_bing >= 1, "positive",
  ifelse(computer_531$sentiment_bing <= -1, "negative", "neutral")))

# performance of bing
confusionMatrix(table(computer_531$sentiment, computer_531$sentiment_bing1, dnn=c("Actual", "Predicted")))
## Confusion Matrix and Statistics
## 
##           Predicted
## Actual     negative neutral positive
##   negative       19      23       21
##   neutral        60     133       99
##   positive       23      89       64
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4068          
##                  95% CI : (0.3647, 0.4499)
##     No Information Rate : 0.4614          
##     P-Value [Acc > NIR] : 0.9950122       
##                                           
##                   Kappa : 0.0253          
##                                           
##  Mcnemar's Test P-Value : 0.0006687       
## 
## Statistics by Class:
## 
##                      Class: negative Class: neutral Class: positive
## Sensitivity                  0.18627         0.5429          0.3478
## Specificity                  0.89744         0.4441          0.6772
## Pos Pred Value               0.30159         0.4555          0.3636
## Neg Pred Value               0.82265         0.5314          0.6620
## Prevalence                   0.19209         0.4614          0.3465
## Detection Rate               0.03578         0.2505          0.1205
## Detection Prevalence         0.11864         0.5499          0.3315
## Balanced Accuracy            0.54186         0.4935          0.5125
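confusionMatrix reports sensitivity and specificity by default; per-class precision, recall, and F1 are available from its byClass element, as in this sketch (the same call works for the afinn output below):

# per-class precision, recall, and F1 for the bing output
cm_bing <- confusionMatrix(table(computer_531$sentiment, computer_531$sentiment_bing1,
                                 dnn = c("Actual", "Predicted")), mode = "everything")
cm_bing$byClass[, c("Precision", "Recall", "F1")]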

  15. Create a confusion matrix, and calculate accuracy, precision, recall, and F1 measures for the output of the afinn dictionary.

# convert scores into negative, neutral, and positive classes for the afinn output
computer_531$sentiment_afinn1 <- as.vector(
  ifelse(computer_531$sentiment_afinn >= 1, "positive",
  ifelse(computer_531$sentiment_afinn <= -1, "negative", "neutral")))

# performance of afinn
confusionMatrix(table(computer_531$sentiment, computer_531$sentiment_afinn1, dnn=c("Actual", "Predicted")))
## Confusion Matrix and Statistics
## 
##           Predicted
## Actual     negative neutral positive
##   negative       10      33       20
##   neutral        40     170       82
##   positive       24      96       56
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4444          
##                  95% CI : (0.4017, 0.4879)
##     No Information Rate : 0.5631          
##     P-Value [Acc > NIR] : 1.0000          
##                                           
##                   Kappa : 0.0341          
##                                           
##  Mcnemar's Test P-Value : 0.5447          
## 
## Statistics by Class:
## 
##                      Class: negative Class: neutral Class: positive
## Sensitivity                  0.13514         0.5686          0.3544
## Specificity                  0.88403         0.4741          0.6783
## Pos Pred Value               0.15873         0.5822          0.3182
## Neg Pred Value               0.86325         0.4603          0.7127
## Prevalence                   0.13936         0.5631          0.2976
## Detection Rate               0.01883         0.3202          0.1055
## Detection Prevalence         0.11864         0.5499          0.3315
## Balanced Accuracy            0.50958         0.5213          0.5164

  16. From the rpart package, we want to build a decision tree on this dataset to predict sentiment. For this purpose, first prepare your data by preprocessing the reviews, converting them to a dtm, and creating training and test sets. We keep only frequent terms so that the tree has a manageable number of features.


set.seed(123)

corpus <- Corpus(VectorSource(computer_531$review))
# standardize to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
# remove tm stopwords
corpus <- tm_map(corpus, removeWords, stopwords())
# standardize whitespaces
corpus <- tm_map(corpus, stripWhitespace)
# remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# remove numbers
corpus <- tm_map(corpus, removeNumbers)
dtm <- DocumentTermMatrix(corpus)
# keep words appearing at least 10 times
features <- findFreqTerms(dtm, 10)
head(features)
## [1] "monitor"  "purchase" "time"     "quality"  "screen"   "see"

train_idx <- createDataPartition(computer_531$sentiment, p=0.80, list=FALSE)
# set for the original raw data 
train1    <- computer_531[train_idx,]
test1     <- computer_531[-train_idx,]
# set for the cleaned-up data
train2    <- corpus[train_idx]
test2     <- corpus[-train_idx]


dtm_train <- DocumentTermMatrix(train2, list(dictionary=features))
dtm_test  <- DocumentTermMatrix(test2, list(dictionary=features))
dtm_train <- as.data.frame(as.matrix(dtm_train))
dtm_test  <- as.data.frame(as.matrix(dtm_test))

# cbind on a data frame already returns a data frame, so no further conversion is needed
dtm_train1 <- cbind(cat = factor(train1$sentiment), dtm_train)
dtm_test1  <- cbind(cat = factor(test1$sentiment), dtm_test)

  17. Now build a tree with the default settings of the rpart function, and visualize the tree.

# build a small tree for prediction
fit_tree <- rpart(cat ~ ., data = dtm_train1, method = "class")
rpart.plot(fit_tree)
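If the plot is hard to read, rpart.plot also provides rpart.rules, which prints the fitted tree as plain-text rules:

# print the tree as a set of classification rules
rpart.rules(fit_tree)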


  18. Check the performance of the tree on your training set.

# performance on the training set
preds <- predict(fit_tree, type = "class")
confusionMatrix(table(dtm_train1$cat, preds, dnn=c("Actual", "Predicted")))
## Confusion Matrix and Statistics
## 
##           Predicted
## Actual     negative neutral positive
##   negative        0      38       13
##   neutral         0     213       21
##   positive        0      74       67
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6573          
##                  95% CI : (0.6101, 0.7023)
##     No Information Rate : 0.7629          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3179          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: negative Class: neutral Class: positive
## Sensitivity                       NA         0.6554          0.6634
## Specificity                   0.8803         0.7921          0.7723
## Pos Pred Value                    NA         0.9103          0.4752
## Neg Pred Value                    NA         0.4167          0.8807
## Prevalence                    0.0000         0.7629          0.2371
## Detection Rate                0.0000         0.5000          0.1573
## Detection Prevalence          0.1197         0.5493          0.3310
## Balanced Accuracy                 NA         0.7237          0.7178

  19. Check the performance of the tree on the test set, and compare it with the training performance.

# performance on the test set
preds <- predict(fit_tree, newdata = dtm_test1, type = "class")
confusionMatrix(table(dtm_test1$cat, preds, dnn=c("Actual", "Predicted")))
## Confusion Matrix and Statistics
## 
##           Predicted
## Actual     negative neutral positive
##   negative        0      11        1
##   neutral         0      53        5
##   positive        0      23       12
## 
## Overall Statistics
##                                           
##                Accuracy : 0.619           
##                  95% CI : (0.5191, 0.7121)
##     No Information Rate : 0.8286          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2148          
##                                           
##  Mcnemar's Test P-Value : 3.069e-05       
## 
## Statistics by Class:
## 
##                      Class: negative Class: neutral Class: positive
## Sensitivity                       NA         0.6092          0.6667
## Specificity                   0.8857         0.7222          0.7356
## Pos Pred Value                    NA         0.9138          0.3429
## Neg Pred Value                    NA         0.2766          0.9143
## Prevalence                    0.0000         0.8286          0.1714
## Detection Rate                0.0000         0.5048          0.1143
## Detection Prevalence          0.1143         0.5524          0.3333
## Balanced Accuracy                 NA         0.6657          0.7011

Here you found that while decision trees can outperform a simple dictionary-based method, they are not very good with high-dimensional data such as text. If you have some time left, try a Random Forest, naive Bayes, or SVM to compare the performance. They will perform better, but you will lose interpretability!
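As a starting point, here is a minimal sketch for naive Bayes, assuming the e1071 package is installed; a Random Forest or SVM can be swapped in analogously.

# minimal sketch: naive Bayes on the same train/test features (assumes e1071)
# note: naiveBayes treats the numeric term counts as Gaussian features
library(e1071)
fit_nb   <- naiveBayes(cat ~ ., data = dtm_train1)
preds_nb <- predict(fit_nb, newdata = dtm_test1)
confusionMatrix(table(dtm_test1$cat, preds_nb, dnn = c("Actual", "Predicted")))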


Summary


In this practical, we learned about:

  • Sentiment analysis
  • Dictionary-based methods
  • Available sentiment dictionaries
  • Learning sentiments
  • Decision trees

End of Practical