Welcome to the fourth practical of the course "Introduction to Text Mining with R". In this practical, we will analyze sentiment in a text data set, first using dictionaries, and then we will classify sentiment in product reviews, comparing a dictionary-based method with a decision tree model.
In this practical we are going to use the following packages:
library(tm)
library(tidytext)
library(dplyr)
library(ggplot2)
library(caret)
library(rpart)
library(rpart.plot)
taylor_swift_lyrics <- read.csv("data/taylor_swift_lyrics.csv")
The lyrics are stored in the Lyrics column of the taylor_swift_lyrics data frame. Complete the following preprocessing steps: convert the text to lowercase and remove numbers, punctuation, and stop words.
docs <- Corpus(VectorSource(taylor_swift_lyrics$Lyrics))
dtm <- DocumentTermMatrix(docs,
control = list(tolower = TRUE,
removeNumbers = TRUE,
removePunctuation = TRUE,
stopwords = TRUE
))
# inspect(dtm)
dtm <- as.data.frame(as.matrix(dtm))
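Before moving on, it is worth a quick sanity check on the document-term matrix; for instance (a small optional snippet), its dimensions and the most frequent terms:
# number of documents (songs) and number of unique terms
dim(dtm)
# ten most frequent terms across all lyrics
sort(colSums(dtm), decreasing = TRUE)[1:10]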
For the sentiment analysis we need dictionaries from the tidytext package. Using the get_sentiments function, load the "bing" and "afinn" dictionaries and store them in two objects called bing_sentiments and afinn_sentiments. You might need to install the "textdata" package.
The tidytext package contains 4 general-purpose lexicons in the sentiments dataset:
AFINN - a list of English words rated for valence between -5 and +5
bing - a list of English words labeled as positive or negative
nrc - a list of English words and their associations with 8 emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and 2 sentiments (negative and positive); binary
loughran - a list of sentiment words for accounting and finance by category (Negative, Positive, Uncertainty, Litigious, Strong Modal, Weak Modal, Constraining)
# also “nrc”, “loughran”
bing_sentiments <- get_sentiments("bing")
afinn_sentiments <- get_sentiments("afinn")
head(bing_sentiments)
head(afinn_sentiments)
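To get a feel for the two dictionaries, you can also tabulate them; a small optional check:
# bing labels each word as positive or negative
table(bing_sentiments$sentiment)
# afinn rates each word on a -5 to +5 valence scale (column "value")
summary(afinn_sentiments$value)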
Add a column to bing_sentiments called score. This column should hold a "1" for positive words and a "-1" for negative words.
bing_sentiments$score <- ifelse(bing_sentiments$sentiment == "positive", 1, -1)
# get all the words in our dtm and put them in a dataframe
words <- data.frame(word = colnames(dtm))
head(words)
# get their sentiment scores; note that merge() sorts the result
# alphabetically by word, matching the column order of the dtm
words <- merge(words, bing_sentiments, all.x = TRUE)
head(words)
# replace NAs with 0s
words$score[is.na(words$score)] <- 0
head(words)
# calculate documents scores with matrix algebra!
scores <- as.matrix(dtm) %*% words$score
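The matrix product works because row i of the document-term matrix holds the word counts of document i, so multiplying by the score vector sums the count-times-score contributions per document. A toy example with 2 documents and 3 terms:
# document 1 contains term 1 twice and term 3 once;
# document 2 contains term 2 three times and term 3 once
toy_dtm <- matrix(c(2, 0, 1,
                    0, 3, 1), nrow = 2, byrow = TRUE)
toy_scores <- c(1, -1, 0)  # term 1 positive, term 2 negative, term 3 neutral
toy_dtm %*% toy_scores     # document scores: 2 and -3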
Add the document scores as a new column called sentiment_bing to the taylor_swift_lyrics data frame.
taylor_swift_lyrics$sentiment_bing <- scores
head(taylor_swift_lyrics)
taylor_swift_lyrics[1:60,] %>% ggplot() +
  geom_col(aes(Title, sentiment_bing), fill = "lightgreen", alpha = .75) +
  xlab("") +
  ylab("Sentiment Score") +
  ggtitle("Sentiment Analysis of Taylor Swift Lyrics") +
  coord_flip() +
  theme_minimal() +
  # apply theme tweaks after theme_minimal(), otherwise it overrides them
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5),
        panel.grid.major = element_blank(),
        axis.text.y = element_text(size = 6))
taylor_swift_lyrics[61:132,] %>% ggplot() +
  geom_col(aes(Title, sentiment_bing), fill = "lightgreen", alpha = .75) +
  xlab("") +
  ylab("Sentiment Score") +
  ggtitle("Sentiment Analysis of Taylor Swift Lyrics") +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5),
        panel.grid.major = element_blank(),
        axis.text.y = element_text(size = 6))
Now wrap these steps in a function called sentiment_score, which takes a vector of texts and a sentiment dictionary as input and returns one score per document, so that we can repeat the analysis with the afinn sentiment dictionary.
sentiment_score <- function(texts, sentiment_dict){
# preprocess texts
docs <- Corpus(VectorSource(texts))
dtm <- DocumentTermMatrix(docs,
control = list(stopwords = T,
tolower = TRUE,
removeNumbers = TRUE,
removePunctuation = TRUE))
dtm <- as.data.frame(as.matrix(dtm))
# get all the words in dtm and put it in a dataframe
words <- data.frame(word = colnames(dtm))
# get their sentiment scores
words <- merge(words, sentiment_dict, all.x = TRUE)
# replace NAs with 0s
words$score[is.na(words$score)] <- 0
# calculate documents scores with matrix algebra!
scores <- as.matrix(dtm) %*% words$score
return(scores)
}
# rename the afinn value column so the function can find a "score" column
colnames(afinn_sentiments)[2] <- "score"
taylor_swift_lyrics$sentiment_afinn <- sentiment_score(taylor_swift_lyrics$Lyrics, afinn_sentiments)
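Note that %*% returns a one-column matrix, so the new column is stored as a matrix column. If you prefer a plain numeric vector, you can optionally flatten it:
# optional: flatten the one-column matrix into a numeric vector
taylor_swift_lyrics$sentiment_afinn <- as.numeric(taylor_swift_lyrics$sentiment_afinn)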
taylor_swift_lyrics[1:60,] %>% ggplot() +
  geom_col(aes(Title, sentiment_afinn), fill = "orange", alpha = .75) +
  xlab("") +
  ylab("Sentiment Score") +
  ggtitle("Sentiment Analysis of Taylor Swift Lyrics") +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5),
        panel.grid.major = element_blank(),
        axis.text.y = element_text(size = 6))
taylor_swift_lyrics[61:132,] %>% ggplot() +
  geom_col(aes(Title, sentiment_afinn), fill = "orange", alpha = .75) +
  xlab("") +
  ylab("Sentiment Score") +
  ggtitle("Sentiment Analysis of Taylor Swift Lyrics") +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5),
        panel.grid.major = element_blank(),
        axis.text.y = element_text(size = 6))
# concatenate to make albums
albums <- taylor_swift_lyrics %>% group_by(Album) %>%
summarise(lyrics = paste0(Lyrics, collapse = ";"))
# add sentiments
albums$sentiment_bing <- sentiment_score(albums$lyrics, bing_sentiments)
# add sentiments
albums$sentiment_afinn <- sentiment_score(albums$lyrics, afinn_sentiments)
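Keep in mind that these scores are raw sums: a longer album can accumulate a larger absolute score simply because it contains more words. If you want to compare albums on an equal footing, one possible refinement (a hypothetical extra step, not part of the exercise) is to normalize by word count:
# rough word count per album (split on whitespace)
albums$n_words <- sapply(strsplit(albums$lyrics, "\\s+"), length)
# average sentiment per word rather than a raw sum
albums$bing_per_word <- as.numeric(albums$sentiment_bing) / albums$n_words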
albums %>% ggplot() +
  geom_col(aes(Album, sentiment_bing), fill = "#edc948", alpha = .75) +
  xlab("") +
  ylab("Sentiment Score") +
  ggtitle("Sentiment Analysis of Taylor Swift Albums using the Bing dictionary") +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5),
        panel.grid.major = element_blank())
albums %>% ggplot() +
  geom_col(aes(Album, sentiment_afinn), fill = "lightblue", alpha = .75) +
  xlab("") +
  ylab("Sentiment Score") +
  ggtitle("Sentiment Analysis of Taylor Swift Albums using the Afinn dictionary") +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5),
        panel.grid.major = element_blank())
In this part of the practical, we will do some sentiment analysis on computer product reviews. For this purpose, we will use our processed data set from the first practical: the computer_531 dataset.
First, load the computer_531 data frame, which we created in the first practical from the 'computer.txt' file.
computer_531 <- read.csv("computer_531.csv")
Apply the sentiment_score function to the reviews in the data frame with both the bing and afinn dictionaries.
computer_531$sentiment_bing <- sentiment_score(computer_531$review, bing_sentiments)
computer_531$sentiment_afinn <- sentiment_score(computer_531$review, afinn_sentiments)
head(computer_531[,c("review", "sentiment", "sentiment_bing", "sentiment_afinn")], n = 10)
# converting numbers into neutral, positive and negative classes for the output of the bing dictionary
computer_531 <- computer_531 %>%
mutate(sentiment_bing1 = "neutral")
for(i in 1:nrow(computer_531)){
if(computer_531[i,]$sentiment_bing >= 1){
computer_531[i,]$sentiment_bing1 <- "positive"
} else if(computer_531[i,]$sentiment_bing <= -1){
computer_531[i,]$sentiment_bing1 <- "negative"
}
}
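The loop above is explicit, but the same recoding can be done in one vectorized step. A minimal sketch with dplyr's case_when, assuming the score column is first flattened to a plain numeric vector:
# vectorized alternative to the loop, using the same thresholds
computer_531$sentiment_bing <- as.numeric(computer_531$sentiment_bing)
computer_531$sentiment_bing1 <- case_when(
  computer_531$sentiment_bing >= 1  ~ "positive",
  computer_531$sentiment_bing <= -1 ~ "negative",
  TRUE                              ~ "neutral"
)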
# performance of bing
confusionMatrix(table(computer_531$sentiment, computer_531$sentiment_bing1, dnn=c("Actual", "Predicted")))
## Confusion Matrix and Statistics
##
## Predicted
## Actual negative neutral positive
## negative 19 23 21
## neutral 60 133 99
## positive 23 89 64
##
## Overall Statistics
##
## Accuracy : 0.4068
## 95% CI : (0.3647, 0.4499)
## No Information Rate : 0.4614
## P-Value [Acc > NIR] : 0.9950122
##
## Kappa : 0.0253
##
## Mcnemar's Test P-Value : 0.0006687
##
## Statistics by Class:
##
## Class: negative Class: neutral Class: positive
## Sensitivity 0.18627 0.5429 0.3478
## Specificity 0.89744 0.4441 0.6772
## Pos Pred Value 0.30159 0.4555 0.3636
## Neg Pred Value 0.82265 0.5314 0.6620
## Prevalence 0.19209 0.4614 0.3465
## Detection Rate 0.03578 0.2505 0.1205
## Detection Prevalence 0.11864 0.5499 0.3315
## Balanced Accuracy 0.54186 0.4935 0.5125
# converting numbers into neutral, positive and negative classes for the output of the afinn dictionary
computer_531 <- computer_531 %>%
mutate(sentiment_afinn1 = "neutral")
for(i in 1:nrow(computer_531)){
if(computer_531[i,]$sentiment_afinn >= 1){
computer_531[i,]$sentiment_afinn1 <- "positive"
} else if(computer_531[i,]$sentiment_afinn <= -1){
computer_531[i,]$sentiment_afinn1 <- "negative"
}
}
# performance of afinn
confusionMatrix(table(computer_531$sentiment, computer_531$sentiment_afinn1, dnn=c("Actual", "Predicted")))
## Confusion Matrix and Statistics
##
## Predicted
## Actual negative neutral positive
## negative 10 33 20
## neutral 40 170 82
## positive 24 96 56
##
## Overall Statistics
##
## Accuracy : 0.4444
## 95% CI : (0.4017, 0.4879)
## No Information Rate : 0.5631
## P-Value [Acc > NIR] : 1.0000
##
## Kappa : 0.0341
##
## Mcnemar's Test P-Value : 0.5447
##
## Statistics by Class:
##
## Class: negative Class: neutral Class: positive
## Sensitivity 0.13514 0.5686 0.3544
## Specificity 0.88403 0.4741 0.6783
## Pos Pred Value 0.15873 0.5822 0.3182
## Neg Pred Value 0.86325 0.4603 0.7127
## Prevalence 0.13936 0.5631 0.2976
## Detection Rate 0.01883 0.3202 0.1055
## Detection Prevalence 0.11864 0.5499 0.3315
## Balanced Accuracy 0.50958 0.5213 0.5164
Neither dictionary beats the No Information Rate, so we now turn to a supervised approach. Using the rpart package, we want to build a decision tree on this dataset to predict sentiments. For this purpose, first prepare the data: preprocess the reviews, convert them to a document-term matrix, and create training and test sets.
set.seed(123)
corpus <- Corpus(VectorSource(computer_531$review))
# standardize to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
# remove tm stopwords
corpus <- tm_map(corpus, removeWords, stopwords())
# standardize whitespaces
corpus <- tm_map(corpus, stripWhitespace)
# remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# remove numbers
corpus <- tm_map(corpus, removeNumbers)
dtm <- DocumentTermMatrix(corpus)
# keep words appearing at least 10 times
features <- findFreqTerms(dtm, 10)
head(features)
## [1] "monitor" "purchase" "time" "quality" "screen" "see"
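It is also useful to know how many features survive the frequency cut-off, since this is the dimensionality the tree will have to work with:
# number of terms occurring at least 10 times
length(features)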
train_idx <- createDataPartition(computer_531$sentiment, p=0.80, list=FALSE)
# set for the original raw data
train1 <- computer_531[train_idx,]
test1 <- computer_531[-train_idx,]
# set for the cleaned-up data
train2 <- corpus[train_idx]
test2 <- corpus[-train_idx]
dtm_train <- DocumentTermMatrix(train2, list(dictionary=features))
dtm_test <- DocumentTermMatrix(test2, list(dictionary=features))
dtm_train <- as.data.frame(as.matrix(dtm_train))
dtm_test <- as.data.frame(as.matrix(dtm_test))
# cbind with a data frame already returns a data frame
dtm_train1 <- cbind(cat = factor(train1$sentiment), dtm_train)
dtm_test1 <- cbind(cat = factor(test1$sentiment), dtm_test)
# build a small tree for prediction
fit_tree <- rpart(cat ~ ., data = dtm_train1, method = "class")
rpart.plot(fit_tree)
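Besides the plot, rpart offers a few optional diagnostics; for example, the complexity-parameter table and the variable importance of the fitted tree:
# cross-validated error for each subtree size
printcp(fit_tree)
# which terms the tree found most useful for splitting
fit_tree$variable.importance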
# performance on train
preds <- predict(fit_tree, type = "class")
confusionMatrix(table(dtm_train1$cat, preds, dnn=c("Actual", "Predicted")))
## Confusion Matrix and Statistics
##
## Predicted
## Actual negative neutral positive
## negative 0 38 13
## neutral 0 213 21
## positive 0 74 67
##
## Overall Statistics
##
## Accuracy : 0.6573
## 95% CI : (0.6101, 0.7023)
## No Information Rate : 0.7629
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3179
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: negative Class: neutral Class: positive
## Sensitivity NA 0.6554 0.6634
## Specificity 0.8803 0.7921 0.7723
## Pos Pred Value NA 0.9103 0.4752
## Neg Pred Value NA 0.4167 0.8807
## Prevalence 0.0000 0.7629 0.2371
## Detection Rate 0.0000 0.5000 0.1573
## Detection Prevalence 0.1197 0.5493 0.3310
## Balanced Accuracy NA 0.7237 0.7178
# performance on test
preds <- predict(fit_tree, newdata = dtm_test1, type = "class")
confusionMatrix(table(dtm_test1$cat, preds, dnn=c("Actual", "Predicted")))
## Confusion Matrix and Statistics
##
## Predicted
## Actual negative neutral positive
## negative 0 11 1
## neutral 0 53 5
## positive 0 23 12
##
## Overall Statistics
##
## Accuracy : 0.619
## 95% CI : (0.5191, 0.7121)
## No Information Rate : 0.8286
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2148
##
## Mcnemar's Test P-Value : 3.069e-05
##
## Statistics by Class:
##
## Class: negative Class: neutral Class: positive
## Sensitivity NA 0.6092 0.6667
## Specificity 0.8857 0.7222 0.7356
## Pos Pred Value NA 0.9138 0.3429
## Neg Pred Value NA 0.2766 0.9143
## Prevalence 0.0000 0.8286 0.1714
## Detection Rate 0.0000 0.5048 0.1143
## Detection Prevalence 0.1143 0.5524 0.3333
## Balanced Accuracy NA 0.6657 0.7011
Here you found that while decision trees can outperform a simple dictionary-based method, they are not very good with high-dimensional data such as text. If you have some time left, try a Random Forest, naive Bayes, or SVM to compare the performance. They will perform better, but you will lose interpretability!
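As a starting point, here is a minimal Random Forest sketch on the same train/test matrices. It assumes the randomForest package is installed; the ntree value is just a starting point, and training may take a moment with this many features:
library(randomForest)
# randomForest is stricter than rpart about column names,
# so make them syntactically valid first (same mapping for train and test)
colnames(dtm_train1) <- make.names(colnames(dtm_train1), unique = TRUE)
colnames(dtm_test1) <- make.names(colnames(dtm_test1), unique = TRUE)
fit_rf <- randomForest(cat ~ ., data = dtm_train1, ntree = 200)
preds_rf <- predict(fit_rf, newdata = dtm_test1)
confusionMatrix(table(dtm_test1$cat, preds_rf, dnn = c("Actual", "Predicted")))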
In this practical, we learned about:
dictionary-based sentiment analysis with the bing and afinn lexicons
scoring documents by multiplying a document-term matrix with a vector of word scores
evaluating sentiment classifications with confusion matrices
training and testing a decision tree classifier with rpart
End of Practical