Welcome to the third practical of the course “Introduction to Text Mining with R”. In this practical we will start with creating a pipeline for text classification using simple training and test sets. In the second part of the practical, we will train machine learning techniques on a news dataset and we will compare them.
In this practical we are going to use the following packages:
library(caret)
library(tidyverse)
library(tm)
library(e1071)
toy_data_train <- c('Cats like to chase mice.',
'Dogs like to eat big bones.')
# convert data to vector space model
toy_corpus_train <- VCorpus(VectorSource(toy_data_train))
# create a dtm object
toy_dtm_train <- DocumentTermMatrix(toy_corpus_train,
list(removePunctuation = TRUE,
stopwords = TRUE,
stemming = TRUE,
removeNumbers = TRUE))
inspect(toy_dtm_train)
## <<DocumentTermMatrix (documents: 2, terms: 8)>>
## Non-/sparse entries: 9/7
## Sparsity : 44%
## Maximal term length: 5
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs big bone cat chase dog eat like mice
## 1 0 0 1 1 0 0 1 1
## 2 1 1 0 0 1 1 1 0
#inspect(toy_dtm_train[1:rownumber, 1:colnumber])
toy_data_train <- cbind(as.matrix(toy_dtm_train), c(0, 1))
colnames(toy_data_train)[ncol(toy_data_train)] <- 'y'
toy_data_train <- as.data.frame(toy_data_train)
toy_data_train$y <- as.factor(toy_data_train$y)
toy_data_train
fit_toy <- train(y ~ ., data = toy_data_train, method = 'bayesglm')
summary(fit_toy)
##
## Call:
## NULL
##
## Deviance Residuals:
## [1] 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3102 4.6597 -0.067 0.947
## big 0.6204 2.3431 0.265 0.791
## bone 0.6204 2.3431 0.265 0.791
## cat -0.6204 2.3431 -0.265 0.791
## chase -0.6204 2.3431 -0.265 0.791
## dog 0.6204 2.3431 0.265 0.791
## eat 0.6204 2.3431 0.265 0.791
## like 0.0000 2.5000 0.000 1.000
## mice -0.6204 2.3431 -0.265 0.791
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.77259 on 1 degrees of freedom
## Residual deviance: 0.43185 on -7 degrees of freedom
## AIC: 18.432
##
## Number of Fisher Scoring iterations: 20
predict(fit_toy)
## [1] 0 1
## Levels: 0 1
Here, you see that the model perfectly predicts the category for our training data.
toy_test_sentence <- c('Rats eat cheese.')
# convert to vector space model
toy_corpus_test <- VCorpus(VectorSource(toy_test_sentence))
# convert to a dtm
toy_dtm_test <- DocumentTermMatrix(toy_corpus_test,
control = list(dictionary = Terms(toy_dtm_train), # remember that we need to use the terminology from the training data
removePunctuation = TRUE,
stopwords = TRUE,
stemming = TRUE,
removeNumbers = TRUE))
toy_test_data <- as.matrix(toy_dtm_test)
# predict category for the test data
predict(fit_toy, newdata = toy_test_data)
## [1] 1
## Levels: 0 1
# create a dtm object
tou_dtm_train_tfidf <- DocumentTermMatrix(toy_corpus_train,
list(weighting = weightTfIdf,
removePunctuation = TRUE,
stopwords = TRUE,
stemming = TRUE,
removeNumbers = TRUE))
# create the y variable
toy_train_data_tfidf <-cbind(as.matrix(tou_dtm_train_tfidf), c(0, 1))
colnames(toy_train_data_tfidf)[ncol(toy_train_data_tfidf)] <- 'y'
toy_train_data_tfidf <- as.data.frame(toy_train_data_tfidf)
toy_train_data_tfidf$y <- as.factor(toy_train_data_tfidf$y)
toy_train_data_tfidf
# fit the naive Bayes model with the tfidf representation
toy_fit_tfidf <- train(y ~ ., data = toy_train_data_tfidf, method = 'bayesglm')
summary(toy_fit_tfidf)
##
## Call:
## NULL
##
## Deviance Residuals:
## [1] 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3102 3.9294 -0.079 0.937
## big 3.1021 11.7155 0.265 0.791
## bone 3.1021 11.7155 0.265 0.791
## cat -2.4817 9.3724 -0.265 0.791
## chase -2.4817 9.3724 -0.265 0.791
## dog 3.1021 11.7155 0.265 0.791
## eat 3.1021 11.7155 0.265 0.791
## like 0.0000 2.5000 0.000 1.000
## mice -2.4817 9.3724 -0.265 0.791
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.77259 on 1 degrees of freedom
## Residual deviance: 0.43185 on -7 degrees of freedom
## AIC: 18.432
##
## Number of Fisher Scoring iterations: 20
# prediction on training data
predict(toy_fit_tfidf)
## [1] 0 1
## Levels: 0 1
# prediction on test data
# convert to a dtm
tou_dtm_test_tfidf <- DocumentTermMatrix(toy_corpus_test,
control = list(dictionary = Terms(tou_dtm_train_tfidf), # remember that we need to use the vocabulary from the training data
weighting = weightTfIdf,
removePunctuation = TRUE,
stopwords = TRUE,
stemming = TRUE,
removeNumbers = TRUE))
toy_test_data_tfidf <- as.matrix(tou_dtm_test_tfidf)
# predict category for the test data
predict(toy_fit_tfidf, newdata = toy_test_data_tfidf)
## [1] 0
## Levels: 0 1
The data set used in this part of the practical is the BBC News data set. The raw data set can be downloaded from here. It consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004 to 2005. These areas are:
Note that the loaded data set object is called df_final
.
load("data/news_dataset.rda")
head(df_final)
#can take a subset (e.g. 20% of the original data set) to reduce computational time
set.seed(123)
df_final <- sample_frac(df_final, 0.2)
# list of the categories in the data set
unique(df_final$Category)
## [1] "entertainment" "business" "tech" "politics"
## [5] "sport"
table(df_final$Category)
##
## business entertainment politics sport tech
## 95 69 85 108 88
corpus_BBC <- Corpus(VectorSource(df_final$Content))
# standardize to lowercase
corpus_BBC <- tm_map(corpus_BBC, content_transformer(tolower))
# returns error invalid multibyte string 1512
# Two possible solutions:
# 1) Omission of the whole observation
# df_final <- df_final[-1512,]
# 2) Omission of the string "\xa315.8m" from that specifc text
# df_final[1512,][2] <- gsub("\xa315.8m", "", df_final[1512,][2])
# then rerun from line 255.
# remove tm stopwords
corpus_BBC <- tm_map(corpus_BBC, removeWords, stopwords())
# standardize whitespaces
corpus_BBC <- tm_map(corpus_BBC, stripWhitespace)
# remove punctuation
corpus_BBC <- tm_map(corpus_BBC, removePunctuation)
# remove numbers
corpus_BBC <- tm_map(corpus_BBC, removeNumbers)
dtm_BBC <- DocumentTermMatrix(corpus_BBC)
# words appearing more than 10x
features_BBC <- findFreqTerms(dtm_BBC, 10)
head(features_BBC)
## [1] "able" "achieved" "age" "ago" "air" "alongside"
## set the seed to make your partition reproducible
set.seed(123)
train_idx <- createDataPartition(df_final$Category, p=0.80, list=FALSE)
# set for the original raw data
train_BBC_raw <- df_final[train_idx,]
test_BBC_raw <- df_final[-train_idx,]
# set for the cleaned-up data
train_BBC_clean <- corpus_BBC[train_idx]
test_BBC_clean <- corpus_BBC[-train_idx]
dtm_BBC_train <- DocumentTermMatrix(train_BBC_clean, list(dictionary=features_BBC))
dtm_BBC_test <- DocumentTermMatrix(test_BBC_clean, list(dictionary=features_BBC))
dtm_BBC_train <- as.data.frame(as.matrix(dtm_BBC_train))
dtm_BBC_test <- as.data.frame(as.matrix(dtm_BBC_test))
str(dtm_BBC_test)
## 'data.frame': 87 obs. of 2025 variables:
## $ able : num 1 0 0 0 0 0 0 0 0 0 ...
## $ achieved : num 1 0 0 0 0 0 0 0 0 0 ...
## $ age : num 4 0 0 0 0 0 0 1 0 0 ...
## $ ago : num 1 0 0 0 0 0 0 0 0 0 ...
## $ air : num 1 0 0 0 1 0 0 0 0 0 ...
## $ alongside : num 1 0 0 0 0 0 0 0 0 0 ...
## $ ambition : num 1 0 0 0 0 0 0 1 0 0 ...
## $ april : num 2 0 0 0 0 0 0 0 0 0 ...
## $ around : num 1 0 0 0 0 0 0 0 0 0 ...
## $ attention : num 1 0 0 0 0 0 0 1 0 0 ...
## $ attract : num 1 0 0 0 0 0 0 1 0 0 ...
## $ became : num 1 0 0 0 0 0 1 0 0 0 ...
## $ become : num 1 1 0 0 0 0 0 0 0 0 ...
## $ began : num 1 0 0 0 0 0 0 0 0 0 ...
## $ beyond : num 1 1 0 0 0 0 0 0 0 0 ...
## $ book : num 4 0 0 0 0 0 0 0 0 0 ...
## $ bought : num 1 0 0 0 0 0 0 0 0 0 ...
## $ broadcasting : num 1 0 0 0 0 0 0 0 0 0 ...
## $ calling : num 1 0 0 0 1 0 0 0 0 0 ...
## $ came : num 1 0 0 1 0 0 0 0 0 0 ...
## $ certainly : num 1 0 0 0 0 0 0 0 0 0 ...
## $ child : num 1 0 0 0 0 1 0 0 0 0 ...
## $ children : num 1 0 0 0 0 5 0 0 0 0 ...
## $ comedy : num 1 0 0 0 0 0 0 0 0 0 ...
## $ continued : num 1 0 0 0 0 0 0 0 0 0 ...
## $ countries : num 1 0 0 1 0 0 0 0 0 0 ...
## $ couple : num 1 0 0 0 0 0 0 0 0 0 ...
## $ critics : num 1 0 0 1 0 0 0 1 0 0 ...
## $ debut : num 1 0 0 0 0 0 0 0 0 0 ...
## $ decided : num 1 0 0 0 0 0 0 0 0 0 ...
## $ despite : num 1 2 0 1 0 0 0 0 0 0 ...
## $ died : num 2 0 0 0 0 0 0 0 0 0 ...
## $ disaster : num 2 0 0 0 0 0 0 0 0 0 ...
## $ early : num 1 0 0 0 0 0 0 1 0 0 ...
## $ east : num 1 0 0 0 0 0 0 0 0 0 ...
## $ events : num 1 0 0 0 0 1 0 0 0 0 ...
## $ eventually : num 1 0 0 0 0 0 0 0 0 0 ...
## $ feature : num 1 0 0 0 0 0 0 0 0 0 ...
## $ fell : num 1 0 1 0 0 0 0 0 0 0 ...
## $ film : num 2 0 0 0 0 0 0 0 0 0 ...
## $ final : num 2 0 0 0 0 0 0 1 5 0 ...
## $ finance : num 1 1 0 0 0 0 0 0 0 0 ...
## $ first : num 2 0 0 1 0 0 0 1 1 0 ...
## $ flight : num 4 0 0 0 5 0 0 0 0 0 ...
## $ flying : num 1 0 0 0 1 0 0 0 0 0 ...
## $ followed : num 3 0 0 0 0 1 0 0 0 0 ...
## $ following : num 1 0 0 0 0 0 0 1 0 0 ...
## $ follows : num 1 0 0 0 0 0 0 0 0 0 ...
## $ force : num 1 0 0 0 0 0 0 1 0 0 ...
## $ former : num 1 0 0 0 0 0 0 0 0 0 ...
## $ four : num 1 0 1 0 0 0 1 1 1 0 ...
## $ full : num 1 0 0 0 0 1 0 0 0 0 ...
## $ got : num 1 0 0 0 0 1 0 0 0 0 ...
## $ greatest : num 1 0 0 0 0 0 0 0 0 0 ...
## $ ground : num 1 0 0 0 1 0 0 0 0 0 ...
## $ happen : num 1 0 0 0 0 0 0 0 0 0 ...
## $ health : num 1 0 0 0 0 1 1 0 0 0 ...
## $ heart : num 1 0 0 0 0 0 1 0 0 0 ...
## $ high : num 2 0 0 0 0 0 0 2 0 0 ...
## $ hit : num 2 0 1 0 0 0 0 0 0 0 ...
## $ hollywood : num 2 0 0 0 0 0 0 0 0 0 ...
## $ home : num 1 0 0 1 0 0 1 0 0 0 ...
## $ hotel : num 1 0 0 0 0 0 0 0 0 0 ...
## $ industry : num 1 0 0 0 2 0 0 1 0 0 ...
## $ james : num 1 0 0 0 0 0 0 0 0 0 ...
## $ just : num 1 0 0 2 0 0 0 1 0 2 ...
## $ keep : num 1 0 0 0 0 0 0 0 0 0 ...
## $ king : num 1 0 0 0 0 0 0 0 0 0 ...
## $ known : num 2 0 0 0 0 0 0 0 0 0 ...
## $ later : num 4 0 0 0 0 1 0 0 0 0 ...
## $ life : num 2 0 0 0 0 1 0 0 0 0 ...
## $ lists : num 1 0 0 0 0 0 0 0 0 0 ...
## $ manager : num 1 0 0 0 0 0 0 0 0 0 ...
## $ martin : num 1 0 0 0 0 0 0 0 0 0 ...
## $ middle : num 1 0 0 0 0 1 0 0 0 0 ...
## $ months : num 1 0 1 0 0 0 0 1 0 1 ...
## $ moved : num 1 0 0 0 0 0 0 0 0 0 ...
## $ movie : num 2 0 0 0 0 0 0 0 0 0 ...
## $ name : num 1 0 0 0 0 0 0 0 0 0 ...
## $ national : num 1 1 0 1 0 0 0 0 0 0 ...
## $ never : num 2 0 0 0 0 0 0 1 0 0 ...
## $ new : num 1 0 0 0 0 0 1 2 1 0 ...
## $ nine : num 1 0 0 0 0 0 0 0 0 0 ...
## $ parents : num 1 0 0 0 0 4 0 0 0 0 ...
## $ places : num 1 0 0 0 0 1 0 0 0 0 ...
## $ popular : num 1 0 0 0 0 0 0 1 0 0 ...
## $ popularity : num 1 0 0 0 0 0 0 0 0 0 ...
## $ produce : num 2 0 0 0 0 0 0 0 0 0 ...
## $ producers : num 1 0 0 0 0 0 0 0 0 0 ...
## $ prompted : num 1 0 0 0 0 0 0 0 0 0 ...
## $ proved : num 1 0 1 0 0 0 1 0 0 0 ...
## $ published : num 1 0 0 0 0 0 0 0 0 0 ...
## $ quickly : num 1 0 1 0 0 0 1 0 0 0 ...
## $ real : num 1 0 0 0 0 0 0 0 0 0 ...
## $ recent : num 1 1 0 0 0 0 1 3 0 0 ...
## $ remains : num 1 0 0 0 0 0 0 2 0 0 ...
## $ research : num 1 1 0 0 0 0 0 0 0 0 ...
## $ response : num 1 0 1 0 0 0 1 0 0 0 ...
## $ return : num 1 0 0 0 0 0 0 0 0 0 ...
## [list output truncated]
freq <- colSums(dtm_BBC_train)
high_freq <- tail(sort(freq), n = 20)
hfp_df <- as.data.frame(sort(high_freq))
hfp_df$names <- rownames(hfp_df)
ggplot(hfp_df, aes(reorder(names, high_freq), high_freq)) +
geom_bar(stat="identity") +
coord_flip() +
xlab("Terms") +
ylab("Frequency") +
ggtitle("Term frequencies")
dtm_BBC_train <- cbind(cat=factor(train_BBC_raw$Category), dtm_BBC_train)
dtm_BBC_test <- cbind(cat=factor(test_BBC_raw$Category), dtm_BBC_test)
dtm_BBC_train<-as.data.frame(dtm_BBC_train)
dtm_BBC_test<-as.data.frame(dtm_BBC_test)
# gc()
# Naive Bayes with Laplace smoothing
fit_BBC_nb <- naiveBayes(cat~., data=dtm_BBC_train, laplace = 1)
summary(fit_BBC_nb)
## Length Class Mode
## apriori 5 table numeric
## tables 2025 -none- list
## levels 5 -none- character
## isnumeric 2025 -none- logical
## call 4 -none- call
pred_BBC_nb_train <- predict(fit_BBC_nb, na.omit(dtm_BBC_train))
fit_BBC_nb_table <- table(na.omit(dtm_BBC_train$cat), pred_BBC_nb_train, dnn=c("Actual", "Predicted"))
fit_BBC_nb_table
## Predicted
## Actual business entertainment politics sport tech
## business 29 47 0 0 0
## entertainment 0 56 0 0 0
## politics 0 46 21 1 0
## sport 0 41 0 46 0
## tech 0 47 1 3 20
The predict function allows you to specify whether you want the most probable class or if you want to get the probability for every class. Nothing changes with the exception being the type parameter is set to “raw”.
fit_BBC_svm <- svm(cat~., data=dtm_BBC_train)
summary(fit_BBC_svm)
##
## Call:
## svm(formula = cat ~ ., data = dtm_BBC_train)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 336
##
## ( 63 60 56 83 74 )
##
##
## Number of Classes: 5
##
## Levels:
## business entertainment politics sport tech
pred_BBC_svm_train <- predict(fit_BBC_svm, na.omit(dtm_BBC_train))
fit_BBC_svm_table <- table(na.omit(dtm_BBC_train$cat), pred_BBC_svm_train, dnn=c("Actual", "Predicted"))
fit_BBC_svm_table
## Predicted
## Actual business entertainment politics sport tech
## business 75 0 0 1 0
## entertainment 5 30 0 21 0
## politics 2 0 55 11 0
## sport 0 0 0 87 0
## tech 1 0 0 9 61
# prediction on test data
pred_BBC_nb_test <- predict(fit_BBC_nb, na.omit(dtm_BBC_test))
fit_BBC_nb_test <- table(na.omit(dtm_BBC_test$cat), pred_BBC_nb_test, dnn=c("Actual", "Predicted"))
fit_BBC_nb_test
## Predicted
## Actual business entertainment politics sport tech
## business 3 15 1 0 0
## entertainment 0 13 0 0 0
## politics 0 13 2 2 0
## sport 0 11 0 10 0
## tech 0 15 0 1 1
# prediction on test data
pred_BBC_svm_test <- predict(fit_BBC_svm, na.omit(dtm_BBC_test))
fit_BBC_svm_test <- table(na.omit(dtm_BBC_test$cat), pred_BBC_svm_test, dnn=c("Actual", "Predicted"))
fit_BBC_svm_test
## Predicted
## Actual business entertainment politics sport tech
## business 14 0 2 2 1
## entertainment 1 3 0 9 0
## politics 2 0 13 2 0
## sport 0 0 0 21 0
## tech 1 0 0 5 11
The baseline performance (accuracy) is:
table(dtm_BBC_train$cat)
##
## business entertainment politics sport tech
## 76 56 68 87 71
cat("\n The baseline prediction accuracy: ")
##
## The baseline prediction accuracy:
mean(na.omit(dtm_BBC_test$cat) == "sport")
## [1] 0.2413793
confusionMatrix(fit_BBC_nb_test)
## Confusion Matrix and Statistics
##
## Predicted
## Actual business entertainment politics sport tech
## business 3 15 1 0 0
## entertainment 0 13 0 0 0
## politics 0 13 2 2 0
## sport 0 11 0 10 0
## tech 0 15 0 1 1
##
## Overall Statistics
##
## Accuracy : 0.3333
## 95% CI : (0.2358, 0.4425)
## No Information Rate : 0.7701
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.199
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: business Class: entertainment Class: politics
## Sensitivity 1.00000 0.1940 0.66667
## Specificity 0.80952 1.0000 0.82143
## Pos Pred Value 0.15789 1.0000 0.11765
## Neg Pred Value 1.00000 0.2703 0.98571
## Prevalence 0.03448 0.7701 0.03448
## Detection Rate 0.03448 0.1494 0.02299
## Detection Prevalence 0.21839 0.1494 0.19540
## Balanced Accuracy 0.90476 0.5970 0.74405
## Class: sport Class: tech
## Sensitivity 0.7692 1.00000
## Specificity 0.8514 0.81395
## Pos Pred Value 0.4762 0.05882
## Neg Pred Value 0.9545 1.00000
## Prevalence 0.1494 0.01149
## Detection Rate 0.1149 0.01149
## Detection Prevalence 0.2414 0.19540
## Balanced Accuracy 0.8103 0.90698
confusionMatrix(fit_BBC_svm_test)
## Confusion Matrix and Statistics
##
## Predicted
## Actual business entertainment politics sport tech
## business 14 0 2 2 1
## entertainment 1 3 0 9 0
## politics 2 0 13 2 0
## sport 0 0 0 21 0
## tech 1 0 0 5 11
##
## Overall Statistics
##
## Accuracy : 0.7126
## 95% CI : (0.6057, 0.8046)
## No Information Rate : 0.4483
## P-Value [Acc > NIR] : 5.473e-07
##
## Kappa : 0.632
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: business Class: entertainment Class: politics
## Sensitivity 0.7778 1.00000 0.8667
## Specificity 0.9275 0.88095 0.9444
## Pos Pred Value 0.7368 0.23077 0.7647
## Neg Pred Value 0.9412 1.00000 0.9714
## Prevalence 0.2069 0.03448 0.1724
## Detection Rate 0.1609 0.03448 0.1494
## Detection Prevalence 0.2184 0.14943 0.1954
## Balanced Accuracy 0.8527 0.94048 0.9056
## Class: sport Class: tech
## Sensitivity 0.5385 0.9167
## Specificity 1.0000 0.9200
## Pos Pred Value 1.0000 0.6471
## Neg Pred Value 0.7273 0.9857
## Prevalence 0.4483 0.1379
## Detection Rate 0.2414 0.1264
## Detection Prevalence 0.2414 0.1954
## Balanced Accuracy 0.7692 0.9183
In this practical, we learned about:
Here we have some additional guide on:
how to change your stopwords list for another language
how to modify the stopwords list
how to extract n-grams for the dtm
toy_data_train <- c('Cats like to chase mice.',
'Dogs like to eat big bones.')
# convert data to vector space model
toy_corpus_train <- VCorpus(VectorSource(toy_data_train))
# create a dtm object
toy_dtm_train <- DocumentTermMatrix(toy_corpus_train,
list(removePunctuation = TRUE,
stopwords = c(stopwords("english"), "like"),
stemming = TRUE,
removeNumbers = TRUE))
data.frame(as.matrix(toy_dtm_train))
library(RWeka)
tokenize_n_grams <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
# create an n-gram based dtm object
toy_tdtm_train <- DocumentTermMatrix(toy_corpus_train,
list(tokenize=tokenize_n_grams,
removePunctuation = TRUE,
stopwords = c(stopwords("english"), "like"),
stemming = TRUE,
removeNumbers = TRUE))
data.frame(as.matrix(toy_tdtm_train))
End of Practical