Introduction

Welcome to the third practical of the course “Introduction to Text Mining with R”. In this practical we will start by creating a pipeline for text classification using simple training and test sets. In the second part of the practical, we will train machine learning models on a news dataset and compare their performance.

In this practical we are going to use the following packages:

library(caret)
library(tidyverse)
library(tm)
library(e1071)

Document-term matrix

A document-term matrix (dtm) represents a corpus as a matrix: each row corresponds to a document, each column to a term, and each cell records how often (or how heavily weighted) that term occurs in that document. The exercises below build such matrices with the tm package.

A simple pipeline


  1. In this part, we will use two sentences to train a model and one sentence to test it. Assume our training data consists of two sentences:
  • ‘Cats like to chase mice.’
  • ‘Dogs like to eat big bones.’
    Use these sentences as the training set and create a document-term matrix (dtm). Also remove punctuation, stopwords, and numbers, and apply stemming. Hint: use the tm package; to create a dtm you first need to create a vector space model (a corpus).

toy_data_train <- c('Cats like to chase mice.', 
                    'Dogs like to eat big bones.')

# convert data to vector space model
toy_corpus_train <- VCorpus(VectorSource(toy_data_train))

# create a dtm object
toy_dtm_train <- DocumentTermMatrix(toy_corpus_train, 
                                    list(removePunctuation = TRUE, 
                                         stopwords = TRUE, 
                                         stemming = TRUE, 
                                         removeNumbers = TRUE))

  2. Inspect the toy_dtm_train object.

inspect(toy_dtm_train)
## <<DocumentTermMatrix (documents: 2, terms: 8)>>
## Non-/sparse entries: 9/7
## Sparsity           : 44%
## Maximal term length: 5
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs big bone cat chase dog eat like mice
##    1   0    0   1     1   0   0    1    1
##    2   1    1   0     0   1   1    1    0
#inspect(toy_dtm_train[1:rownumber, 1:colnumber])

  3. Convert the toy_dtm_train object to a matrix and add a ‘y’ column as the outcome variable, which is 0 for the first row and 1 for the second row. Then convert the result to a data frame and turn y into a factor.

toy_data_train <- cbind(as.matrix(toy_dtm_train), c(0, 1))

colnames(toy_data_train)[ncol(toy_data_train)] <- 'y'
toy_data_train <- as.data.frame(toy_data_train)
toy_data_train$y <- as.factor(toy_data_train$y)
toy_data_train

  4. From the ‘caret’ package, use the train function with method = ‘bayesglm’ to fit a model on the toy_data_train data frame. For ‘bayesglm’, you may need to install the ‘arm’ package, a library for data analysis using regression and multilevel/hierarchical models.

fit_toy <- train(y ~ ., data = toy_data_train, method = 'bayesglm')
summary(fit_toy)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
## [1]  0  0
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.3102     4.6597  -0.067    0.947
## big           0.6204     2.3431   0.265    0.791
## bone          0.6204     2.3431   0.265    0.791
## cat          -0.6204     2.3431  -0.265    0.791
## chase        -0.6204     2.3431  -0.265    0.791
## dog           0.6204     2.3431   0.265    0.791
## eat           0.6204     2.3431   0.265    0.791
## like          0.0000     2.5000   0.000    1.000
## mice         -0.6204     2.3431  -0.265    0.791
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2.77259  on  1  degrees of freedom
## Residual deviance: 0.43185  on -7  degrees of freedom
## AIC: 18.432
## 
## Number of Fisher Scoring iterations: 20

  5. Use the fitted model to predict the outcome on the training data.

predict(fit_toy)
## [1] 0 1
## Levels: 0 1

Here, you see that the model perfectly predicts the category for our training data.
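As a quick numeric check (a small sketch using the objects created above), you can cross-tabulate the fitted classes against the true labels:

# predictions on the training data versus the true labels
table(Predicted = predict(fit_toy), Actual = toy_data_train$y)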


  6. Use the sentence “Rats eat cheese.” as your test data. What does your model predict for it, and which steps do you need to take to obtain that prediction?

toy_test_sentence <- c('Rats eat cheese.')

# convert to vector space model
toy_corpus_test <- VCorpus(VectorSource(toy_test_sentence))

# convert to a dtm
toy_dtm_test <- DocumentTermMatrix(toy_corpus_test, 
                                   control = list(dictionary = Terms(toy_dtm_train), # we need to use the terms (vocabulary) from the training data
                                                  removePunctuation = TRUE, 
                                                  stopwords = TRUE, 
                                                  stemming = TRUE, 
                                                  removeNumbers = TRUE))
toy_test_data <- as.matrix(toy_dtm_test)

# predict category for the test data
predict(fit_toy, newdata = toy_test_data)
## [1] 1
## Levels: 0 1
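Passing dictionary = Terms(toy_dtm_train) is what keeps the test features aligned with the columns the model was trained on. To see why this matters, you can build the test dtm without the dictionary and compare its terms (a small check, not part of the exercise):

# without the training dictionary, the test dtm gets its own vocabulary
Terms(DocumentTermMatrix(toy_corpus_test, 
                         list(removePunctuation = TRUE, stopwords = TRUE, 
                              stemming = TRUE, removeNumbers = TRUE)))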

  7. Now we want to use tf-idf weighting to create our dtm object. Remember to remove punctuation, stopwords, and numbers, and to apply stemming. What is different in this new tf-idf-based dtm, and in your prediction? Repeat the analysis of Q.3 to Q.6 with the new object.

# create a dtm object
toy_dtm_train_tfidf <- DocumentTermMatrix(toy_corpus_train, 
                                          list(weighting = weightTfIdf,
                                               removePunctuation = TRUE, 
                                               stopwords = TRUE, 
                                               stemming = TRUE, 
                                               removeNumbers = TRUE))

# create the y variable
toy_train_data_tfidf <- cbind(as.matrix(toy_dtm_train_tfidf), c(0, 1))
colnames(toy_train_data_tfidf)[ncol(toy_train_data_tfidf)] <- 'y'
toy_train_data_tfidf <- as.data.frame(toy_train_data_tfidf)
toy_train_data_tfidf$y <- as.factor(toy_train_data_tfidf$y)
toy_train_data_tfidf

# fit the Bayesian GLM (bayesglm) model with the tf-idf representation
toy_fit_tfidf <- train(y ~ ., data = toy_train_data_tfidf, method = 'bayesglm')
summary(toy_fit_tfidf)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
## [1]  0  0
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.3102     3.9294  -0.079    0.937
## big           3.1021    11.7155   0.265    0.791
## bone          3.1021    11.7155   0.265    0.791
## cat          -2.4817     9.3724  -0.265    0.791
## chase        -2.4817     9.3724  -0.265    0.791
## dog           3.1021    11.7155   0.265    0.791
## eat           3.1021    11.7155   0.265    0.791
## like          0.0000     2.5000   0.000    1.000
## mice         -2.4817     9.3724  -0.265    0.791
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2.77259  on  1  degrees of freedom
## Residual deviance: 0.43185  on -7  degrees of freedom
## AIC: 18.432
## 
## Number of Fisher Scoring iterations: 20

# prediction on training data
predict(toy_fit_tfidf)
## [1] 0 1
## Levels: 0 1

# prediction on test data
# convert to a dtm
toy_dtm_test_tfidf <- DocumentTermMatrix(toy_corpus_test, 
                                        control = list(dictionary = Terms(toy_dtm_train_tfidf), # remember that we need to use the vocabulary from the training data
                                                       weighting = weightTfIdf,
                                                       removePunctuation = TRUE, 
                                                       stopwords = TRUE, 
                                                       stemming = TRUE, 
                                                       removeNumbers = TRUE))
toy_test_data_tfidf <- as.matrix(toy_dtm_test_tfidf)

# predict category for the test data
predict(toy_fit_tfidf, newdata = toy_test_data_tfidf)
## [1] 0
## Levels: 0 1
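To see what has changed, compare the raw counts with the tf-idf weights (a quick check using the objects above). By default, weightTfIdf in tm multiplies the length-normalised term frequency by log2(N/df), so a term such as “like” that occurs in both training documents gets an idf of log2(2/2) = 0 and therefore a weight of 0; this matches its coefficient of 0 in the summary above. Note also that the test corpus contains only one document, so every idf there is log2(1/1) = 0 and the whole test vector becomes zero, which is why the tf-idf-based prediction (0) differs from the count-based one (1).

# raw term frequencies
as.matrix(toy_dtm_train)
# tf-idf weights: the term "like", which occurs in every document, gets weight 0
as.matrix(toy_dtm_train_tfidf)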

Compare architectures: Naïve Bayes versus SVM

The data set used in this part of the practical is the BBC News data set; the raw data set is publicly available online. It consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004 to 2005. These areas are:

  • Business
  • Entertainment
  • Politics
  • Sport
  • Tech

  8. Load the data set from the data folder and inspect its first rows.

Note that the loaded data set object is called df_final.


load("data/news_dataset.rda")
head(df_final)

# you can take a subset (e.g. 20% of the original data set) to reduce computation time
set.seed(123)
df_final <- sample_frac(df_final, 0.2)

  9. Find out the names of the categories and the number of observations in each of them.

# list of the categories in the data set
unique(df_final$Category)
## [1] "entertainment" "business"      "tech"          "politics"     
## [5] "sport"
table(df_final$Category)
## 
##      business entertainment      politics         sport          tech 
##            95            69            85           108            88
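If you prefer proportions to counts, prop.table gives the relative frequency of each category:

prop.table(table(df_final$Category))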

  10. Create a document-term matrix for the entire data set and save the terms with a frequency of at least 10 in a new variable named features. In the process, convert the words to lowercase, remove punctuation, numbers, stopwords, and extra whitespace, and clean up any other data issues.

corpus_BBC <- Corpus(VectorSource(df_final$Content))

# standardize to lowercase
corpus_BBC <- tm_map(corpus_BBC, content_transformer(tolower)) 
# this step may return the error "invalid multibyte string 1512"
# Two possible solutions: 
# 1) omit the whole observation:
# df_final <- df_final[-1512,]
# 2) omit the string "\xa315.8m" from that specific text:
# df_final[1512,][2] <- gsub("\xa315.8m", "", df_final[1512,][2])
# After applying a fix, rerun the code above starting from the creation of corpus_BBC.
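# Alternatively (a sketch, not part of the original practical): convert the whole
# Content column to valid UTF-8 before building the corpus, dropping invalid bytes:
# df_final$Content <- iconv(df_final$Content, from = "", to = "UTF-8", sub = "")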

# remove tm stopwords
corpus_BBC <- tm_map(corpus_BBC, removeWords, stopwords())
# standardize whitespaces
corpus_BBC <- tm_map(corpus_BBC, stripWhitespace)
# remove punctuation
corpus_BBC <- tm_map(corpus_BBC, removePunctuation)
# remove numbers
corpus_BBC <- tm_map(corpus_BBC, removeNumbers)

dtm_BBC <- DocumentTermMatrix(corpus_BBC)

# words appearing more than 10x
features_BBC <- findFreqTerms(dtm_BBC, 10)

head(features_BBC)
## [1] "able"      "achieved"  "age"       "ago"       "air"       "alongside"

  11. Partition the original data and the corpus object into training and test sets, with 80% for the training set and 20% for the test set.

## set the seed to make your partition reproducible
set.seed(123)

train_idx <- createDataPartition(df_final$Category, p=0.80, list=FALSE)
# set for the original raw data 
train_BBC_raw <- df_final[train_idx,]
test_BBC_raw <- df_final[-train_idx,]
# set for the cleaned-up data
train_BBC_clean <- corpus_BBC[train_idx]
test_BBC_clean <- corpus_BBC[-train_idx]
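Because createDataPartition samples within each level of Category, the class distribution of the training set should closely match that of the full data; a quick check:

# class proportions in the full data versus the training split
round(prop.table(table(df_final$Category)), 2)
round(prop.table(table(train_BBC_raw$Category)), 2)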

  12. Create separate document-term matrices for the training and the test sets using the features variable as the dictionary, and convert them into data frames.

dtm_BBC_train <- DocumentTermMatrix(train_BBC_clean, list(dictionary=features_BBC))
dtm_BBC_test  <- DocumentTermMatrix(test_BBC_clean, list(dictionary=features_BBC))

dtm_BBC_train <- as.data.frame(as.matrix(dtm_BBC_train))
dtm_BBC_test <- as.data.frame(as.matrix(dtm_BBC_test))
str(dtm_BBC_test)
## 'data.frame':    87 obs. of  2025 variables:
##  $ able          : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ achieved      : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ age           : num  4 0 0 0 0 0 0 1 0 0 ...
##  $ ago           : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ air           : num  1 0 0 0 1 0 0 0 0 0 ...
##  $ alongside     : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ ambition      : num  1 0 0 0 0 0 0 1 0 0 ...
##  $ april         : num  2 0 0 0 0 0 0 0 0 0 ...
##  $ around        : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ attention     : num  1 0 0 0 0 0 0 1 0 0 ...
##  $ attract       : num  1 0 0 0 0 0 0 1 0 0 ...
##  $ became        : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ become        : num  1 1 0 0 0 0 0 0 0 0 ...
##  $ began         : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ beyond        : num  1 1 0 0 0 0 0 0 0 0 ...
##  $ book          : num  4 0 0 0 0 0 0 0 0 0 ...
##  $ bought        : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ broadcasting  : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ calling       : num  1 0 0 0 1 0 0 0 0 0 ...
##  $ came          : num  1 0 0 1 0 0 0 0 0 0 ...
##  $ certainly     : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ child         : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ children      : num  1 0 0 0 0 5 0 0 0 0 ...
##  $ comedy        : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ continued     : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ countries     : num  1 0 0 1 0 0 0 0 0 0 ...
##  $ couple        : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ critics       : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ debut         : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ decided       : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ despite       : num  1 2 0 1 0 0 0 0 0 0 ...
##  $ died          : num  2 0 0 0 0 0 0 0 0 0 ...
##  $ disaster      : num  2 0 0 0 0 0 0 0 0 0 ...
##  $ early         : num  1 0 0 0 0 0 0 1 0 0 ...
##  $ east          : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ events        : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ eventually    : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ feature       : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ fell          : num  1 0 1 0 0 0 0 0 0 0 ...
##  $ film          : num  2 0 0 0 0 0 0 0 0 0 ...
##  $ final         : num  2 0 0 0 0 0 0 1 5 0 ...
##  $ finance       : num  1 1 0 0 0 0 0 0 0 0 ...
##  $ first         : num  2 0 0 1 0 0 0 1 1 0 ...
##  $ flight        : num  4 0 0 0 5 0 0 0 0 0 ...
##  $ flying        : num  1 0 0 0 1 0 0 0 0 0 ...
##  $ followed      : num  3 0 0 0 0 1 0 0 0 0 ...
##  $ following     : num  1 0 0 0 0 0 0 1 0 0 ...
##  $ follows       : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ force         : num  1 0 0 0 0 0 0 1 0 0 ...
##  $ former        : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ four          : num  1 0 1 0 0 0 1 1 1 0 ...
##  $ full          : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ got           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ greatest      : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ ground        : num  1 0 0 0 1 0 0 0 0 0 ...
##  $ happen        : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ health        : num  1 0 0 0 0 1 1 0 0 0 ...
##  $ heart         : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ high          : num  2 0 0 0 0 0 0 2 0 0 ...
##  $ hit           : num  2 0 1 0 0 0 0 0 0 0 ...
##  $ hollywood     : num  2 0 0 0 0 0 0 0 0 0 ...
##  $ home          : num  1 0 0 1 0 0 1 0 0 0 ...
##  $ hotel         : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ industry      : num  1 0 0 0 2 0 0 1 0 0 ...
##  $ james         : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ just          : num  1 0 0 2 0 0 0 1 0 2 ...
##  $ keep          : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ king          : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ known         : num  2 0 0 0 0 0 0 0 0 0 ...
##  $ later         : num  4 0 0 0 0 1 0 0 0 0 ...
##  $ life          : num  2 0 0 0 0 1 0 0 0 0 ...
##  $ lists         : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ manager       : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ martin        : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ middle        : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ months        : num  1 0 1 0 0 0 0 1 0 1 ...
##  $ moved         : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ movie         : num  2 0 0 0 0 0 0 0 0 0 ...
##  $ name          : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ national      : num  1 1 0 1 0 0 0 0 0 0 ...
##  $ never         : num  2 0 0 0 0 0 0 1 0 0 ...
##  $ new           : num  1 0 0 0 0 0 1 2 1 0 ...
##  $ nine          : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ parents       : num  1 0 0 0 0 4 0 0 0 0 ...
##  $ places        : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ popular       : num  1 0 0 0 0 0 0 1 0 0 ...
##  $ popularity    : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ produce       : num  2 0 0 0 0 0 0 0 0 0 ...
##  $ producers     : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ prompted      : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ proved        : num  1 0 1 0 0 0 1 0 0 0 ...
##  $ published     : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ quickly       : num  1 0 1 0 0 0 1 0 0 0 ...
##  $ real          : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ recent        : num  1 1 0 0 0 0 1 3 0 0 ...
##  $ remains       : num  1 0 0 0 0 0 0 2 0 0 ...
##  $ research      : num  1 1 0 0 0 0 0 0 0 0 ...
##  $ response      : num  1 0 1 0 0 0 1 0 0 0 ...
##  $ return        : num  1 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]


  13. Show the 20 most frequent terms and their frequencies in a bar plot.

freq         <- colSums(dtm_BBC_train)
high_freq    <- tail(sort(freq), n = 20)
hfp_df       <- as.data.frame(sort(high_freq))
hfp_df$names <- rownames(hfp_df) 

ggplot(hfp_df, aes(reorder(names, high_freq), high_freq)) +
  geom_bar(stat="identity") + 
  coord_flip() + 
  xlab("Terms") + 
  ylab("Frequency") +
  ggtitle("Term frequencies")


  14. Use the cbind function to add the categories to the training and test data frames and name the new column cat.

dtm_BBC_train <- cbind(cat=factor(train_BBC_raw$Category), dtm_BBC_train)
dtm_BBC_test <- cbind(cat=factor(test_BBC_raw$Category), dtm_BBC_test)
dtm_BBC_train<-as.data.frame(dtm_BBC_train)
dtm_BBC_test<-as.data.frame(dtm_BBC_test)

  15. Fit a naive Bayes model on the training data set and name it fit_BBC_nb. Check the summary of the fitted model and predict the categories for the training data.

# gc()
# Naive Bayes with Laplace smoothing
fit_BBC_nb <- naiveBayes(cat~., data=dtm_BBC_train, laplace = 1)
summary(fit_BBC_nb)
##           Length Class  Mode     
## apriori      5   table  numeric  
## tables    2025   -none- list     
## levels       5   -none- character
## isnumeric 2025   -none- logical  
## call         4   -none- call

pred_BBC_nb_train <- predict(fit_BBC_nb, na.omit(dtm_BBC_train))
fit_BBC_nb_table <- table(na.omit(dtm_BBC_train$cat), pred_BBC_nb_train, dnn=c("Actual", "Predicted"))
fit_BBC_nb_table
##                Predicted
## Actual          business entertainment politics sport tech
##   business            29            47        0     0    0
##   entertainment        0            56        0     0    0
##   politics             0            46       21     1    0
##   sport                0            41        0    46    0
##   tech                 0            47        1     3   20

The predict function allows you to specify whether you want the most probable class or the probability for every class; nothing else changes, except that the type argument is set to “raw”.
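For example, to get the class probabilities from the fitted naive Bayes model (a small sketch using the objects created above):

# posterior class probabilities instead of the most probable class
head(predict(fit_BBC_nb, na.omit(dtm_BBC_test), type = "raw"))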


  16. Fit an SVM model on the training data set and name it fit_BBC_svm (the svm function from e1071 uses a radial kernel by default, as the summary below shows). Check the summary of the fitted model and predict the categories for the training data.

fit_BBC_svm <- svm(cat~., data=dtm_BBC_train)
summary(fit_BBC_svm)
## 
## Call:
## svm(formula = cat ~ ., data = dtm_BBC_train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  336
## 
##  ( 63 60 56 83 74 )
## 
## 
## Number of Classes:  5 
## 
## Levels: 
##  business entertainment politics sport tech

pred_BBC_svm_train <- predict(fit_BBC_svm, na.omit(dtm_BBC_train))
fit_BBC_svm_table <- table(na.omit(dtm_BBC_train$cat), pred_BBC_svm_train, dnn=c("Actual", "Predicted"))
fit_BBC_svm_table
##                Predicted
## Actual          business entertainment politics sport tech
##   business            75             0        0     1    0
##   entertainment        5            30        0    21    0
##   politics             2             0       55    11    0
##   sport                0             0        0    87    0
##   tech                 1             0        0     9   61
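Before moving to the test set, you can compare the two fits on the training data via the proportion of correct predictions, i.e. the sum of the diagonal of each confusion table divided by its total (a quick check using the tables created above):

# training accuracy of naive Bayes and of the SVM
sum(diag(fit_BBC_nb_table)) / sum(fit_BBC_nb_table)
sum(diag(fit_BBC_svm_table)) / sum(fit_BBC_svm_table)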

  17. Use the two fitted models to predict the categories of the test data and check their prediction performance.

# prediction on test data
pred_BBC_nb_test <- predict(fit_BBC_nb, na.omit(dtm_BBC_test))
fit_BBC_nb_test <- table(na.omit(dtm_BBC_test$cat), pred_BBC_nb_test, dnn=c("Actual", "Predicted"))
fit_BBC_nb_test
##                Predicted
## Actual          business entertainment politics sport tech
##   business             3            15        1     0    0
##   entertainment        0            13        0     0    0
##   politics             0            13        2     2    0
##   sport                0            11        0    10    0
##   tech                 0            15        0     1    1
# prediction on test data
pred_BBC_svm_test <- predict(fit_BBC_svm, na.omit(dtm_BBC_test))
fit_BBC_svm_test <- table(na.omit(dtm_BBC_test$cat), pred_BBC_svm_test, dnn=c("Actual", "Predicted"))
fit_BBC_svm_test
##                Predicted
## Actual          business entertainment politics sport tech
##   business            14             0        2     2    1
##   entertainment        1             3        0     9    0
##   politics             2             0       13     2    0
##   sport                0             0        0    21    0
##   tech                 1             0        0     5   11

  18. Calculate the Accuracy, Sensitivity, Specificity, Pos Pred Value, and Neg Pred Value for the test set using each of the two models.

As a reference point, the baseline accuracy of always predicting the most frequent category (here: sport) is:

table(dtm_BBC_train$cat)
## 
##      business entertainment      politics         sport          tech 
##            76            56            68            87            71
cat("\n The baseline prediction accuracy: ")
## 
##  The baseline prediction accuracy:
mean(na.omit(dtm_BBC_test$cat) == "sport")
## [1] 0.2413793
confusionMatrix(fit_BBC_nb_test)
## Confusion Matrix and Statistics
## 
##                Predicted
## Actual          business entertainment politics sport tech
##   business             3            15        1     0    0
##   entertainment        0            13        0     0    0
##   politics             0            13        2     2    0
##   sport                0            11        0    10    0
##   tech                 0            15        0     1    1
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3333          
##                  95% CI : (0.2358, 0.4425)
##     No Information Rate : 0.7701          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.199           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: business Class: entertainment Class: politics
## Sensitivity                  1.00000               0.1940         0.66667
## Specificity                  0.80952               1.0000         0.82143
## Pos Pred Value               0.15789               1.0000         0.11765
## Neg Pred Value               1.00000               0.2703         0.98571
## Prevalence                   0.03448               0.7701         0.03448
## Detection Rate               0.03448               0.1494         0.02299
## Detection Prevalence         0.21839               0.1494         0.19540
## Balanced Accuracy            0.90476               0.5970         0.74405
##                      Class: sport Class: tech
## Sensitivity                0.7692     1.00000
## Specificity                0.8514     0.81395
## Pos Pred Value             0.4762     0.05882
## Neg Pred Value             0.9545     1.00000
## Prevalence                 0.1494     0.01149
## Detection Rate             0.1149     0.01149
## Detection Prevalence       0.2414     0.19540
## Balanced Accuracy          0.8103     0.90698
confusionMatrix(fit_BBC_svm_test)
## Confusion Matrix and Statistics
## 
##                Predicted
## Actual          business entertainment politics sport tech
##   business            14             0        2     2    1
##   entertainment        1             3        0     9    0
##   politics             2             0       13     2    0
##   sport                0             0        0    21    0
##   tech                 1             0        0     5   11
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7126          
##                  95% CI : (0.6057, 0.8046)
##     No Information Rate : 0.4483          
##     P-Value [Acc > NIR] : 5.473e-07       
##                                           
##                   Kappa : 0.632           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: business Class: entertainment Class: politics
## Sensitivity                   0.7778              1.00000          0.8667
## Specificity                   0.9275              0.88095          0.9444
## Pos Pred Value                0.7368              0.23077          0.7647
## Neg Pred Value                0.9412              1.00000          0.9714
## Prevalence                    0.2069              0.03448          0.1724
## Detection Rate                0.1609              0.03448          0.1494
## Detection Prevalence          0.2184              0.14943          0.1954
## Balanced Accuracy             0.8527              0.94048          0.9056
##                      Class: sport Class: tech
## Sensitivity                0.5385      0.9167
## Specificity                1.0000      0.9200
## Pos Pred Value             1.0000      0.6471
## Neg Pred Value             0.7273      0.9857
## Prevalence                 0.4483      0.1379
## Detection Rate             0.2414      0.1264
## Detection Prevalence       0.2414      0.1954
## Balanced Accuracy          0.7692      0.9183
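Based on the output above, the SVM (accuracy 0.71) clearly outperforms both naive Bayes (0.33) and the baseline of always predicting the most frequent class (0.24) on this subset of the data. If you only need the overall accuracies, you can extract them directly from the confusionMatrix objects (a small convenience, using the tables created above):

confusionMatrix(fit_BBC_nb_test)$overall["Accuracy"]
confusionMatrix(fit_BBC_svm_test)$overall["Accuracy"]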

Summary


In this practical, we learned about:

  • Document-term matrix representation
  • tf and tf-idf weighting
  • Naive Bayes and SVM
  • Model comparison

Here is some additional guidance on:

  • how to use the stopword list for another language (see the stopwords() example below)

  • how to modify the stopwords list

  • how to extract n-grams for the dtm

toy_data_train <- c('Cats like to chase mice.', 
                    'Dogs like to eat big bones.')

# convert data to vector space model
toy_corpus_train <- VCorpus(VectorSource(toy_data_train))

# create a dtm object
toy_dtm_train <- DocumentTermMatrix(toy_corpus_train, 
                                    list(removePunctuation = TRUE, 
                                         stopwords = c(stopwords("english"), "like"), 
                                         stemming = TRUE, 
                                         removeNumbers = TRUE))

data.frame(as.matrix(toy_dtm_train))
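The stopwords() function also ships lists for several other languages, so switching language is just a matter of passing its name (Dutch is used here purely as an example). The block below then shows how to build a dtm of unigrams up to trigrams with the RWeka tokenizer.

# built-in stopword list for another language
head(stopwords("dutch"))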

library(RWeka)
# tokenizer that generates all 1- to 3-grams (unigrams, bigrams, trigrams)
tokenize_n_grams <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
# create an n-gram based dtm object
toy_dtm_train_ngram <- DocumentTermMatrix(toy_corpus_train, 
                                          list(tokenize = tokenize_n_grams,
                                               removePunctuation = TRUE, 
                                               stopwords = c(stopwords("english"), "like"), 
                                               stemming = TRUE, 
                                               removeNumbers = TRUE))
data.frame(as.matrix(toy_dtm_train_ngram))

End of Practical