
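In this practical, we will apply clustering methods to news articles to group them by content. We are going to use the following packages:

library(tm)        # VCorpus(), DocumentTermMatrix(), removeSparseTerms()
library(proxy)     # dist() with method = "cosine"
library(dbscan)    # hdbscan()
library(wordcloud) # wordcloud(), optional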


Read data

  1. The dataset that we are going to use is the BBC dataset from practical 3. This dataset consists of 2225 documents and 5 categories: business, entertainment, politics, sport, and tech. For text clustering and topic modeling, we will ignore the labels but we will use them while evaluating models. Load the Rda file of the news dataset.
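
The loading step itself is not shown here. As a minimal sketch, assuming the data were stored with save(): load() restores objects under the names they were saved with and returns those names, which is a quick way to check what a file contains. The temporary file and the df_demo data frame below are hypothetical stand-ins, not the practical's actual file; the code in this practical expects the restored data frame to be called df_final.

```r
# Hypothetical stand-in for the news data: a tiny data frame saved to a temp file
df_demo <- data.frame(
  Category = c("sport", "tech"),
  Content  = c("a match report", "a gadget review"),
  stringsAsFactors = FALSE
)
rda_path <- tempfile(fileext = ".rda")
save(df_demo, file = rda_path)
rm(df_demo)

# load() restores the saved objects and returns their names
loaded_names <- load(rda_path)
print(loaded_names)
str(df_demo)
```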

# Note: DocumentTermMatrix() in the next chunk fails on document 1512
# with the error "invalid multibyte string 1512".
# Two possible solutions:
# 1) Omit the whole observation:
# df_final <- df_final[-1512,]
# 2) Remove the string "\xa315.8m" from that specific text (preferred here):
df_final[1512,][2] <- gsub("\xa315.8m", "", df_final[1512,][2])
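
A more general way to handle such strings is to find every element that is not valid UTF-8 and re-encode it instead of deleting text. The sketch below uses a toy vector and assumes the offending byte is Latin-1 (where 0xA3 is the pound sign); that encoding is an assumption about the data, not something the dataset documents.

```r
# Toy vector: the first element contains the Latin-1 byte 0xA3 (pound sign)
x <- c("a deal worth \xa315.8m", "plain ascii text")

# Locate the elements that are not valid UTF-8
bad <- which(!validUTF8(x))
print(bad)

# Reinterpret the bytes as Latin-1 and convert everything to UTF-8
# (assumption: the stray bytes really are Latin-1)
fixed <- iconv(x, from = "latin1", to = "UTF-8")
print(validUTF8(fixed))
```

On the news data, the same two calls could be applied to df_final$Content before building the corpus.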

  1. First we must preprocess the corpus. Create a document-term matrix from the news column of the dataframe. Complete the following preprocessing steps:
  • convert to lowercase
  • remove stop words
  • remove numbers
  • remove punctuation
  • convert the dtm to a matrix

docs <- VCorpus(VectorSource(df_final$Content))

dtm <- DocumentTermMatrix(docs,
           control = list(tolower = TRUE,
                          removeNumbers = TRUE,
                          removePunctuation = TRUE,
                          stopwords = TRUE))
# We remove A LOT of features. R is natively very weak with high dimensional data
dtm_cut <- removeSparseTerms(dtm,  sparse = 0.93)
# with sparse = 0.93 the dtm will end up with 359 terms; you can adjust this number based on your available memory

# dtm <- as.matrix(dtm) # if you have a supercomputer you can continue with this object, otherwise use the dtm_cut
dtm_cut <- as.matrix(dtm_cut)
# you can also check the word cloud of a single document in dtm_cut (here document 5)
# wordcloud(colnames(dtm_cut), dtm_cut[5, ], max.words = 50)
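
To see what the sparse threshold does, here is a rough base-R sketch of the rule. It assumes that removeSparseTerms() keeps a term only when it occurs in more than a fraction (1 - sparse) of the documents; check the tm documentation for the exact definition. Under that assumption, sparse = 0.93 on 2225 documents keeps terms that appear in more than about 156 documents. The toy matrix and its term names are made up for illustration.

```r
# Toy document-term count matrix: 4 documents (rows), 3 terms (columns)
m <- rbind(
  c(1, 0, 2),
  c(1, 0, 0),
  c(3, 1, 0),
  c(0, 0, 1)
)
colnames(m) <- c("market", "goal", "film")

sparse <- 0.5

# Number of documents in which each term occurs at least once
doc_freq <- colSums(m > 0)

# Keep terms that occur in more than (1 - sparse) of the documents
keep <- doc_freq > nrow(m) * (1 - sparse)
m_cut <- m[, keep, drop = FALSE]
print(colnames(m_cut))
```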

Clustering methods

  1. Use the dist() function from the proxy library to calculate a distance matrix for the dtm_cut object with the cosine method. We will use this distance matrix for the clustering algorithms that take distances as input (hierarchical and density-based clustering).

# Cosine distance matrix
dist_matrix <- dist(dtm_cut, method = "cosine")
# If it takes a long time for your computer to create the distance matrix,
# load the precomputed dist_matrix that we made available for you.
# save(dist_matrix, file = "dist_matrix.RData")
# load("dist_matrix.RData")
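
For intuition, cosine distance is one minus the cosine of the angle between two document vectors, so it ignores document length and only compares the mix of terms. The following base-R sketch computes it by hand on a toy matrix; cosine_distance() is a helper written for this illustration, not a function from proxy, and it assumes no document is an all-zero row (which would give a division by zero).

```r
# Cosine distance between the rows of a numeric matrix
# (helper defined for this example only)
cosine_distance <- function(m) {
  norms <- sqrt(rowSums(m^2))
  similarity <- (m %*% t(m)) / outer(norms, norms)
  as.dist(1 - similarity)
}

# Three toy documents over two terms
toy <- rbind(a = c(1, 0), b = c(0, 1), c = c(2, 0))
d <- as.matrix(cosine_distance(toy))
print(round(d, 3))
```

Documents a and c point in the same direction and differ only in length, so their distance is zero, while a and b share no terms and are at the maximum distance of one for non-negative counts.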

  1. Now we can run a k-means clustering algorithm, starting out with three centers. Use the dtm_cut object as the input for kmeans. What does the output look like? Also check the cluster centers.

text_kmeans_clust3 <- kmeans(dtm_cut, centers = 3)
# inspect the structure of the result
str(text_kmeans_clust3)
## List of 9
##  $ cluster     : Named int [1:2225] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "names")= chr [1:2225] "1" "2" "3" "4" ...
##  $ centers     : num [1:3, 1:359] 0.127 0.193 0 0.287 0.208 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "1" "2" "3"
##   .. ..$ : chr [1:359] "â£bn" "â£m" "able" "according" ...
##  $ totss       : num 353497
##  $ withinss    : num [1:3] 162523 142640 17042
##  $ tot.withinss: num 322205
##  $ betweenss   : num 31292
##  $ size        : int [1:3] 1620 600 5
##  $ iter        : int 3
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

# print the results
# text_kmeans_clust3

# show the centers
# text_kmeans_clust3$centers

The output of kmeans is a list with several components. The most important are:

  • cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
  • centers: A matrix of cluster centers.
  • totss: The total sum of squares.
  • withinss: Vector of within-cluster sum of squares, one component per cluster.
  • tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
  • betweenss: The between-cluster sum of squares, i.e. totss - tot.withinss.
  • size: The number of points in each cluster.
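
The relationships between these components can be checked directly. The sketch below runs kmeans on simulated toy data (not the news data) and verifies that the sums of squares add up and that size is the tabulation of cluster. The seed and the nstart value are arbitrary choices for the illustration.

```r
set.seed(1)

# Toy data: two well-separated groups of 20 points in two dimensions
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
)

km <- kmeans(toy, centers = 2, nstart = 5)

# size counts how many points were assigned to each cluster
print(km$size)
print(table(km$cluster))

# totss splits into a within-cluster and a between-cluster part
print(all.equal(km$totss, km$tot.withinss + km$betweenss))
print(all.equal(sum(km$withinss), km$tot.withinss))
```

The same decomposition holds in the output above: 322205 + 31292 = 353497.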

  1. Apply a PCA with 2 components to the distance matrix, and then plot the output of the k-means clustering using the PCA outputs.

Principal Component Analysis (PCA) is a dimensionality-reduction method that transforms a large set of variables into a smaller set of components that still captures most of the variation in the data. This makes it useful for exploratory analysis, because a dataset with many variables can be visualized in two dimensions.

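# Running the PCA: cmdscale() performs classical multidimensional scaling
# (principal coordinates analysis) on the distance matrix
points <- cmdscale(dist_matrix, k = 2)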

kmeans_clusters <- text_kmeans_clust3$cluster
plot(points,
     main = 'K-Means clustering with 3 clusters',
     col = as.factor(kmeans_clusters),
     mai = c(0, 0, 0, 0),
     mar = c(0, 0, 0, 0),
     xaxt = 'n', yaxt = 'n',
     xlab = '', ylab = '')
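
A note on why cmdscale() can stand in for PCA: on Euclidean distances, classical multidimensional scaling gives the same coordinates as the principal component scores, up to the sign of each axis. The sketch below checks this on the built-in iris measurements. With the cosine distances used in this practical, the result is a principal coordinates analysis of those distances, not a PCA of dtm_cut itself.

```r
# Numeric part of the built-in iris data
X <- as.matrix(iris[, 1:4])

# Classical MDS on Euclidean distances, two dimensions
mds_points <- cmdscale(dist(X), k = 2)

# PCA scores for the first two components (centered, unscaled)
pca_points <- prcomp(X)$x[, 1:2]

# Axes may be flipped, so compare absolute values
same_up_to_sign <- all.equal(abs(mds_points), abs(pca_points),
                             check.attributes = FALSE)
print(same_up_to_sign)
```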

  1. There are different ways of choosing k. Let’s repeat the k-means clustering and the PCA plot with 4, 5 and 6 cluster centers and compare the visualizations.

text_kmeans_clust4 <- kmeans(dtm_cut, centers = 4)
kmeans_clusters <- text_kmeans_clust4$cluster
plot(points,
     main = 'K-Means clustering with 4 clusters',
     col = as.factor(kmeans_clusters),
     mai = c(0, 0, 0, 0),
     mar = c(0, 0, 0, 0),
     xaxt = 'n', yaxt = 'n',
     xlab = '', ylab = '')

text_kmeans_clust5 <- kmeans(dtm_cut, centers = 5)
kmeans_clusters <- text_kmeans_clust5$cluster
plot(points,
     main = 'K-Means clustering with 5 clusters',
     col = as.factor(kmeans_clusters),
     mai = c(0, 0, 0, 0),
     mar = c(0, 0, 0, 0),
     xaxt = 'n', yaxt = 'n',
     xlab = '', ylab = '')

text_kmeans_clust6 <- kmeans(dtm_cut, centers = 6)
kmeans_clusters <- text_kmeans_clust6$cluster
plot(points,
     main = 'K-Means clustering with 6 clusters',
     col = as.factor(kmeans_clusters),
     mai = c(0, 0, 0, 0),
     mar = c(0, 0, 0, 0),
     xaxt = 'n', yaxt = 'n',
     xlab = '', ylab = '')
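
Besides comparing plots, one common heuristic for choosing k is the elbow method: compute the total within-cluster sum of squares for a range of k and look for the point where the decrease levels off. The sketch below does this on simulated toy data with three groups; the seed, the range of k and nstart = 10 are arbitrary choices for the illustration, and the same loop could be run on dtm_cut.

```r
set.seed(42)

# Toy data: three groups of 30 points in two dimensions
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 4), ncol = 2),
  matrix(rnorm(60, mean = 8), ncol = 2)
)

# Total within-cluster sum of squares for k = 1, ..., 6
ks <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

print(data.frame(k = ks, wss = round(wss, 1)))
# plot(ks, wss, type = "b") shows the same numbers as an elbow curve
```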

  1. Apply the hierarchical clustering method on the distance matrix with Ward’s minimum variance method (“ward.D2”), and complete linkage method (“complete”). Plot the resulting clustering trees (dendrograms).

hierarcom_clustering <- hclust(dist_matrix, method = "complete")
plot(hierarcom_clustering, cex = 0.9, hang = -1)
rect.hclust(hierarcom_clustering, k = 5)

hierarWard_clustering <- hclust(dist_matrix, method = "ward.D2")
plot(hierarWard_clustering, cex = 0.9, hang = -1)
rect.hclust(hierarWard_clustering, k = 5)

  1. Cut each tree into 5 groups and plot the resulting clusters on the PCA components.

hierar_clusters_com <- cutree(hierarcom_clustering, k = 5)
plot(points,
     main = 'Hierarchical clustering complete linkage',
     col = as.factor(hierar_clusters_com),
     mai = c(0, 0, 0, 0),
     mar = c(0, 0, 0, 0),
     xaxt = 'n', yaxt = 'n',
     xlab = '', ylab = '')

hierar_clusters_ward <- cutree(hierarWard_clustering, k = 5)
plot(points,
     main = 'Hierarchical clustering Ward linkage',
     col = as.factor(hierar_clusters_ward),
     mai = c(0, 0, 0, 0),
     mar = c(0, 0, 0, 0),
     xaxt = 'n', yaxt = 'n',
     xlab = '', ylab = '')
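
In addition to the plots, a cross-tabulation of the two cluster assignments shows how far the linkage methods agree. The sketch below uses simulated toy data with two groups and Euclidean distances; on the news data, the same table() call could be applied to hierar_clusters_com and hierar_clusters_ward.

```r
set.seed(7)

# Toy data: two groups of 15 points in two dimensions
toy <- rbind(
  matrix(rnorm(30, mean = 0), ncol = 2),
  matrix(rnorm(30, mean = 6), ncol = 2)
)
d <- dist(toy)

# Cut both trees into the same number of groups
cut_complete <- cutree(hclust(d, method = "complete"), k = 2)
cut_ward     <- cutree(hclust(d, method = "ward.D2"), k = 2)

# Rows: complete-linkage clusters, columns: Ward clusters
agreement <- table(complete = cut_complete, ward = cut_ward)
print(agreement)
```

Cluster numbers are arbitrary labels, so agreement shows up as one dominant cell per row and not necessarily on the diagonal.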

  1. From the dbscan library, apply a density-based clustering algorithm to the distance matrix (the solution below uses hdbscan(), the hierarchical variant of DBSCAN) and plot the output with the PCA components. Compare the visualization with the output of the previous methods.

dbscan_clustering <- hdbscan(dist_matrix, minPts = 10)
dbscan_clusters <- dbscan_clustering$cluster
plot(points,
     main = 'Density-based clustering',
     col = as.factor(dbscan_clusters),
     mai = c(0, 0, 0, 0),
     mar = c(0, 0, 0, 0),
     xaxt = 'n', yaxt = 'n',
     xlab = '', ylab = '')
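
The category labels that we ignored during clustering can now be used for evaluation. A minimal sketch, using made-up label and cluster vectors: cross-tabulate clusters against labels and compute purity, the share of points that fall in the majority category of their cluster. The vectors are hypothetical; on the news data you would use the category column of df_final and one of the cluster vectors from above. Cluster 0 is treated as noise, which is how the dbscan package marks unassigned points.

```r
# Hypothetical true categories and cluster assignments (0 = noise)
labels   <- c("sport", "sport", "sport", "tech", "tech",
              "business", "business", "business")
clusters <- c(1, 1, 2, 2, 2, 0, 3, 3)

# Share of points left unassigned
noise_share <- mean(clusters == 0)

# Cross-tabulate the assigned points only
assigned <- clusters != 0
tab <- table(cluster = clusters[assigned], label = labels[assigned])
print(tab)

# Purity: points in the majority category of their cluster / assigned points
purity <- sum(apply(tab, 1, max)) / sum(tab)
print(c(noise_share = noise_share, purity = purity))
```

Purity tends to increase with the number of clusters, so it is best compared between solutions with the same k.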


In this practical, we learned about:

  • Text clustering algorithms
  • k-means (kmeans)
  • Hierarchical clustering (hclust)
  • Density-based clustering (hdbscan)

End of Practical