In this practical, we are going to apply different clustering algorithms and a topic modeling approach to sports news articles and cluster them into different groups.
Today we will use the following libraries. Make sure you have them installed!
from sklearn.datasets import load_files
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.cluster import MiniBatchKMeans
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram
from sklearn import metrics
from sklearn.decomposition import NMF
from sklearn.decomposition import LatentDirichletAllocation
# for reproducibility
random_state = 321
1. Here we are going to use another set of BBC news articles, the BBC Sport dataset. This dataset is provided for use as a benchmark for machine learning research. The BBC Sport dataset consists of 737 documents from the BBC Sport website corresponding to sports news articles from 2004-2005 in five topical areas: athletics, cricket, football, rugby, and tennis. Upload the bbcsport-fulltext.zip file and extract it using the code below. Convert the resulting object to a dataframe.
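Where that code is not reproduced here, a minimal sketch of the loading step might look as follows; the extracted folder name ('bbcsport') and the column names are assumptions, not part of the exercise.

# load_files expects one sub-folder per class inside the given directory
data = load_files('bbcsport', encoding='utf-8', decode_error='replace')

# Convert to a dataframe with the raw text and the (named) outcome labels
df = pd.DataFrame({'text': data.data,
                   'label': [data.target_names[t] for t in data.target]})
print(df.shape)  # expected: (737, 2)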
2. For text clustering and topic modeling, we will ignore the outcome variable (the labels), but we will use it when evaluating the models. Create a copy of the dataframe with the outcome variable removed.
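For instance, assuming the dataframe and column names from the sketch above:

# Keep the labels aside for later evaluation; work on a copy without them
labels = df['label']
df_text = df.drop(columns=['label']).copy()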
3. Apply the following pre-processing steps and convert the data to a document-term matrix with term frequencies:
# You can check the vocabulary of your vectorizer with the following line
# tfidf_vectorizer.vocabulary_
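The exact pre-processing steps are not reproduced here; one possible set-up with TfidfVectorizer is sketched below, where the stop-word and document-frequency options are assumptions rather than the prescribed steps.

# Build a document-term matrix from the article texts
tfidf_vectorizer = TfidfVectorizer(stop_words='english',  # drop common English words
                                   max_df=0.95,           # ignore near-ubiquitous terms
                                   min_df=2)              # ignore terms in fewer than 2 documents
dtm = tfidf_vectorizer.fit_transform(df_text['text'])
print(dtm.shape)  # (n_documents, n_terms)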
4. Use the MiniBatchKMeans function from the sklearn package to fit a K-Means clustering model with 5 clusters.
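A sketch of the fitting step, reusing the variable names from the earlier sketches:

# Fit mini-batch K-Means with 5 clusters on the document-term matrix
kmeans = MiniBatchKMeans(n_clusters=5, random_state=random_state)
kmeans.fit(dtm)
print(kmeans.labels_[:10])  # cluster assignments of the first 10 articles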
5. What are the top terms in each cluster?
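One common way to inspect this is to sort each cluster centroid and look up the corresponding terms; a sketch (use get_feature_names() instead of get_feature_names_out() on older scikit-learn versions):

# For each cluster, list the 10 terms with the largest centroid weights
terms = tfidf_vectorizer.get_feature_names_out()
order = kmeans.cluster_centers_.argsort()[:, ::-1]  # descending weight per cluster
for i in range(5):
    top_terms = [terms[idx] for idx in order[i, :10]]
    print('Cluster %d: %s' % (i, ', '.join(top_terms)))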
Based on these top 10 terms we can manually label our clustering output as:
6. Visualize the output of the K-Means clustering: first apply a PCA method to transform the high-dimensional feature space into 2 dimensions, and then plot the points using a scatter plot.
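A sketch, assuming the fitted kmeans model and the TF-IDF matrix from the previous steps (PCA needs a dense array):

# Project the TF-IDF vectors onto the first 2 principal components
pca = PCA(n_components=2, random_state=random_state)
points = pca.fit_transform(dtm.toarray())

# Scatter plot coloured by the predicted cluster
plt.scatter(points[:, 0], points[:, 1], c=kmeans.labels_, cmap='tab10', s=10)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('K-Means clusters in 2-D PCA space')
plt.show()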
7. Evaluate the quality of the K-Means clustering with the sklearn metrics for clustering: homogeneity_score, completeness_score, v_measure_score, adjusted_rand_score, and silhouette_score.
Evaluation for unsupervised learning algorithms is a bit difficult and requires human judgement, but there are some metrics you can use. There are two kinds of metrics, depending on whether or not you have the true labels.
If you have a labelled dataset, you can use metrics that give you an idea of how good your clustering model is. For this purpose you can use the sklearn.metrics module; homogeneity_score is one of the possible metrics. As per the documentation, the score ranges between 0 and 1, where 1 stands for perfectly homogeneous labeling.
If you do not have labels for your dataset, you can still evaluate your clustering model with some metrics. One of them is the silhouette_score. From the sklearn documentation: the Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
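Putting the metrics together, with the true labels kept from step 2 (the variable names follow the earlier sketches):

pred = kmeans.labels_
print('Homogeneity:   %.3f' % metrics.homogeneity_score(labels, pred))
print('Completeness:  %.3f' % metrics.completeness_score(labels, pred))
print('V-measure:     %.3f' % metrics.v_measure_score(labels, pred))
print('Adjusted Rand: %.3f' % metrics.adjusted_rand_score(labels, pred))
# The silhouette score uses the feature matrix itself, not the true labels
print('Silhouette:    %.3f' % metrics.silhouette_score(dtm, pred))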
8. Apply the K-Means clustering method for a range of 3 to 7 clusters, and calculate the squared loss obtained for each clustering. Apply the elbow method to find the optimal k. (Tip: use cls.inertia_ for the squared loss.)
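A sketch of the elbow search; inertia_ is the within-cluster sum of squared distances of the fitted model:

# Fit K-Means for k = 3..7 and record the squared loss (inertia)
ks = range(3, 8)
losses = []
for k in ks:
    cls = MiniBatchKMeans(n_clusters=k, random_state=random_state)
    cls.fit(dtm)
    losses.append(cls.inertia_)

plt.plot(list(ks), losses, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Squared loss (inertia)')
plt.title('Elbow method')
plt.show()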
9. Use the following two news articles as your test data, and predict cluster labels for your new dataset with the best value for K and the K-Means algorithm.
documents = ['Frank de Boer out as Oranje manager after early Euro 2020 exit Dutch men’s football team coach.',
'The time has come for Nadal to be selective in the events that he should and should not play. This is where he can start the difficulty. After a rigorous participation of the clay season, Rafael Nadal definitely wants to conserve his energies for as long as possible.']
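Prediction then amounts to vectorising the new articles with the already-fitted vectorizer and calling predict; a sketch, where best_k is a placeholder for the value of K you selected in step 8:

best_k = 5  # placeholder: use the k chosen with the elbow method
best_model = MiniBatchKMeans(n_clusters=best_k, random_state=random_state).fit(dtm)

# Transform the new articles with the *fitted* vectorizer, then predict
new_dtm = tfidf_vectorizer.transform(documents)
print(best_model.predict(new_dtm))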
10. Hierarchical clustering is a type of unsupervised machine learning algorithm used to cluster unlabeled data points. Similar to the K-Means clustering, hierarchical clustering groups the data points with similar characteristics together. Apply a hierarchical clustering with the ward linkage on the BBC Sport dataset. Fit the model with 5 clusters and check the predicted labels.
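A sketch with AgglomerativeClustering; ward linkage uses Euclidean distances, and the estimator needs a dense matrix:

# Ward-linkage hierarchical clustering with 5 clusters
hier = AgglomerativeClustering(n_clusters=5, linkage='ward')
hier_labels = hier.fit_predict(dtm.toarray())
print(hier_labels[:20])  # predicted cluster labels for the first 20 articles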
11. Plot a dendrogram for your hierarchical clustering model using the function below. To do this, you need to fit the model again without specifying the number of clusters.
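The function referred to is likely the plot_dendrogram helper from the scikit-learn documentation example; a sketch of it and its use, under that assumption:

def plot_dendrogram(model, **kwargs):
    # Build a scipy-style linkage matrix from the fitted model and plot it
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count
    linkage_matrix = np.column_stack([model.children_,
                                      model.distances_,
                                      counts]).astype(float)
    dendrogram(linkage_matrix, **kwargs)

# Refit without a fixed number of clusters so that distances_ is computed
hier_full = AgglomerativeClustering(n_clusters=None, distance_threshold=0,
                                    linkage='ward')
hier_full.fit(dtm.toarray())
plot_dendrogram(hier_full, truncate_mode='level', p=4)
plt.show()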
Topic modeling is another unsupervised method for text mining applications where we want to get an idea of what topics we have in our dataset. A topic is a collection of words that describe the overall theme. For example, in case of news articles, you might think of topics as the categories in the dataset. Just like clustering algorithms, there are some algorithms that need you to specify the number of topics you want to extract from the dataset and some that automatically determine the number of topics. Here, we are going to use the Latent Dirichlet Allocation (LDA) method for topic modeling. You can check sklearn's documentation for more details about LDA from the sklearn.decomposition
module.
12. One of the most widely used approaches to topic modeling is Latent Dirichlet Allocation (LDA). LDA is based upon two general assumptions:
Train an LDA model from the sklearn package for topic modeling with 5 components.
We used LDA to create topics along with the probability distribution for each word in our vocabulary for each topic. The parameter n_components
specifies the number of categories, or topics, that we want our text to be divided into.
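A sketch of the training step, reusing the document-term matrix from step 3 (note that LDA is typically fitted on plain term counts; whether you reuse the weighted matrix or build a count matrix is up to you):

# Fit an LDA model with 5 topics
lda = LatentDirichletAllocation(n_components=5, random_state=random_state)
lda.fit(dtm)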
13. Print the 10 words with highest probabilities for all the five topics.
You can also use the following function for this purpose:
def display_topics(model, feature_names, no_top_words):
    topic_dict = {}
    for topic_idx, topic in enumerate(model.components_):
        # indices of the no_top_words highest-weighted terms, in descending order
        topic_dict["Topic %d words" % (topic_idx)] = ['{}'.format(feature_names[i])
                                                      for i in topic.argsort()[:-no_top_words - 1:-1]]
        topic_dict["Topic %d weights" % (topic_idx)] = ['{:.1f}'.format(topic[i])
                                                        for i in topic.argsort()[:-no_top_words - 1:-1]]
    return pd.DataFrame(topic_dict)
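For example, with the fitted LDA model and the vectorizer from the earlier sketches:

# Show the 10 highest-probability words (and their weights) per topic
display_topics(lda, tfidf_vectorizer.get_feature_names_out(), 10)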
14. Transform your data using the learned topics. Check the shape of the output. What can be the use of this output?
This output can be used as our new observations and features for further tasks.
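A sketch:

# Document-topic matrix: one row per article, one column per topic
doc_topics = lda.transform(dtm)
print(doc_topics.shape)  # (n_documents, n_topics)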
15. Use the score function for LDA to calculate the log likelihood of your data. Compare two LDA models with 5 and 10 topics.
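A sketch of the comparison:

# Fit two LDA models and compare their (approximate) log likelihoods
lda5 = LatentDirichletAllocation(n_components=5, random_state=random_state).fit(dtm)
lda10 = LatentDirichletAllocation(n_components=10, random_state=random_state).fit(dtm)
print('Log likelihood, 5 topics: ', lda5.score(dtm))
print('Log likelihood, 10 topics:', lda10.score(dtm))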
Many procedures use the log of the likelihood rather than the likelihood itself, because it is easier to work with. The log likelihood will always be negative, with higher values (closer to zero) indicating a better-fitting model. Here, the higher log likelihood belongs to the model with 5 topics.