What is text clustering?
What are the applications?
How to cluster text data?
Clustering: the process of grouping a set of objects into clusters of similar objects
Discover “natural structure” of data
Basic criteria
high intra-cluster similarity
low inter-cluster similarity
No (little) supervision signal about the underlying clustering structure
Need similarity/distance as guidance to form clusters
Hard versus soft clustering
Partitional clustering
Hierarchical clustering
Topic modeling
Hard clustering: Each document belongs to exactly one cluster
Soft clustering: A document can belong to more than one cluster.
Partitional clustering method: Construct a partition of \(n\) documents into a set of \(K\) clusters
Given: a set of documents and the number \(K\)
Find: a partition of \(K\) clusters that optimizes the chosen partitioning criterion
Typical partitional clustering algorithms
k-means clustering
Assumes documents are real-valued vectors.
Clusters are based on centroids (the mean of the points in a cluster \(c\)):
\[\vec \mu(c)=\frac{1}{|c|}\sum_{\vec x \in c}{\vec x}\]
Select \(K\) random docs \(\{s_1, s_2,… s_K\}\) as seeds.
Until clustering converges (or other stopping criterion):
For each document \(d_i\):
Assign \(d_i\) to the cluster \(c_j\) such that \(dist(d_i, s_j)\) is minimal.
(Next, update the seeds to the centroid of each cluster)
For each cluster \(c_j\): \(s_j = \vec\mu(c_j)\)
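A minimal sketch of this loop in practice, assuming scikit-learn (the lecture does not prescribe a library); the toy corpus and all parameter values are hypothetical:

```python
# K-means on documents, assuming scikit-learn; `docs` is a toy corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the cat sat on the mat",
    "dogs and cats make good pets",
    "stock markets fell sharply today",
    "investors sold their shares",
]

X = TfidfVectorizer().fit_transform(docs)  # documents as real-valued vectors
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)                 # hard assignment: one cluster id per document
print(km.cluster_centers_.shape)  # K centroids, one per cluster
```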
Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.
Clustering obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
Typical hierarchical clustering algorithms
Bottom-up agglomerative clustering (sketched below)
Starts with each document in a separate cluster
Repeatedly merges the closest pair of clusters
The history of merging forms a binary tree or hierarchy.
Many variants for defining the closest pair of clusters (linkage methods): single-link, complete-link, average-link, centroid
Top-down divisive clustering
Starts with all data as one cluster
Repeatedly splits a remaining cluster into two
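As an illustration of the bottom-up variant, here is a sketch using SciPy's hierarchical clustering (an assumed library choice); the document vectors are random stand-ins:

```python
# Bottom-up agglomerative clustering sketch, assuming SciPy;
# X stands in for dense document vectors (e.g., TF-IDF rows).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.random((6, 10))           # 6 toy "documents"

Z = linkage(X, method="average")  # linkage choice: 'single', 'complete', 'average', ...
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
print(labels)
```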
Three concepts: words, topics, and documents
Each document is a collection of words and has a probability distribution over topics
Each topic has a probability distribution over words
Model: treat data as observations that arise from a generative probabilistic process that includes hidden variables; for documents, the hidden variables reflect the thematic structure of the collection.
Infer the hidden structure using posterior inference: What are the topics that describe this collection?
Situate new data into the estimated model: How does this query or new document fit into the estimated topic structure?
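As a quick illustration of fitting a topic model and then situating documents in it, here is a sketch using gensim's LdaModel (an assumed library; the toy corpus and parameters are hypothetical):

```python
# Fitting a topic model and inspecting the inferred structure,
# assuming gensim; `texts` is a hypothetical tokenized corpus.
from gensim import corpora
from gensim.models import LdaModel

texts = [
    ["cat", "dog", "pet", "cute"],
    ["stock", "market", "shares", "investors"],
    ["dog", "walk", "park", "pet"],
]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)
print(lda.print_topics())                  # topics as word distributions
print(lda.get_document_topics(corpus[0]))  # topic mixture of the first document
```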
What is latent Dirichlet allocation? It’s a way of automatically discovering topics that these sentences contain.
Suppose you have a small set of sentences, some about food and some about cute animals.
Given these sentences and asked for 2 topics, LDA might produce one topic dominated by food words and another dominated by animal words, together with each sentence's mixture of the two topics.
How does LDA perform this discovery?
Go through each document and randomly assign each word in the document to one of the K topics; this gives initial (poor) topic assignments.
To improve them, for each document d, go through each word w in d…
And for each topic t, compute two things:
p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and
p(word w | topic t) = the proportion of assignments to topic t over all documents that come from this word w.
Reassign w a new topic, where we choose topic t with probability p(topic t | document d) * p(word w | topic t)
In other words, in this step, we’re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.
After repeating the previous step a large number of times, you’ll eventually reach a roughly steady state where your assignments are pretty good.
Use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated to each topic (by counting the proportion of words assigned to each topic overall).
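The resampling step described above can be written down directly. Below is a from-scratch sketch of this collapsed Gibbs sampling procedure; the toy corpus, the hyperparameters alpha and beta, and the iteration count are all assumptions for illustration:

```python
import random
from collections import defaultdict

# Toy corpus (hypothetical); each document is a list of tokens.
docs = [
    ["apple", "banana", "fruit", "salad"],
    ["dog", "cat", "pet", "cute"],
    ["fruit", "salad", "apple", "banana"],
    ["cat", "dog", "cute", "pet"],
]
K = 2                    # number of topics (assumed)
alpha, beta = 0.1, 0.01  # Dirichlet smoothing hyperparameters (assumed)
V = len({w for d in docs for w in d})

random.seed(0)
# Step 1: random initial topic assignment for every word occurrence.
assignments = [[random.randrange(K) for _ in d] for d in docs]

# Count tables the sampler maintains.
doc_topic = [[0] * K for _ in docs]                # n(d, t)
topic_word = [defaultdict(int) for _ in range(K)]  # n(t, w)
topic_total = [0] * K                              # n(t)
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = assignments[d][i]
        doc_topic[d][t] += 1
        topic_word[t][w] += 1
        topic_total[t] += 1

# Step 2: repeatedly resample each word's topic given all other assignments.
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = assignments[d][i]
            # Take the current word out of the counts.
            doc_topic[d][t] -= 1
            topic_word[t][w] -= 1
            topic_total[t] -= 1
            # weight(t) proportional to p(topic t | document d) * p(word w | topic t)
            weights = [
                (doc_topic[d][k] + alpha)
                * (topic_word[k][w] + beta) / (topic_total[k] + beta * V)
                for k in range(K)
            ]
            t = random.choices(range(K), weights=weights)[0]
            assignments[d][i] = t
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

# Step 3: read off topic mixtures per document from the final counts.
for d, counts in enumerate(doc_topic):
    total = sum(counts)
    print(d, [c / total for c in counts])
```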
Hierarchical LDA (hLDA): automatically mine the hierarchical dimension of topics
Supervised LDA (sLDA): learn topics that are in line with the class labels
Hybrid LDA: extract topics together with other information
LDA & BERT: we cover deep learning and BERT later
Scalability
Ability to deal with various types of data
Few or no assumptions about the input data
Minimal requirements for domain knowledge
Interpretability and usability
Internal criterion: A good clustering will produce high quality clusters in which:
The intra-class (that is, intra-cluster) similarity is high
The inter-class similarity is low
The measured quality of a clustering depends on both the document representation and the similarity measure used
Criteria to determine whether the clusters are meaningful
Internal validation
External validation
Coherence
Inter-cluster similarity vs. intra-cluster similarity
Davies–Bouldin index
\(DB = \frac{1}{k}\sum_{i=1}^k{\underset{j \neq i}{\operatorname{max}}\left(\frac{\sigma_i + \sigma_j}{d(c_i,c_j)}\right)}\), where \(\sigma_i\) is the average distance of cluster \(i\)'s members to its centroid \(c_i\), and \(d(c_i,c_j)\) is the distance between the centroids (every pair of clusters is evaluated)
We prefer smaller DB-index!
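For reference, scikit-learn ships this measure as davies_bouldin_score; a small sketch with stand-in data follows:

```python
# Davies–Bouldin index via scikit-learn (lower is better);
# X and the k-means labels are stand-ins for real document vectors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
X = rng.random((20, 5))
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(davies_bouldin_score(X, labels))  # prefer the clustering with the smaller value
```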
Quality measured by its ability to discover some or all of the hidden patterns or latent classes in gold standard data
Assesses a clustering with respect to ground truth … requires labeled data
Assume documents with \(C\) gold standard classes, while our clustering algorithm produces \(K\) clusters, \(\omega_1\), \(\omega_2\), …, \(\omega_K\), where cluster \(\omega_i\) has \(n_i\) members.
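The lecture does not fix a specific external measure at this point, so as one common choice, here is a sketch comparing cluster assignments against gold classes using normalized mutual information and the adjusted Rand index from scikit-learn (the label arrays are toy examples):

```python
# External validation against gold standard classes, assuming scikit-learn.
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

classes  = [0, 0, 0, 1, 1, 2, 2, 2]  # C = 3 gold standard classes
clusters = [1, 1, 0, 0, 0, 2, 2, 2]  # K = 3 clusters from the algorithm

print(normalized_mutual_info_score(classes, clusters))
print(adjusted_rand_score(classes, clusters))
```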
Text clustering
In clustering, clusters are inferred from the data without human input (unsupervised learning)
Many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents
Evaluation