Text clustering
Probabilistic topic modeling
Latent Dirichlet allocation
Three concepts: words, topics, and documents
Documents are collections of words and have a probability distribution over topics
Topics have a probability distribution over words
Model:
Treat data as observations that arise from a generative probabilistic process that includes hidden variables: For documents, the hidden variables reflect the thematic structure of the collection.
Infer the hidden structure using posterior inference: What are the topics that describe this collection?
Situate new data into the estimated model: How does this query or new document fit into the estimated topic structure?
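The generative process behind the model above can be sketched as follows. This is an illustrative simulation only: the vocabulary and the topic-word distributions (beta) are made up for the example, whereas in real LDA they are hidden variables to be inferred.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["gene", "dna", "brain", "neuron", "data", "model"]
n_topics, n_words = 2, 10

# Hypothetical model parameters (in practice these are inferred, not known):
alpha = np.ones(n_topics) * 0.5            # Dirichlet prior over topics
beta = np.array([                          # each row: a topic's distribution over words
    [0.4, 0.4, 0.05, 0.05, 0.05, 0.05],    # topic 0: genetics-flavored
    [0.05, 0.05, 0.4, 0.4, 0.05, 0.05],    # topic 1: neuroscience-flavored
])

# Generate one document:
theta = rng.dirichlet(alpha)               # the document's topic proportions
doc = []
for _ in range(n_words):
    z = rng.choice(n_topics, p=theta)      # pick a topic for this word slot
    w = rng.choice(len(vocab), p=beta[z])  # pick a word from that topic
    doc.append(vocab[w])
print(doc)
```

Posterior inference runs this process in reverse: given only the observed words, it recovers plausible theta and beta.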
What is latent Dirichlet allocation? It’s a way of automatically discovering the topics that a collection of sentences contains.
Suppose you have the following set of sentences:
Given these sentences and asked for 2 topics, LDA might produce something like:
How does LDA perform this discovery?
First, randomly assign each word in each document to one of the k topics. Then, for each document d, go through each word w in d, and for each topic t, compute two things:
p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and
p(word w | topic t) = the proportion of assignments to topic t over all documents that come from this word w.
Reassign w a new topic, where we choose topic t with probability p(topic t | document d) * p(word w | topic t)
In other words, in this step, we’re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.
After repeating the previous step a large number of times, you’ll eventually reach a roughly steady state where your assignments are pretty good.
Use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated to each topic (by counting the proportion of words assigned to each topic overall).
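The resampling loop described above can be sketched as a collapsed Gibbs sampler. The tiny corpus, the smoothing hyperparameters alpha and eta, and the iteration count below are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(321)

# Tiny hypothetical corpus: documents as lists of word ids (vocab size V = 6)
docs = [[0, 0, 1, 1, 2], [3, 3, 4, 4, 5], [0, 1, 3, 4, 5]]
K, V = 2, 6
alpha, eta = 0.5, 0.01          # smoothing hyperparameters (assumed values)

# Step 1: random initial topic assignment for every word
z = [[rng.integers(K) for _ in d] for d in docs]

# Count tables used to form the two proportions in the text
ndk = np.zeros((len(docs), K))  # words in doc d assigned to topic k
nkw = np.zeros((K, V))          # word w assigned to topic k, over all docs
nk = np.zeros(K)                # total words assigned to topic k
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

for _ in range(200):            # "repeat a large number of times"
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]         # remove the current assignment from the counts
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # p(topic t | document d) * p(word w | topic t), smoothed
            p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k         # record the new assignment
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

# Estimated topic mixture of each document (proportion of words per topic)
theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
print(theta.round(2))
```

The final `theta` is exactly the per-document count of topic assignments, normalized (with smoothing) into proportions, as described in the last step above.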
library(topicmodels)
LDA(data, k = 5, method = "Gibbs", control = list(seed = 321))
Scalability
Ability to deal with various types of data
Few or no assumptions about the input data
Minimal requirements for domain knowledge
Interpretability and usability
Internal criterion: A good clustering will produce high quality clusters in which:
the intra-class (that is, intra-cluster) similarity is high
the inter-class similarity is low
The measured quality of a clustering depends on both the document representation and the similarity measure used
Criteria to determine whether the clusters are meaningful
Internal validation
External validation
Coherence
Inter-cluster similarity vs. intra-cluster similarity
Davies–Bouldin index
\(DB = \frac{1}{k}\sum_{i=1}^k{\underset{j \neq i}{\operatorname{max}}{(\frac{\sigma_i + \sigma_j}{d(c_i,c_j)})}}\) ← Evaluate every pair of clusters; \(\sigma_i\) is the average distance of the points in cluster \(i\) to its centroid \(c_i\), and \(d(c_i, c_j)\) is the distance between centroids
We prefer smaller DB-index!
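The DB index can be computed directly from its definition. The sketch below assumes Euclidean distance and uses a small made-up dataset of two tight, well-separated clusters, which should yield a small index:

```python
import numpy as np

def davies_bouldin(X, labels):
    """DB index: average, over clusters, of the worst-case ratio
    (sigma_i + sigma_j) / d(c_i, c_j) against every other cluster."""
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    # sigma_i: mean distance of cluster i's points to its centroid
    sig = np.array([np.linalg.norm(X[labels == k] - c, axis=1).mean()
                    for k, c in zip(ks, cents)])
    db = 0.0
    for i in range(len(ks)):
        ratios = [(sig[i] + sig[j]) / np.linalg.norm(cents[i] - cents[j])
                  for j in range(len(ks)) if j != i]
        db += max(ratios)
    return db / len(ks)

# Two tight, well-separated clusters -> small DB index
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(davies_bouldin(X, labels))
```

Moving the two clusters closer together, or spreading their points out, raises both terms of the ratio and therefore the index.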
Quality measured by its ability to discover some or all of the hidden patterns or latent classes in gold standard data
Assesses a clustering with respect to ground truth … requires labeled data
Assume documents with \(C\) gold standard classes, while our clustering algorithms produce \(K\) clusters, \(\omega_1\), \(\omega_2\), …, \(\omega_K\) with \(n_i\) members.
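One common external criterion for this setting (not named above, introduced here for illustration) is purity: each cluster \(\omega_k\) is credited with its most frequent gold class, and the credited counts are summed and divided by the total number of documents. A minimal sketch with hypothetical labels:

```python
import numpy as np

def purity(clusters, classes):
    """Purity: sum over clusters of the size of the majority gold
    class in that cluster, divided by the total number of items."""
    correct = 0
    for k in np.unique(clusters):
        members = classes[clusters == k]
        correct += np.bincount(members).max()  # majority class in omega_k
    return correct / len(classes)

# Hypothetical assignment: K = 3 clusters over C = 3 gold classes
clusters = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
classes  = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 0])
print(purity(clusters, classes))
```

Each cluster here has one minority item, so purity is 9/12 = 0.75; a perfect clustering would score 1.0.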
Text clustering
In clustering, clusters are inferred from the data without human input (unsupervised learning)
Topic modeling