How to represent a document?
What are vector space and bag-of-words models?
How to classify text data?
How to evaluate a classifier?
Supervised learning: Learning a function that maps an input to an output based on example input-output pairs.
infer a function from labeled training data
use the inferred function to label new instances
Human experts annotate a set of text data
Which problem is not (or less likely to be) a text classification task?
Author’s gender detection from text
Determining the smoking status (yes/no) of patients from clinical letters
Grouping similar news articles
Classifying reviews into positive and negative sentiments
Represent by a string?
Represent by a list of sentences?
A vector space is a collection of vectors
A vector is an ordered finite list of numbers.
Represent documents by concept vectors
Each concept defines one dimension
A large number of concepts define a high-dimensional space
Element of vector corresponds to concept weight
The process of converting text into numbers is called Vectorization
Distance between the vectors in this concept space
Terms are generic features that can be extracted from text
Typically, terms are single words, keywords, n-grams, or phrases
Documents are represented as vectors of terms
Each dimension (concept) corresponds to a separate term
\[d = (w_1, ..., w_n)\]
Bag of Words, a Vector Space Model where:
Terms: words (more generally we may use n-grams, etc.)
Weights: number of occurrences of the terms in the document
Document-Term Matrix (DTM), Term-Document Matrix (TDM)
Topics (later)
Word embeddings (later)
Words (terms) and weights as the basis for vector representations of text
Doc1: Text mining is to identify useful information.
Doc2: Useful information is mined from text.
Doc3: Apple is delicious.
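As a rough illustration, the three example documents can be turned into a document-term matrix with raw counts as weights. This is a minimal sketch: the `tokenize` helper and its lowercase, letters-only rule are assumptions made for the example, not a prescribed preprocessing step.

```python
import re
from collections import Counter

docs = [
    "Text mining is to identify useful information.",
    "Useful information is mined from text.",
    "Apple is delicious.",
]

def tokenize(text):
    # Assumed tokenizer: lowercase the text and keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(d) for d in docs]

# The vocabulary defines the dimensions of the vector space (one per term).
vocabulary = sorted({term for tokens in tokenized for term in tokens})

# Document-term matrix: one row per document, one column per term,
# each cell holding the number of occurrences of the term in the document.
dtm = []
for tokens in tokenized:
    counts = Counter(tokens)
    dtm.append([counts[term] for term in vocabulary])

for row in dtm:
    print(row)
```

Transposing `dtm` gives the term-document matrix. Because no tokenization rule merges "mining" and "mined", they remain separate dimensions here.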
Binary
Idea: a term is more important if it occurs more frequently in a document
use the raw frequency count of term \(t\) in doc \(d\)
Idea: a term is more discriminative if it occurs frequently in a document but appears in only a few documents.
TF-IDF (term frequency–inverse document frequency) weight:
\[w_{d,t} = TF_{d,t} \cdot IDF_t\]
Let \(n_{d,t}\) denote the number of times term \(t\) appears in document \(d\). The relative frequency of \(t\) in \(d\) is:
\[TF_{d,t} = \frac{n_{d,t}}{\sum_i{n_{d,i}}}\]
Let \(N\) denote the number of documents and \(N_t\) denote the number of documents containing term \(t\).
\[IDF_t = \log\left(\frac{N}{N_t}\right)\]
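The TF and IDF definitions above can be sketched directly in code. The function names are hypothetical, the documents are assumed to be already tokenized, and the natural logarithm is assumed since the base is not specified. The sketch also assumes the term occurs in at least one document, otherwise \(N_t = 0\).

```python
import math
from collections import Counter

# Pre-tokenized versions of the three example documents (assumed tokenization).
docs = [
    ["text", "mining", "is", "to", "identify", "useful", "information"],
    ["useful", "information", "is", "mined", "from", "text"],
    ["apple", "is", "delicious"],
]

def tf(term, doc):
    # Relative frequency of the term in the document: n_{d,t} / sum_i n_{d,i}.
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # log(N / N_t); assumes the term appears in at least one document.
    n_t = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_t)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("is", docs[0], docs))     # "is" occurs in every document
print(tf_idf("apple", docs[2], docs))  # "apple" occurs in one document only
```

A term that appears in every document, such as "is", gets an IDF of zero and therefore a TF-IDF weight of zero, which matches the intuition that it does not discriminate between documents.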
Supervised learning
Input:
A training set of \(m\) manually-labeled documents \((d_1,c_1),\cdots,(d_m,c_m)\)
A fixed set of classes \(C = \{c_1, c_2,…, c_J\}\)
Output:
Rules based on combinations of words or other features
Rules carefully refined by expert
But building and maintaining these rules is expensive
Data/Domain specifics
Not recommended!
Each class is represented by its centroid, with test samples classified to the class with the nearest centroid. Using a training set of documents, the Rocchio algorithm builds a prototype vector, centroid, for each class. This prototype is an average vector over the training documents’ vectors that belong to a certain class.
\[\boldsymbol{\mu_c} = \frac{1}{|D_c|}\sum_{\mathbf{d} \in D_c}{\mathbf{d}}\]
where \(D_c\) is the set of documents in the corpus that belong to class \(c\) and \(\mathbf{d}\) is the vector representation of document \(d\).
The predicted label of document \(d\) is the class whose centroid has the smallest (Euclidean) distance to the document vector.
\[\hat{c} = \arg \min_c ||\boldsymbol{\mu_c} - \mathbf{d}||\]
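The two formulas above can be sketched as a small nearest-centroid classifier. This is an illustrative sketch only: the function names, the two-dimensional toy vectors, and the class labels are made up for the example.

```python
import math

def centroid(vectors):
    # Component-wise average of the vectors belonging to one class.
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fit_centroids(X, y):
    # Group the training vectors by class, then average each group.
    by_class = {}
    for vec, label in zip(X, y):
        by_class.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_class.items()}

def predict(centroids, d):
    # Return the class whose centroid is closest to the document vector.
    return min(centroids, key=lambda c: euclidean(centroids[c], d))

# Hypothetical toy data: two terms, two classes.
X = [[2, 0], [4, 0], [0, 3], [0, 5]]
y = ["tech", "tech", "food", "food"]
centroids = fit_centroids(X, y)

print(centroids)
print(predict(centroids, [1, 0]))
```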
Given a test document \(d\), the KNN algorithm finds the \(k\) nearest neighbors of \(d\) among all the documents in the training set, and scores the candidate classes based on the classes of the \(k\) neighbors.
After sorting the scores, the algorithm assigns the document to the class with the highest score.
The basic nearest neighbors classification uses uniform weights: that is, the value assigned to a query point is computed from a simple majority vote of the nearest neighbors.
Alternatively, the neighbors can be weighted so that nearer neighbors contribute more to the vote.
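A minimal sketch of both voting schemes follows. The function name `knn_predict`, the inverse-distance weighting, and the toy data are assumptions for illustration; other weighting functions are possible.

```python
import math
from collections import defaultdict

def knn_predict(X, y, query, k=3, weighted=False):
    # Sort training points by Euclidean distance to the query.
    neighbors = sorted((math.dist(vec, query), label) for vec, label in zip(X, y))
    scores = defaultdict(float)
    for dist, label in neighbors[:k]:
        # Uniform weights: each neighbor counts as one vote.
        # Distance weights (assumed form): vote of 1 / distance,
        # with a small constant to avoid division by zero.
        scores[label] += 1.0 / (dist + 1e-9) if weighted else 1.0
    return max(scores, key=scores.get)

# Hypothetical toy data: one close neighbor of class "a", two farther ones of class "b".
X = [[0.0, 0.0], [3.0, 0.0], [3.5, 0.0]]
y = ["a", "b", "b"]
query = [1.0, 0.0]

print(knn_predict(X, y, query, k=3))
print(knn_predict(X, y, query, k=3, weighted=True))
```

With uniform weights the two "b" neighbors win the majority vote. With distance weighting the single close "a" neighbor outweighs them.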
Applied to documents and classes
For a document \(d\) and a class \(c\)
\[P(c|d) = \frac{P(c)P(d|c)}{P(d)}\]
\[C_{NB} = \underset{c \in C}{\operatorname{argmax}}P(c) \cdot P(w_1, w_2, \ldots,w_n|c)\]
Bag of Words assumption: Assume position doesn’t matter
Conditional Independence: Assume the feature probabilities \(P(w_i|c)\) are independent given the class \(c\).
\[P(w_1, \ldots, w_n|c) = P(w_1 | c) \cdot P(w_2|c) \cdot P(w_3|c) \cdot \ldots \cdot P(w_n|c)\]
\[C_{NB} = \underset{c \in C}{\operatorname{argmax}}P(c) \cdot P(w_1 | c) \cdot P(w_2|c) \cdot P(w_3|c) \cdot \ldots \cdot P(w_n|c) \] \[C_{NB} = \underset{c \in C}{\operatorname{argmax}}P(c)\prod_{i \in positions}{P(w_i|c)}\]
First attempt: maximum likelihood estimates
\[\hat{P}(c) = \frac{count(C = c)}{N_{doc}}\] \[\hat{P}(w_i|c) = \frac{count(w_i, c)}{\sum_{w \in V}count(w, c)}\]
What if the word "coffee" never occurs in any training document labeled positive (thumbs-up)?
\[\hat{P}(\mathrm{''coffee''|positive}) = \frac{count(\mathrm{''coffee'', positive})}{\sum_{w \in V}{count(\mathrm{w,positive})}} = 0\]
Zero probabilities cannot be conditioned away, no matter the other evidence!
\[C_{NB} = \underset{c \in C}{\operatorname{argmax}}P(c)\prod_{i \in positions}{P(w_i|c)}\]
\[ \hat{P}(w_i|c) = \frac{count(w_i,c)+1}{\sum_{w \in V}{(count(w,c)+1)}} \]
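The add-one estimate above can be sketched as a small multinomial Naive Bayes classifier. The function names and the toy training set are hypothetical. Two choices go beyond the formulas and are assumptions of this sketch: scores are summed in log space to avoid numerical underflow, and words outside the training vocabulary are skipped.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    class_counts = Counter(labels)       # count(C = c)
    word_counts = defaultdict(Counter)   # count(w, c)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for tokens in docs for w in tokens}
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c in class_counts:
        total = sum(word_counts[c].values())
        score = math.log(class_counts[c] / n_docs)  # log P(c)
        for w in tokens:
            if w not in vocab:
                continue  # assumption: ignore words never seen in training
            # Add-one smoothing: (count(w,c) + 1) / (sum_w count(w,c) + |V|)
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical toy training set; "coffee" never occurs in a positive document.
train_docs = [["great", "movie"], ["good", "great", "fun"],
              ["bad", "coffee"], ["awful", "bad", "movie"]]
train_labels = ["positive", "positive", "negative", "negative"]
model = train_nb(train_docs, train_labels)

print(predict_nb(model, ["great", "coffee"]))
```

Without smoothing, the zero count for "coffee" in the positive class would force the positive score to zero regardless of the word "great".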
A decision tree is a hierarchical decomposition of the (training) data space, where a condition on the feature value is used to divide the data space hierarchically.
Top-down, by choosing a variable at each step that best splits the set of items.
Different algorithms use different measures of the homogeneity of the target variable within the subsets (e.g. Gini impurity, information gain)
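The two homogeneity measures mentioned above can be sketched as follows. The function names are hypothetical, and information gain is computed here with base-2 entropy for a binary split, which is one common convention rather than the only one.

```python
import math
from collections import Counter

def gini(labels):
    # Gini impurity: 1 - sum over classes of p_c^2.
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    # Shannon entropy in bits.
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())

def information_gain(parent, left, right):
    # Entropy of the parent minus the size-weighted entropy of the two subsets.
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["pos", "pos", "neg", "neg"]
print(gini(parent))
print(information_gain(parent, ["pos", "pos"], ["neg", "neg"]))
```

A split that separates the classes perfectly leaves both subsets pure, so it removes all of the parent's entropy.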
Training set
Test set
adapted from https://scikit-learn.org/stable/modules/cross_validation.html
What proportion of instances is correctly classified?
\[\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}\]
Accuracy is a valid choice of evaluation for classification problems which are well balanced and not skewed.
Suppose the target class is very sparse. Is accuracy still a good measure of model performance? What if we are predicting whether an asteroid will hit the earth? A model that just says “No” all the time will be 99% accurate. Such a model is highly accurate, but not at all valuable.
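A small sketch makes the point concrete. The data are invented for illustration: 100 instances of which one is positive, and a model that always predicts the negative class.

```python
def accuracy(y_true, y_pred):
    # Proportion of instances whose predicted label equals the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    # Proportion of true positives that the model actually finds.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if tp + fn else 0.0

# Hypothetical skewed data: 99 negatives, 1 positive.
y_true = [0] * 99 + [1]
y_pred = [0] * 100  # the model always says "No"

print(accuracy(y_true, y_pred))
print(recall(y_true, y_pred))
```

The accuracy is high, yet the model never identifies the single positive instance.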
Precision (also Positive Predictive Value): % of selected/retrieved items that are correct/relevant
Recall (also sensitivity): % of correct/relevant items that are selected/retrieved.
Precision is a valid choice of evaluation metric when we want to be very sure of our prediction.
Recall is a valid choice of evaluation metric when we want to capture as many positives as possible.
A combined measure that assesses the precision/recall tradeoff is F measure (weighted harmonic mean):
\[F = \frac{(\beta^2+1)PR}{\beta^2P + R}\] where \(\beta\) is a positive real number and is chosen such that recall is considered \(\beta\) times as important as precision.
Balanced F1 measure: \(\beta = 1\), \(F = 2PR/(P+R)\)
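Precision, recall, and the F measure can be sketched from the counts of true positives, false positives, and false negatives. The function name, the convention of returning 0.0 when a denominator is zero, and the toy labels are assumptions for this example.

```python
def precision_recall_f(y_true, y_pred, positive=1, beta=1.0):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Assumed convention: an undefined ratio is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    # F = (beta^2 + 1) * P * R / (beta^2 * P + R)
    f = (beta ** 2 + 1) * precision * recall / denom if denom else 0.0
    return precision, recall, f

# Hypothetical labels: 3 true positives in the data, 4 items selected by the model.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

p, r, f1 = precision_recall_f(y_true, y_pred)
print(p, r, f1)
```

Setting `beta` above 1 moves the F measure toward recall, and setting it below 1 moves it toward precision.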
If (x or y) and not (w or z) then categorize as class1
Need careful crafting
Low accuracy
Domain-specific
Time-consuming
Active learning
Unsupervised methods (tomorrow)
Use Naïve Bayes, KNN, Rocchio
Get more labeled data
Find ways to label data
Try semi-supervised methods
Try transfer learning
Works with the more complex classifiers
Random forest
Neural network
Can achieve high accuracy!
At a cost:
With enough data
Domain-specific features and weights: very important in real performance
Sometimes need to collapse terms:
Part numbers, chemical formulas, …
But stemming generally doesn’t help
Upweighting: Counting a word as if it occurred twice:
Title words
First sentence of each paragraph (Murata, 1999)
In sentences that contain title words
Hyperparameter optimization
Text classification → supervised learning → labeled data
Evaluation