Lecture’s Plan

  • How to represent a document?

  • What are vector space and bag-of-words models?

  • How to classify text data?

  • How to evaluate a classifier?

Text Classification

Text classification

  • Supervised learning: Learning a function that maps an input to an output based on example input-output pairs.

    • infer a function from labeled training data

    • use the inferred function to label new instances

  • Human experts annotate a set of text data

    • Training set

Text classification?

  • Which problem is not (or less likely to be) a text classification task?

    • Author’s gender detection from text

    • Determining the smoking status (yes/no) of patients from clinical letters

    • Grouping similar news articles

    • Classifying reviews into positive and negative sentiments

Pipeline

Text Representation

How to represent a document

  • Represent by a string?

    • No semantic meaning

  • Represent by a list of sentences?

    • Sentence is just like a short document (recursive definition)

Vector space model

  • A vector space is a collection of vectors

  • A vector is an ordered finite list of numbers.

  • Represent documents by concept vectors

    • Each concept defines one dimension

    • A large number of concepts define a high-dimensional space

    • Element of vector corresponds to concept weight

    • The process of converting text into numbers is called Vectorization

  • Distance between the vectors in this concept space

    • Relationship among documents

Vector space model

  • Terms are generic features that can be extracted from text

  • Typically, terms are single words, keywords, n-grams, or phrases

  • Documents are represented as vectors of terms

  • Each dimension (concept) corresponds to a separate term

\[\mathbf{d} = (w_1, \ldots, w_n)\]

An illustration of VS model

  • All documents are projected into this concept space

Vector space model

  • Bag of Words, a Vector Space Model where:

    • Terms: words (more generally we may use n-grams, etc.)

    • Weights: number of occurrences of the terms in the document

    • Document-Term Matrix (DTM), Term-Document Matrix (TDM)

  • Topics (later)

  • Word embeddings (later)

Bag of Words (BOW)

  • Words (terms) and weights as the basis for vector representations of text

    • Doc1: Text mining is to identify useful information.

    • Doc2: Useful information is mined from text.

    • Doc3: Apple is delicious.
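A minimal sketch of vectorization, assuming scikit-learn is available (the library choice is ours, not the lecture's), that turns the three example documents into a document-term matrix:

```python
# Bag-of-words document-term matrix for the three example documents.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Text mining is to identify useful information.",
    "Useful information is mined from text.",
    "Apple is delicious.",
]

vectorizer = CountVectorizer()           # terms = single lowercased words
dtm = vectorizer.fit_transform(docs)     # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # vocabulary = the dimensions
print(dtm.toarray())                       # raw term counts per document
```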

BOW weights: Binary

  • Binary

    • with 1 indicating that a term occurred in the document, and 0 indicating that it did not
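A minimal variant of the sketch above, again assuming scikit-learn: `binary=True` clips counts to presence/absence.

```python
# Binary BOW weights: 1 if the term occurs in the document, else 0.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Text mining is to identify useful information.",
        "Useful information is mined from text.",
        "Apple is delicious."]

binary_dtm = CountVectorizer(binary=True).fit_transform(docs)
print(binary_dtm.toarray())  # entries are 0/1 presence indicators
```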

BOW weights: Raw term frequency

  • Idea: a term is more important if it occurs more frequently in a document

  • use the raw frequency count of term \(t\) in doc \(d\)

BOW weights: TF-IDF

  • Idea: a term is more discriminative if it occurs frequently, but in only a few documents.

  • TF-IDF (term frequency–inverse document frequency) weight:

\[w_{d,t} = TF_{d,t} \cdot IDF_t\]

Let \(n_{d,t}\) denote the number of times term \(t\) appears in document \(d\). The relative frequency of \(t\) in \(d\) is:

\[TF_{d,t} = \frac{n_{d,t}}{\sum_i{n_{d,i}}}\]

Let \(N\) denote the number of documents and \(N_t\) denote the number of documents containing term \(t\).

\[IDF_t = \log\left(\frac{N}{N_t}\right)\]
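A hand-rolled sketch of these formulas, assuming NumPy and scikit-learn; note that library TF-IDF implementations (e.g. scikit-learn's TfidfVectorizer) use smoothed variants, so their values differ slightly from this literal translation:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Text mining is to identify useful information.",
        "Useful information is mined from text.",
        "Apple is delicious."]

counts = CountVectorizer().fit_transform(docs).toarray().astype(float)  # n_{d,t}

tf = counts / counts.sum(axis=1, keepdims=True)  # TF_{d,t} = n_{d,t} / sum_i n_{d,i}
N = counts.shape[0]                              # number of documents
N_t = (counts > 0).sum(axis=0)                   # documents containing term t
idf = np.log(N / N_t)                            # IDF_t = log(N / N_t)

tfidf = tf * idf                                 # w_{d,t} = TF_{d,t} * IDF_t
print(np.round(tfidf, 3))
```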

Classification Algorithms

How to classify this document?

Text classification: Definition

  • Supervised learning

  • Input:

    • A training set of \(m\) manually-labeled documents \((d_1,c_1),\cdots,(d_m,c_m)\)

    • A fixed set of classes \(C = \{c_1, c_2, \ldots, c_J\}\)

  • Output:

    • A learned classifier \(y:d \rightarrow c\)

Hand-coded rules

  • Rules based on combinations of words or other features

  • Rules carefully refined by expert

  • But building and maintaining these rules is expensive

  • Rules are data/domain specific

  • Not recommended!

Supervised machine learning

  • Nearest centroid
  • K-nearest neighbors
  • Naïve Bayes
  • Decision tree
  • Random forest
  • Support vector machines
  • Logistic regression
  • Neural networks (deep learning)

Rocchio classifier (Nearest Centroid)

Each class is represented by its centroid, and test samples are classified to the class with the nearest centroid. Using a training set of documents, the Rocchio algorithm builds a prototype vector (the centroid) for each class: the average of the vectors of the training documents that belong to that class.

\[\boldsymbol{\mu_c} = \frac{1}{|D_c|}\sum_{\mathbf{d} \in D_c}{\mathbf{d}}\]

where \(D_c\) is the set of documents in the corpus that belong to class \(c\) and \(\mathbf{d}\) is the vector representation of document \(d\).

Rocchio classifier (Nearest Centroid)

The predicted label of document \(d\) is the class whose centroid has the smallest (Euclidean) distance to the document vector.

\[\hat{c} = \arg \min_c ||\boldsymbol{\mu_c} - \mathbf{d}||\]
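A minimal nearest-centroid sketch on toy vectors, assuming NumPy (scikit-learn packages the same idea as sklearn.neighbors.NearestCentroid):

```python
import numpy as np

X_train = np.array([[2.0, 0.0], [1.0, 1.0],   # class 0 documents
                    [0.0, 2.0], [0.0, 1.0]])  # class 1 documents
y_train = np.array([0, 0, 1, 1])

# mu_c: average of the training vectors that belong to class c
centroids = {c: X_train[y_train == c].mean(axis=0) for c in np.unique(y_train)}

def predict(d):
    # assign the class whose centroid is nearest in Euclidean distance
    return min(centroids, key=lambda c: np.linalg.norm(centroids[c] - d))

print(predict(np.array([1.5, 0.5])))  # -> 0
```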

K-Nearest Neighbor (KNN)

KNN

  • Given a test document \(d\), the KNN algorithm finds the \(k\) nearest neighbors of \(d\) among all documents in the training set, and scores the candidate categories based on the classes of the \(k\) neighbors.

  • After sorting the score values, the algorithm assigns the document to the class with the highest score.

  • The basic nearest neighbors classification uses uniform weights: that is, the value assigned to a query point is computed from a simple majority vote of the nearest neighbors.

  • Can weight the neighbors such that nearer neighbors contribute more to the fit.
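A minimal KNN sketch, assuming scikit-learn; weights="distance" gives nearer neighbors a larger vote, as in the last bullet:

```python
from sklearn.neighbors import KNeighborsClassifier

X_train = [[2.0, 0.0], [1.0, 1.0], [0.0, 2.0], [0.0, 1.0]]
y_train = [0, 0, 1, 1]

# k = 3 neighbors; weights="distance" makes nearer neighbors count more
knn = KNeighborsClassifier(n_neighbors=3, weights="distance")
knn.fit(X_train, y_train)
print(knn.predict([[1.5, 0.5]]))  # -> [0]
```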

Naïve Bayes

Bayes rule

  • Applied to documents and classes

  • For a document \(d\) and a class \(c\)

\[P(c|d) = \frac{P(c)P(d|c)}{P(d)}\]

Multinomial naïve Bayes assumptions

\[C_{NB} = \underset{c \in C}{\operatorname{argmax}}P(c) \cdot P(w_1, w_2, \ldots,w_n|c)\]

  • Bag of Words assumption: Assume position doesn’t matter

  • Conditional Independence: Assume the feature probabilities \(P(w_i|c)\) are independent given the class \(c\).

\[P(w_1, \ldots, w_n|c) = P(w_1 | c) \cdot P(w_2|c) \cdot P(w_3|c) \cdot \ldots \cdot P(w_n|c)\]

  • Hence:

\[C_{NB} = \underset{c \in C}{\operatorname{argmax}}P(c) \cdot P(w_1 | c) \cdot P(w_2|c) \cdot P(w_3|c) \cdot \ldots \cdot P(w_n|c) \] \[C_{NB} = \underset{c \in C}{\operatorname{argmax}}P(c)\prod_{i \in positions}{P(w_i|c)}\]

Parameter estimation

  • First attempt: maximum likelihood estimates

    • simply use the frequencies in the data

\[\hat{P}(c) = \frac{count(C = c)}{N_{doc}}\] \[\hat{P}(w_i|c) = \frac{count(w_i, c)}{\sum_{w \in V}count(w, c)}\]

Problem with Maximum Likelihood

What if we have seen no training documents that contain the word coffee and are labeled positive (thumbs-up)?

\[\hat{P}(\text{"coffee"} \mid \text{positive}) = \frac{count(\text{"coffee"}, \text{positive})}{\sum_{w \in V}{count(w, \text{positive})}} = 0\]

Zero probabilities cannot be conditioned away, no matter the other evidence!

\[C_{NB} = \underset{c \in C}{\operatorname{argmax}}P(c)\prod_{i \in positions}{P(w_i|c)}\]

Laplace (add-1) smoothing for naïve Bayes

\[ \hat{P}(w_i|c) = \frac{count(w_i,c)+1}{\sum_{w \in V}{(count(w,c)+1)}} \]
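A minimal sketch, assuming scikit-learn: MultinomialNB implements multinomial naïve Bayes, and its default alpha=1.0 is exactly the add-1 (Laplace) smoothing above. The toy documents are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = ["loved the service", "great place",   # positive
              "terrible coffee", "awful coffee"]    # negative
train_labels = ["positive", "positive", "negative", "negative"]

# alpha=1.0 is add-1 (Laplace) smoothing, per the formula above
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(train_docs, train_labels)

# "coffee" never occurs in a positive training document, yet the smoothed
# estimate P(coffee|positive) is small but non-zero, so one unseen
# word-class pair can no longer force a class probability to exactly 0.
print(model.predict_proba(["great coffee"]))  # both columns non-zero
```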

Decision tree

  • A decision tree is a hierarchical decomposition of the (training) data space, where a condition on the feature value is used to divide the data space hierarchically.

  • Trees are built top-down, choosing at each step the variable that best splits the set of items.

  • Different algorithms to measure the homogeneity of the target variable within the subsets (e.g. Gini impurity, information gain)
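A minimal decision-tree sketch on toy term counts, assuming scikit-learn; criterion selects the homogeneity measure named above:

```python
from sklearn.tree import DecisionTreeClassifier

X = [[2, 0], [1, 1], [0, 2], [0, 1]]   # toy term counts
y = [0, 0, 1, 1]

# each split tests one term's weight; criterion="gini" or "entropy"
tree = DecisionTreeClassifier(criterion="gini").fit(X, y)
print(tree.predict([[1, 0]]))  # -> [0]
```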

Random forest

  • Random forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time.
  • Fit multiple trees to bootstrapped samples of the data and, at each node, select the best predictor from only a random subset of predictors. Combine all trees to yield a consensus prediction.
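The corresponding random-forest sketch, assuming scikit-learn (the toy data is illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

X = [[2, 0], [1, 1], [0, 2], [0, 1]]   # toy term counts
y = [0, 0, 1, 1]

# 100 trees on bootstrap samples, random feature subsets at each split;
# the prediction is the majority vote over all trees
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[1, 0]]))  # -> [0]
```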

Support Vector Machine (SVM)

  • The main principle of SVM is to determine separators (decision boundaries) in the feature space that best separate the different classes.
  • SVM tries to make a decision boundary in such a way that the separation between the two classes is as wide as possible.

SVM

  • It is not necessary to use a linear function for the SVM classifier.
  • With the kernel trick, SVM can construct a nonlinear decision surface in the original feature space by mapping the data instances non-linearly to a new space where the classes can be separated linearly with a hyperplane.
  • SVM is quite robust to high dimensionality.
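A minimal sketch of the kernel trick, assuming scikit-learn: an XOR-style layout that no linear boundary can separate, but an RBF kernel can:

```python
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [0, 1], [1, 0]]   # XOR layout: not linearly separable
y = [0, 0, 1, 1]

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)

print(linear_svm.score(X, y))  # < 1.0: no hyperplane separates XOR
print(rbf_svm.score(X, y))     # 1.0: the kernel trick finds a nonlinear surface
```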

Evaluation

Data splitting

  • Training set

    • Validation set (dev set)
      • A dataset of examples used to tune the hyperparameters (e.g. the architecture) of a classifier. It is sometimes also called the development set or the “dev set”.
  • Test set

K-fold cross validation
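A minimal k-fold cross-validation sketch, assuming scikit-learn; the toy documents and labels are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["good film", "great plot", "loved it",
        "bad film", "awful plot", "hated it"]
labels = [1, 1, 1, 0, 0, 0]

# k = 3 folds: train on 2/3 of the data, test on the held-out 1/3, rotate
model = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(model, docs, labels, cv=3)
print(scores, scores.mean())
```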

Confusion matrix

Accuracy

  • What proportion of instances is correctly classified?

\[\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}\]

  • Accuracy is a valid choice of evaluation for classification problems which are well balanced and not skewed.

  • But suppose the target class is very sparse. Is accuracy still a good measure of model performance? If we are predicting whether an asteroid will hit the Earth, a model that always says “No” will be over 99% accurate, yet completely useless.

Precision and recall

  • Precision (also Positive Predictive Value): % of selected/retrieved items that are correct/relevant

  • Recall (also sensitivity): % of correct/relevant items that are selected/retrieved.

  • Precision is a valid choice of evaluation metric when we want to be very sure of our prediction.

  • Recall is a valid choice of evaluation metric when we want to capture as many positives as possible.

A combined measure: F

A combined measure that assesses the precision/recall tradeoff is F measure (weighted harmonic mean):

\[F = \frac{(\beta^2+1)PR}{\beta^2P + R}\] where \(\beta\) is a positive real number and is chosen such that recall is considered \(\beta\) times as important as precision.

Balanced F1 measure: \(\beta = 1\), \(F = 2PR/(P+R)\)
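A sketch tying the evaluation metrics above together, assuming scikit-learn; the predictions are hypothetical:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

y_true = [1, 1, 1, 0, 0, 0, 0, 0]   # 1 = positive class
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]   # TP=2, FN=1, FP=1, TN=4

print(confusion_matrix(y_true, y_pred))  # rows = true, cols = predicted
print(accuracy_score(y_true, y_pred))    # (TP+TN)/(TP+FP+FN+TN) = 0.75
print(precision_score(y_true, y_pred))   # TP/(TP+FP) = 2/3
print(recall_score(y_true, y_pred))      # TP/(TP+FN) = 2/3
print(f1_score(y_true, y_pred))          # 2PR/(P+R) = 2/3
```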

The Real World

No training data?

  1. Manually written rules

     • If (x or y) and not (w or z) then categorize as class1

     • Need careful crafting

     • Low accuracy

     • Domain-specific

     • Time-consuming

  2. Active learning

  3. Unsupervised methods (tomorrow)

Very little data?

  • Use Naïve Bayes, KNN, Rocchio

  • Get more labeled data

  • Find ways to label data

  • Try semi-supervised methods

  • Try transfer learning

A reasonable amount of data?

  • More complex classifiers work well:

    • SVM
    • Random forest

    • Neural network

A huge amount of data?

  • Can achieve high accuracy!

  • At a cost:

    • SVM
    • NN, deep learning (train time)

Accuracy as a function of data size

How to tweak performance

  • Domain-specific features and weights: very important in real performance

  • Sometimes need to collapse terms:

    • Part numbers, chemical formulas, …

    • But stemming generally doesn’t help

  • Upweighting: Counting a word as if it occurred twice:

    • Title words

    • First sentence of each paragraph (Murata, 1999)

    • In sentences that contain title words

  • Hyperparameter optimization
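For the last bullet, a hedged hyperparameter-optimization sketch using scikit-learn's GridSearchCV; the pipeline step names and grid values are illustrative choices, not prescribed ones:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs uni+bigrams
    "svm__C": [0.1, 1, 10],                  # regularization strength
}

search = GridSearchCV(pipeline, param_grid, cv=5)  # 5-fold CV per setting
# search.fit(train_docs, train_labels)   # fit on your labeled training set
# print(search.best_params_, search.best_score_)
```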

Summary

Summary

  • Vector space model & BOW
  • Text classification → supervised learning → labeled data

  • Evaluation

Practical 3