How to do feature selection (FS) for text data?
Is PCA a FS method for text?
Other methods?
The additional features typically add noise: machine learning will pick up on spurious correlations that may hold in the training set but not in the test set.
For some ML methods, more features means more parameters to learn (more NN weights, more decision-tree nodes, etc.)
The increased space of possibilities is more difficult to search.
Relevance popularity: A term event model based feature selection scheme for text classification: https://doi.org/10.1371/journal.pone.0174341
Why we need FS:
1. To improve performance (in terms of speed, predictive power, simplicity of the model).
2. To visualize the data for model selection.
3. To reduce dimensionality and remove noise.
Feature Selection is a process that chooses an optimal subset of features according to a certain criterion.
Feature selection is the process of selecting a specific subset of the terms of the training set and using only them in the classification algorithm.
Let \(p(c | t)\) be the conditional probability that a document belongs to class \(c\), given the fact that it contains the term \(t\). Therefore, we have:
\[\sum^k_{c=1}{p(c | t)} = 1\]
Then, the gini-index for the term \(t\), denoted by \(G(t)\) is defined as:
\[G(t) = \sum^k_{c=1}{p(c | t)^2}\]
The value of the gini-index lies in the range \([1/k, 1]\): it is \(1/k\) when the term is evenly distributed across all \(k\) classes and \(1\) when it occurs in documents of only one class.
Higher values of the gini-index indicate a greater discriminative power of the term \(t\).
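As a minimal sketch in plain Python (the helper name `gini_index` is my own), \(G(t)\) can be estimated from per-class counts of the documents that contain term \(t\):

```python
def gini_index(class_counts):
    """G(t) = sum over classes of p(c|t)^2, where class_counts maps
    a class label to the number of its documents containing term t."""
    total = sum(class_counts.values())
    return sum((n / total) ** 2 for n in class_counts.values())

# A term concentrated in one class is highly discriminative:
print(gini_index({"sports": 98, "politics": 2}))   # close to 1 (about 0.96)
# A term spread evenly over k = 2 classes hits the lower bound 1/k:
print(gini_index({"sports": 50, "politics": 50}))  # 0.5
```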
\({\chi}^2\) statistics with multiple categories
\({\chi}^2_{avg}(t)=\sum_c{p(c)\,{\chi}^2(c,t)}\)
\({\chi}^2_{max}(t) = \max_c\ {\chi}^2(c,t)\)
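A plain-Python sketch of the per-class statistic and the max-combination (the 2×2 closed form is the standard chi-square for a term/class contingency table; the helper names are my own):

```python
def chi2_term_class(n11, n10, n01, n00):
    """chi^2(c, t) from a 2x2 contingency table of document counts:
    n11: in class c, contains t     n10: not in c, contains t
    n01: in class c, lacks t        n00: not in c, lacks t
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator

def chi2_max(tables):
    """chi^2_max(t): the maximum per-class statistic over all classes."""
    return max(chi2_term_class(*table) for table in tables)

# When t is distributed independently of c, the statistic is 0:
print(chi2_term_class(10, 10, 10, 10))  # 0.0
```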
Many other metrics (Same trick as in \(\chi^2\) statistics for multi-class cases)
Mutual information
\[PMI(t;c) = \log\frac{p(t,c)}{p(t)\,p(c)}\]
Odds ratio
\[Odds(t;c) = \frac{p(t \mid c)}{1 - p(t \mid c)} \times \frac{1 - p(t \mid \bar{c})}{p(t \mid \bar{c})}\]
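Both scores can be estimated from document counts. A plain-Python sketch (function names are illustrative; the odds ratio here uses the conditional probabilities \(p(t \mid c)\), its standard formulation):

```python
import math

def pmi(n_tc, n_t, n_c, n):
    """log p(t,c) / (p(t) p(c)), with probabilities estimated from counts:
    n_tc docs of class c containing t, n_t docs containing t,
    n_c docs of class c, n docs in total."""
    return math.log((n_tc / n) / ((n_t / n) * (n_c / n)))

def odds_ratio(p_t_given_c, p_t_given_notc):
    """Odds of seeing t inside class c versus outside class c."""
    return (p_t_given_c / (1 - p_t_given_c)) * \
           ((1 - p_t_given_notc) / p_t_given_notc)

# Independence between t and c gives a PMI of 0 and an odds ratio of 1:
print(pmi(10, 20, 50, 100))   # 0.0
print(odds_ratio(0.5, 0.5))   # 1.0
```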
Embedded methods fold selection into training: with \(\sigma\) a feature-selection indicator applied elementwise to each input \(x_k\), \(L\) a loss, and \(\Omega\) a regularizer, the model parameters \(\alpha\) minimize the regularized empirical risk:
\[\min_\alpha{\hat{R}(\alpha, \sigma)} = \min_\alpha{\sum_{k=1}^m{L(f(\alpha, \sigma \circ x_k), y_k) + \Omega(\alpha)}}\]
PCA is one of the most common feature reduction techniques
A linear method for dimensionality reduction
Allows us to combine much of the information contained in \(n\) features into \(p\) features where \(p < n\)
PCA is unsupervised in that it does not consider the output class / value of an instance – There are other algorithms which do (e.g. LDA: Linear Discriminant Analysis)
PCA works well in many cases where data have mostly linear correlations
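An illustrative PCA sketch with NumPy (an assumed dependency): five correlated features are projected onto the top \(p = 2\) principal components. For large sparse text matrices one would typically use a truncated SVD instead, since centering destroys sparsity.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 100 samples whose 5 features are linear mixes of 2 latent factors.
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(100, 5))

# PCA = eigendecomposition of the covariance of the centered data.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]         # sort descending by variance
components = eigvecs[:, order[:2]]        # keep p = 2 of the n = 5 directions
X_reduced = Xc @ components               # shape (100, 2)

# Nearly all variance survives, because the data is essentially rank 2:
explained = eigvals[order[:2]].sum() / eigvals.sum()
print(X_reduced.shape, float(explained))
```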
https://scikit-learn.org/stable/modules/cross_validation.html
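The linked scikit-learn guide covers ready-made splitters such as `KFold` and `cross_val_score`; a dependency-free sketch of the underlying k-fold idea:

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k roughly equal, disjoint folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

# Every example lands in exactly one test fold:
for train, test in kfold_indices(10, 3):
    print(len(train), test)
```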
What proportion of instances is correctly classified?
(TP + TN) / (TP + FP + FN + TN)
Accuracy is a valid choice of evaluation metric for classification problems whose classes are well balanced, not skewed.
Suppose our target class is very sparse: do we still want accuracy as the measure of model performance? If we are predicting whether an asteroid will hit the earth, we can just say "No" every time and be 99% accurate. The model is reasonably accurate, but not at all valuable.
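The asteroid example in numbers (plain Python, counts chosen for illustration):

```python
def accuracy(tp, fp, fn, tn):
    """(TP + TN) / (TP + FP + FN + TN)."""
    return (tp + tn) / (tp + fp + fn + tn)

# 1000 asteroids, 10 of which actually hit; a model that always says "No"
# never raises an alarm: tp = 0, fp = 0, fn = 10, tn = 990.
print(accuracy(tp=0, fp=0, fn=10, tn=990))  # 0.99 -- accurate but worthless
```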
Precision: % of selected items that are correct
Recall: % of correct items that are selected
Precision is a valid choice of evaluation metric when we want to be very sure of our prediction.
Recall is a valid choice of evaluation metric when we want to capture as many positives as possible.
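The same confusion-matrix counts give both metrics. A plain-Python sketch with illustrative numbers:

```python
def precision(tp, fp):
    """Of the items the model selected, what fraction is correct?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of the correct items, what fraction did the model select?"""
    return tp / (tp + fn)

# 10 items selected, 8 of them correct; 10 positives exist in total:
print(precision(tp=8, fp=2))  # 0.8
print(recall(tp=8, fn=2))     # 0.8
```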
A combined measure that assesses the P/R tradeoff is F measure (weighted harmonic mean):
\[F = \frac{1}{\alpha\frac{1}{P}+(1-\alpha)\frac{1}{R}}=\frac{(\beta^2+1)PR}{\beta^2P + R}\]
The harmonic mean is a very conservative average: it is dominated by the smaller of \(P\) and \(R\).
Balanced F1 measure - i.e., with \(\beta = 1\) (that is, \(\alpha = 1/2\)): \(F = 2PR/(P+R)\)
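A sketch of the weighted harmonic mean in plain Python; note how it is pulled toward the smaller of \(P\) and \(R\) (the arithmetic mean of 0.5 and 1.0 is 0.75, but the harmonic mean is only 2/3):

```python
def f_beta(p, r, beta=1.0):
    """F = (beta^2 + 1) P R / (beta^2 P + R); beta = 1 gives balanced F1."""
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

print(f_beta(0.5, 1.0))       # F1 = 2PR/(P+R) = 1/1.5, about 0.667
print(f_beta(0.5, 1.0, 2.0))  # beta > 1 weights recall more heavily
```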