In this practical, we are going to learn about feature selection and dimension reduction methods for text data.
Today we will use the following libraries; make sure they are installed before you start.
from sklearn.datasets import load_files
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import mutual_info_classif
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
1. Here we are going to use a news article data set originating from the BBC news website, provided for benchmarking machine learning algorithms. The BBC data set consists of 2,225 documents and 5 categories: business, entertainment, politics, sport, and tech. Upload the data.zip file and extract it using the code below.
Load the dataset and convert it to a dataframe.
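A minimal sketch of this step, assuming data.zip sits in the working directory and unpacks into a bbc folder with one sub-directory per category (adjust the paths to your setup):
import zipfile

# Extract the archive into a local data/ folder
with zipfile.ZipFile("data.zip", "r") as archive:
    archive.extractall("data")

# load_files uses the sub-directory names as class labels
bbc = load_files("data/bbc", encoding="utf-8", decode_error="replace")

# Convert to a dataframe for easier inspection
df = pd.DataFrame({"text": bbc.data, "label": bbc.target})
df["label_name"] = df["label"].apply(lambda i: bbc.target_names[i])
df.head()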
2. Print the unique target names in your data and check the number of articles in each category. Then split your data into training (80%) and test (20%) sets.
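One possible solution, reusing the bbc bunch and df from Question 1; the stratified split and the random seed are choices made here, not requirements:
# Unique target names and number of articles per category
print(bbc.target_names)
print(df["label_name"].value_counts())

# 80/20 split, stratified so each category keeps its proportion
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)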
3. Use the CountVectorizer from sklearn to convert the text data into a document-term matrix. What is the difference between CountVectorizer and TfidfVectorizer(use_idf=False)?
With use_idf=False, the only difference is that TfidfVectorizer() returns floats while CountVectorizer() returns ints. That is to be expected: as the scikit-learn documentation explains, TfidfVectorizer() assigns a score while CountVectorizer() counts. (Strictly speaking, TfidfVectorizer() also L2-normalizes each document by default; pass norm=None to recover the raw counts as floats.)
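One way to check this, reusing X_train from Question 2 (the English stop word list is a choice made here, not prescribed by the practical):
# Document-term matrix of raw counts
count_vect = CountVectorizer(stop_words="english")
X_train_counts = count_vect.fit_transform(X_train)
print(X_train_counts.shape, X_train_counts.dtype)   # integer counts

# Same preprocessing; use_idf=False and norm=None reproduces the counts as floats
tfidf_vect = TfidfVectorizer(stop_words="english", use_idf=False, norm=None)
X_train_tf = tfidf_vect.fit_transform(X_train)
print(X_train_tf.dtype)                             # float64
print((X_train_counts != X_train_tf).nnz == 0)      # True: identical values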
4. Print the 20 most frequent words in the training set.
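A sketch using the count matrix and vectorizer from Question 3:
# Total count of each term over all training documents
term_counts = np.asarray(X_train_counts.sum(axis=0)).ravel()
vocab = np.array(count_vect.get_feature_names_out())  # get_feature_names() in older sklearn

# Show the 20 most frequent words
top20 = np.argsort(term_counts)[::-1][:20]
for word, count in zip(vocab[top20], term_counts[top20]):
    print(word, count)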
5. From the feature selection module in sklearn, load the SelectKBest function and apply it to the BBC dataset using the chi-squared method. Extract the top 20 features.
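A possible solution, reusing X_train_counts, y_train, and vocab from the previous questions:
# Score every term against the class labels with the chi-squared statistic
chi2_selector = SelectKBest(chi2, k=20)
chi2_selector.fit(X_train_counts, y_train)

# Names of the 20 selected features
chi2_top = vocab[chi2_selector.get_support()]
print(chi2_top)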
6. Repeat the analysis in Question 5 with the mutual information feature selection method. Do you get the same list of words as with the chi-squared method?
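The same idea with mutual information; note that mutual_info_classif is stochastic and can be slow on a large sparse matrix, so a random_state is fixed here for reproducibility:
# Mutual information between each term and the class labels
mi_selector = SelectKBest(
    lambda X, y: mutual_info_classif(X, y, discrete_features=True, random_state=42),
    k=20,
)
mi_selector.fit(X_train_counts, y_train)

mi_top = vocab[mi_selector.get_support()]
print(mi_top)
print(set(mi_top) & set(chi2_top))   # overlap with the chi-squared selection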
Now you can build a classifier and train it using the output of these feature selection techniques. We are not going to do this right now, but if you are interested you can transform your training and test sets using the selected features and continue with your classifier! Here are some tips:
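For example, a sketch of how the chi-squared selector from Question 5 could be used to transform both sets and train a simple classifier (not a required step of this practical):
# Reduce both sets to the 20 selected features
X_train_selected = chi2_selector.transform(X_train_counts)
X_test_counts = count_vect.transform(X_test)          # reuse the fitted vectorizer
X_test_selected = chi2_selector.transform(X_test_counts)

# Any classifier can now be trained on the reduced representation
clf = MultinomialNB().fit(X_train_selected, y_train)
print(metrics.accuracy_score(y_test, clf.predict(X_test_selected)))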
7. One of the functions for embedded feature selection is the SelectFromModel function in sklearn. Use this function with an L1 norm SVM and check how many non-zero coefficients are left in the model.
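A minimal sketch, fitting the selector on the count matrix from Question 3 (LinearSVC needs dual=False when penalty="l1"; C and max_iter are choices made here):
# An L1-penalized linear SVM drives many coefficients to exactly zero
svm_l1 = LinearSVC(C=1.0, penalty="l1", dual=False, max_iter=5000)
model = SelectFromModel(svm_l1)
model.fit(X_train_counts, y_train)

# Features with at least one non-zero coefficient across the five classes
print(model.get_support().sum(), "features kept out of", X_train_counts.shape[1])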
8. What are the top features according to the SVM model? Tip: use the model.get_support() function to find these features.
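Continuing from the fitted model object in Question 7:
# get_support() returns a boolean mask over the original vocabulary
svm_top = vocab[model.get_support()]
print(svm_top[:20])   # first 20 of the retained features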
9. Create a pipeline with the tfidf representation and a random forest classifier.
10. Fit the pipeline on the training set.
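A sketch covering Questions 9 and 10, assuming the raw-text splits from Question 2 (the random forest settings are choices made here):
# Question 9: tfidf representation followed by a random forest
clf1 = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# Question 10: fit on the raw training texts and labels
clf1.fit(X_train, y_train)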
11. Use the pipeline to predict the outcome variable on your test set. Evaluate the performance of the pipeline using the classification_report function on the test subset. How do you interpret your results?
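A possible continuation for the fitted pipeline clf1:
y_pred = clf1.predict(X_test)
print(metrics.classification_report(y_test, y_pred, target_names=bbc.target_names))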
12. Create your second pipeline with the tfidf representation and a random forest classifier, this time adding an embedded feature selection step that uses an SVM with an L1 penalty. Fit the pipeline on your training set and test it on the test set. How does the performance change?
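A sketch of the second pipeline, inserting the SelectFromModel step between the tfidf transform and the random forest:
clf2 = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("selector", SelectFromModel(LinearSVC(penalty="l1", dual=False, max_iter=5000))),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

clf2.fit(X_train, y_train)
print(metrics.classification_report(y_test, clf2.predict(X_test),
                                    target_names=bbc.target_names))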
13. Create your third and fourth pipelines with the tfidf representation, chi2 feature selection (with 20 and 200 features for clf3 and clf4, respectively), and a random forest classifier.
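Possible versions of clf3 and clf4, differing only in the number of selected features (the helper function is just a convenience used in this sketch):
def chi2_pipeline(k):
    # tfidf representation, chi2 selection of the k best features, random forest
    return Pipeline([
        ("vect", CountVectorizer(stop_words="english")),
        ("tfidf", TfidfTransformer()),
        ("chi2", SelectKBest(chi2, k=k)),
        ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ])

clf3 = chi2_pipeline(20).fit(X_train, y_train)
clf4 = chi2_pipeline(200).fit(X_train, y_train)
for name, clf in [("clf3 (k=20)", clf3), ("clf4 (k=200)", clf4)]:
    print(name, metrics.accuracy_score(y_test, clf.predict(X_test)))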
14. We can change the learner by simply plugging a different classifier object into our pipeline. Create your fifth pipeline with the L1 norm SVM as the feature selection method and naive Bayes as the classifier. Compare your results on the test set with the previous pipelines.
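A sketch of the fifth pipeline, keeping the L1 norm SVM selector and swapping in MultinomialNB as the final estimator:
clf5 = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("selector", SelectFromModel(LinearSVC(penalty="l1", dual=False, max_iter=5000))),
    ("clf", MultinomialNB()),
])

clf5.fit(X_train, y_train)
print(metrics.classification_report(y_test, clf5.predict(X_test),
                                    target_names=bbc.target_names))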
15. Dimensionality reduction methods such as PCA and SVD can be used to project the data into a lower dimensional space. If you run PCA with your text data, you might end up with the message:
PCA does not support sparse input. See TruncatedSVD for a possible alternative.
Therefore, we will use the TruncatedSVD function from the sklearn package, and we want to find out how much of the variance in the BBC data set is explained by different numbers of components. For this, first create a tfidf matrix and use that to make a co-occurrence matrix.
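One way to set this up, assuming the co-occurrence matrix is the term-by-term product of the tfidf matrix with itself (this interpretation is an assumption; adapt it if your course material defines the matrix differently):
# tfidf matrix of the training documents
tfidf_vectorizer = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Term-term co-occurrence matrix weighted by tfidf (terms x terms, sparse)
X_cooc = X_tfidf.T @ X_tfidf
print(X_cooc.shape)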
16. Run the TruncatedSVD function with different values for the number of components: 1, 2, 4, 5, 10, 15, 20, 50, 100. Plot the explained variance ratio for each of these settings.
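A sketch of the loop and plot, using X_cooc from Question 15 (this can take a while for the larger component counts):
n_components_list = [1, 2, 4, 5, 10, 15, 20, 50, 100]
explained = []

for n in n_components_list:
    svd = TruncatedSVD(n_components=n, random_state=42)
    svd.fit(X_cooc)
    explained.append(svd.explained_variance_ratio_.sum())

plt.plot(n_components_list, explained, marker="o")
plt.xlabel("Number of components")
plt.ylabel("Total explained variance ratio")
plt.title("Truncated SVD on the BBC data")
plt.show()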
17. How many components are needed to explain at least 95% of the variance?
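One way to answer this is to fit a single SVD with a generous upper bound and inspect the cumulative ratio (300 is an arbitrary choice made here):
svd = TruncatedSVD(n_components=300, random_state=42)
svd.fit(X_cooc)
cumulative = np.cumsum(svd.explained_variance_ratio_)

if cumulative[-1] >= 0.95:
    n_95 = int(np.argmax(cumulative >= 0.95)) + 1
    print(n_95, "components explain at least 95% of the variance")
else:
    print("More than 300 components are needed; increase the upper bound")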
18. Use these components and train an SVM model on the BBC dataset. Make a pipeline for your model. Compare your results on the test set with the previous pipelines.
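A sketch of the final pipeline, with a placeholder of 100 components (replace it with the number you found in Question 17):
n_best = 100   # placeholder: the number of components found in Question 17

clf6 = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svd", TruncatedSVD(n_components=n_best, random_state=42)),
    ("clf", LinearSVC()),
])

clf6.fit(X_train, y_train)
print(metrics.classification_report(y_test, clf6.predict(X_test),
                                    target_names=bbc.target_names))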