Practical 6: Deep Learning for Multiclass Text Classification¶

Ayoub Bagheri¶


Applied Text Mining - Utrecht Summer School¶

In this practical, we will apply various deep learning models for multiclass text classification. We will work with the famous 20 Newsgroups dataset from the sklearn library and apply deep learning models using the keras library.

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It was originally collected by Ken Lang, and it has become a popular data set for experiments in text applications of machine learning techniques.

We will also use the keras library, a deep learning and neural networks API developed by François Chollet's team that is capable of running on top of TensorFlow (Google), Theano, or CNTK (Microsoft).

Today we will use the following libraries. Take care to have them installed!

In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import fetch_20newsgroups
from sklearn.preprocessing import LabelEncoder

from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras import layers, utils

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

Let's get started!¶

1. Load the train and test subsets of the 20 Newsgroups data set from sklearn datasets. Remove the headers, footers and quotes from the news articles when loading the data sets. Use 321 for random_state. In order to get faster execution times for this practical, we will work on a partial data set with only 5 categories out of the 20 available in the data set: 'rec.sport.hockey', 'talk.politics.mideast', 'soc.religion.christian', 'comp.graphics', and 'sci.med'.
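One possible way to load the two subsets (the variable names train and test are just a choice):

categories = ['rec.sport.hockey', 'talk.politics.mideast',
              'soc.religion.christian', 'comp.graphics', 'sci.med']

# Load the train and test subsets, stripping headers, footers and quotes
train = fetch_20newsgroups(subset='train', categories=categories,
                           remove=('headers', 'footers', 'quotes'),
                           random_state=321)
test = fetch_20newsgroups(subset='test', categories=categories,
                          remove=('headers', 'footers', 'quotes'),
                          random_state=321)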

2. Find out the number of news articles in the train and test sets.

3. Convert the train and test sets to dataframes.
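A minimal sketch for exercises 2 and 3; the column names 'text' and 'label' are a choice, but 'label' matches the names used later in exercise 5:

# Number of news articles in each subset
print(len(train.data), len(test.data))

# Put the texts and their (integer) targets into dataframes
df_train = pd.DataFrame({'text': train.data, 'label': train.target})
df_test = pd.DataFrame({'text': test.data, 'label': test.target})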

Train a neural network with a document-term matrix¶

4. In order to feed predictive deep learning models with text data, you first need to turn the text into vectors of numerical values suitable for statistical analysis. Use the binary representation with TfidfVectorizer and create document-term matrices for train and test (name them X_train and X_test).
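A sketch for exercise 4, assuming the dataframes from the previous step; binary=True gives the binary representation, and the vectorizer is fitted on the training texts only:

# Binary document-term matrices for train and test
vectorizer = TfidfVectorizer(binary=True)
X_train = vectorizer.fit_transform(df_train.text)
X_test = vectorizer.transform(df_test.text)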

5. Use the LabelEncoder to create y_train and y_test from df_train.label.values and df_test.label.values, respectively.
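For example:

# Integer-encode the five category labels
encoder = LabelEncoder()
y_train = encoder.fit_transform(df_train.label.values)
y_test = encoder.transform(df_test.label.values)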

6. Use the sequential API in keras to create a neural network with one hidden layer. The first layer will be the input layer, with as many inputs as there are features in your X_train, followed by a single hidden layer and an output layer. Set the number of neurons in the hidden layer to 5 and use the relu activation function. For the output layer you can use a softmax activation function.
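A minimal sketch of such a model, assuming the Keras 2-style API used in the imports above:

# One hidden layer with 5 neurons and relu; the output layer has one neuron per class
input_dim = X_train.shape[1]  # number of features in the document-term matrix

model = Sequential()
model.add(layers.Dense(5, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(5, activation='softmax'))  # 5 news categories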

The sequential API (https://www.tensorflow.org/guide/keras/sequential_model) allows you to create models layer by layer. It is limited in that it does not allow you to create models that share layers or have multiple inputs or outputs.

The functional API (https://www.tensorflow.org/guide/keras/functional) gives you a lot more flexibility, as you can define models in which layers connect to more than just the previous and next layers. In this way, you can connect layers to (literally) any other layer. As a result, creating complex networks such as Siamese neural networks and residual neural networks becomes possible.
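Purely as an illustration of the difference (it is not needed for this practical), the same one-hidden-layer network could also be written with the functional API:

from keras import Input, Model

# Functional API: each layer is called on the output of another layer
inputs = Input(shape=(input_dim,))
hidden = layers.Dense(5, activation='relu')(inputs)
outputs = layers.Dense(5, activation='softmax')(hidden)
functional_model = Model(inputs=inputs, outputs=outputs)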

7. The compile function defines the loss function, the optimizer, and the evaluation metrics. Call this function for your neural network model with loss='categorical_crossentropy', optimizer='adam', and metrics=['accuracy'] (see the table below: categorical_crossentropy is the loss for single-label, multiclass classification). Check the summary of the model.

| Task | Output type | Last-layer activation | Loss function | Metric(s) |
| --- | --- | --- | --- | --- |
| Regression | Numerical | Linear | meanSquaredError (MSE), meanAbsoluteError (MAE) | Same as loss |
| Classification | Binary | Sigmoid | binary_crossentropy | Accuracy, precision, recall, sensitivity, TPR, FPR, ROC, AUC |
| Classification | Single label, multiple classes | Softmax | categorical_crossentropy | Accuracy, confusion matrix |
| Classification | Multiple labels, multiple classes | Sigmoid | binary_crossentropy | Accuracy, precision, recall, sensitivity, TPR, FPR, ROC, AUC |
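A sketch for exercise 7. With the integer labels from the LabelEncoder, one option (assumed here) is to one-hot encode them with utils.to_categorical so that they match categorical_crossentropy; alternatively, sparse_categorical_crossentropy works directly with integer labels:

# One-hot targets for categorical_crossentropy
y_train_onehot = utils.to_categorical(y_train)
y_test_onehot = utils.to_categorical(y_test)

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()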

8. Time to train your model! Train your model for 20 epochs. What does batch_size represent?

Note that if you rerun the fit() method, you will start off with the computed weights from the previous training. Make sure to call clear_session() before you start training the model again:


from keras.backend import clear_session
clear_session()
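
A minimal training call for exercise 8; batch_size is the number of training samples processed before the weights are updated once. The values below (batch_size=10, validation on the test set so that it can be plotted in exercise 9) are just one possible choice, and depending on your Keras version you may first need to convert the sparse document-term matrices to dense arrays (e.g. X_train.toarray()):

history = model.fit(X_train, y_train_onehot,
                    epochs=20,      # 20 passes over the training data
                    batch_size=10,  # weight update after every 10 samples
                    validation_data=(X_test, y_test_onehot))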


9. Plot the accuracy and loss of your trained model.
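One way to plot both curves, assuming the history object returned by fit() above; the val_ keys are only present if validation_data was passed, and in older Keras versions the keys are 'acc' and 'val_acc':

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
# Accuracy per epoch
ax1.plot(history.history['accuracy'], label='train')
ax1.plot(history.history['val_accuracy'], label='test')
ax1.set_title('Accuracy')
ax1.legend()
# Loss per epoch
ax2.plot(history.history['loss'], label='train')
ax2.plot(history.history['val_loss'], label='test')
ax2.set_title('Loss')
ax2.legend()
plt.show()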

10. Evaluate the accuracy of your trained model on the test set. Compare it with the accuracy on the training set.
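For example:

# Accuracy on the training set versus the test set
train_loss, train_acc = model.evaluate(X_train, y_train_onehot, verbose=0)
test_loss, test_acc = model.evaluate(X_test, y_test_onehot, verbose=0)
print('Training accuracy: {:.4f}'.format(train_acc))
print('Test accuracy:     {:.4f}'.format(test_acc))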

The embedding layer¶

11. Use the Tokenizer from Keras with a vocabulary of 20,000 words and create the X_train and X_test sequences.
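A sketch for exercise 11; note that the tokenizer is fitted on the training texts only:

# Turn the texts into sequences of word indices (vocabulary capped at 20,000 words)
tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(df_train.text)

X_train = tokenizer.texts_to_sequences(df_train.text)
X_test = tokenizer.texts_to_sequences(df_test.text)

vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding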

12. Use the pad_sequences() function to pad each text sequence with zeros, so that each vector has the same length of 100 words.
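For example (padding='post' pads at the end of each sequence; the default is 'pre'):

# Pad (or truncate) every sequence to exactly 100 tokens
maxlen = 100
X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)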

13. Now it is time to create a neural network model that uses an embedding layer as input. Take the output of the embedding layer (embedding_dim = 50) and plug it into a Dense layer with 10 neurons and the relu activation function. In order to do this, you have to add a Flatten layer in between that prepares the sequential input for the Dense layer. Note that in the Embedding layer, input_dim is the size of the vocabulary, output_dim is the size of the embedding vector, and input_length is the length of the text sequence.
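A sketch of such a model, reusing vocab_size and maxlen from the previous steps:

embedding_dim = 50

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size,      # size of the vocabulary
                           output_dim=embedding_dim,  # size of the embedding vectors
                           input_length=maxlen))      # length of each padded sequence
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()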

Pretrained word embeddings¶

14. Pretrained word embeddings are embeddings learned on one task that are reused for solving another, similar task. These embeddings are trained on large data sets, saved, and then used for solving other tasks. Here, we are going to use the GloVe embeddings, which are precomputed word embeddings trained on a large corpus of text. For this purpose, we wrote the following function to read the pretrained word embeddings and use the corresponding word vectors for the words in our vocabulary. Download one of the GloVe embeddings (e.g. glove.6B.50d.txt) and create the embedding matrix using the provided function. (Link to download: https://nlp.stanford.edu/projects/glove/)
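The course-provided function is not reproduced in this excerpt; the sketch below shows what such a function typically looks like (the name create_embedding_matrix and its exact behaviour are assumptions, not the author's code):

def create_embedding_matrix(filepath, word_index, embedding_dim):
    # Rows stay zero for words without a pretrained vector; index 0 is the padding index
    vocab_size = len(word_index) + 1
    embedding_matrix = np.zeros((vocab_size, embedding_dim))
    with open(filepath, encoding='utf-8') as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                embedding_matrix[word_index[word]] = np.array(vector, dtype=np.float32)[:embedding_dim]
    return embedding_matrix

embedding_matrix = create_embedding_matrix('glove.6B.50d.txt', tokenizer.word_index, 50)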

15. Build your previous neural network model again, but this time with the initial weights from the pretrained word embeddings. Set the trainable argument to False so that your embedding layer does not learn the word vectors anymore, and then set it back to True. How does the performance change?
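A sketch of the model with pretrained weights; toggling trainable between False and True is the comparison the exercise asks for:

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size,
                           output_dim=50,
                           weights=[embedding_matrix],  # initialise with the GloVe vectors
                           input_length=maxlen,
                           trainable=False))            # set to True to fine-tune the embeddings
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()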