In this practical, we will apply deep learning models for multiclass text classification. We will work with the famous 20 Newsgroups dataset from the sklearn library and build the models with the keras library.
The 20 Newsgroups dataset is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It was originally collected by Ken Lang and has become a popular dataset for experiments in text applications of machine learning techniques.
Also, we will use the keras library, a deep learning and neural networks API developed by François Chollet's team, capable of running on top of TensorFlow (Google), Theano, or CNTK (Microsoft).
Today we will use the following libraries. Make sure you have them installed!
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import fetch_20newsgroups
from sklearn.preprocessing import LabelEncoder
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras import layers, utils
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
1. Load the train and test subsets of the 20 Newsgroups dataset from sklearn datasets. Remove the headers, footers and quotes from the news articles when loading the data. Use the number 321 for random_state.
In order to get faster execution times for this practical we will work on a partial dataset with only 5 categories out of the 20 available: 'rec.sport.hockey', 'talk.politics.mideast', 'soc.religion.christian', 'comp.graphics', and 'sci.med'.
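One possible way to load the two subsets (the variable names newsgroups_train and newsgroups_test are just suggestions):

categories = ['rec.sport.hockey', 'talk.politics.mideast',
              'soc.religion.christian', 'comp.graphics', 'sci.med']

# Strip headers, footers and quotes so the models cannot rely on metadata
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories,
                                      remove=('headers', 'footers', 'quotes'),
                                      random_state=321)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories,
                                     remove=('headers', 'footers', 'quotes'),
                                     random_state=321)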
2. Find out about the number of news articles in the train and test sets.
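A quick check, assuming the variables from step 1:

print('Train articles:', len(newsgroups_train.data))
print('Test articles:', len(newsgroups_test.data))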
3. Convert the train and test sets to dataframes.
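A minimal sketch, mapping the numeric targets back to category names so that the label column is readable:

df_train = pd.DataFrame({'text': newsgroups_train.data,
                         'label': [newsgroups_train.target_names[i] for i in newsgroups_train.target]})
df_test = pd.DataFrame({'text': newsgroups_test.data,
                        'label': [newsgroups_test.target_names[i] for i in newsgroups_test.target]})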
4. In order to feed predictive deep learning models with text data, you first need to turn the text into vectors of numerical values suitable for statistical analysis. Use the binary representation with TfidfVectorizer and create document-term matrices for train and test (name them X_train and X_test).
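One way to do this (binary=True switches the term weights to 0/1 indicators):

vectorizer = TfidfVectorizer(binary=True)
X_train = vectorizer.fit_transform(df_train.text)   # learn the vocabulary on train only
X_test = vectorizer.transform(df_test.text)         # reuse the train vocabulary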
5. Use the LabelEncoder to create y_train and y_test from df_train.label.values and df_test.label.values, respectively.
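For example:

encoder = LabelEncoder()
y_train = encoder.fit_transform(df_train.label.values)
y_test = encoder.transform(df_test.label.values)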
6. Use the sequential API in keras and create a one-hidden-layer neural network. The first layer will be the input layer with the number of features in your X_train, followed by a single hidden layer and an output layer. Set the number of neurons in the hidden layer to 5 and the activation function to relu. For the output layer you can use a softmax activation function.
The sequential API (https://www.tensorflow.org/guide/keras/sequential_model) allows you to create models layer by layer. It is limited in that it does not allow you to create models that share layers or have multiple inputs or outputs.
The functional API (https://www.tensorflow.org/guide/keras/functional) allows you to create models with a lot more flexibility, as you can define models where layers connect to more than just the previous and next layers. In this way, you can connect layers to (literally) any other layer. As a result, creating complex networks such as Siamese neural networks and residual neural networks becomes possible.
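A minimal sketch for step 6, assuming X_train from step 4; the output layer has one neuron per category:

input_dim = X_train.shape[1]   # number of features produced by the vectorizer

model = Sequential()
model.add(layers.Dense(5, input_dim=input_dim, activation='relu'))   # hidden layer with 5 neurons
model.add(layers.Dense(5, activation='softmax'))                     # one output per class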
7. The compile function defines the loss function, the optimizer and the evaluation metrics. Call this function for your neural network model with loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']. Check the summary of the model.
Task | Output type | Last-layer activation | Loss function | Metric(s) |
---|---|---|---|---|
Regression | Numerical | Linear | meanSquaredError (MSE), meanAbsoluteError (MAE) | Same as loss |
Classification | Binary | Sigmoid | binary_crossentropy | Accuracy, precision, recall, sensitivity, TPR, FPR, ROC, AUC |
Classification | Single label, multiple classes | Softmax | categorical_crossentropy | Accuracy, confusion matrix |
Classification | Multiple labels, multiple classes | Sigmoid | binary_crossentropy | Accuracy, precision, recall, sensitivity, TPR, FPR, ROC, AUC |
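A sketch for step 7. The task text asks for binary_crossentropy, but with the integer labels produced by LabelEncoder in step 5 and a softmax output over 5 classes, the loss that matches the table above is (sparse_)categorical_crossentropy; the sketch below therefore uses sparse_categorical_crossentropy, so adapt it if you follow the task literally or if you one-hot encode the labels:

model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()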
8. Time to train your model! Train it for 20 iterations (epochs). What does batch_size represent?
Note that if you rerun the fit() method, you will start off with the computed weights from the previous training. Make sure to call clear_session() before you start training the model again:
from keras.backend import clear_session
clear_session()
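A possible training call; batch_size=32 is just an illustrative value, and if your Keras version does not accept scipy sparse matrices directly you may need to convert X_train and X_test with .toarray() first:

history = model.fit(X_train, y_train,
                    epochs=20,                          # 20 passes over the training data
                    batch_size=32,                      # samples per gradient update (assumed value)
                    validation_data=(X_test, y_test),
                    verbose=1)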
9. Plot the accuracy and loss of your trained model.
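A plotting sketch based on the history object returned by fit(); depending on your Keras version the metric keys may be 'acc'/'val_acc' instead of 'accuracy'/'val_accuracy':

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.title('Accuracy')
plt.xlabel('Epoch')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.title('Loss')
plt.xlabel('Epoch')
plt.legend()

plt.show()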
10. Evaluate the accuracy of your trained model on the test set. Compare it with the accuracy on the train set.
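For example:

train_loss, train_acc = model.evaluate(X_train, y_train, verbose=0)
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print('Train accuracy:', train_acc)
print('Test accuracy:', test_acc)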
11. Use the tokenizer from Keras with 20,000 words and create X_train and X_test sequences.
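A possible sketch; note that the tokenizer is fitted on the training texts only:

tokenizer = Tokenizer(num_words=20000)        # keep the 20,000 most frequent words
tokenizer.fit_on_texts(df_train.text)

X_train = tokenizer.texts_to_sequences(df_train.text)
X_test = tokenizer.texts_to_sequences(df_test.text)

vocab_size = len(tokenizer.word_index) + 1    # +1 because index 0 is reserved for padding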
12. Use the pad_sequences() function to pad each text sequence with zeros, so that each vector has the same length of 100 words.
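For example (padding='post' appends the zeros at the end of each sequence; padding at the front is the default):

maxlen = 100   # every sequence is truncated or zero-padded to 100 tokens

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)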
13. Now it is time to create a neural network model using an embedding layer as input. Take the output of the embedding layer (embedding_dim = 50) and plug it into a Dense layer with 10 neurons and the relu activation function. In order to do this, you have to add a Flatten layer in between that prepares the sequential input for the Dense layer. Note that in the Embedding layer, input_dim is the size of the vocabulary, output_dim is the size of the embedding vectors, and input_length is the length of the text sequences.
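A sketch of such a model, assuming vocab_size and maxlen from steps 11 and 12 (the loss mirrors the earlier sketch):

embedding_dim = 50

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size,       # size of the vocabulary
                           output_dim=embedding_dim,   # size of each embedding vector
                           input_length=maxlen))       # length of the padded sequences
model.add(layers.Flatten())                            # (maxlen, embedding_dim) -> one long vector
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(5, activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()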
14. Pretrained word embeddings are embeddings learned on one task that are used for solving another similar task. These embeddings are trained on large datasets, saved, and then reused for other tasks. Here, we are going to use the GloVe embeddings, which are precomputed word embeddings trained on a large corpus of text. For this purpose, we wrote the following function to apply to the pretrained word embeddings and pick out the word vectors for the words in our vocabulary. Download one of the GloVe embeddings (e.g. glove.6B.50d.txt) and create the embedding matrix using the provided function. (Link to download: https://nlp.stanford.edu/projects/glove/)
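Since the helper itself is not reproduced here, below is a minimal sketch of what such a function typically looks like; the name create_embedding_matrix and its signature are assumptions, not the original code:

def create_embedding_matrix(filepath, word_index, embedding_dim):
    # Hypothetical helper: builds a (vocab_size, embedding_dim) matrix with the
    # GloVe vector of every word in our vocabulary (rows of unknown words stay zero).
    vocab_size = len(word_index) + 1              # +1 for the reserved 0 index
    embedding_matrix = np.zeros((vocab_size, embedding_dim))
    with open(filepath, encoding='utf8') as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                embedding_matrix[word_index[word]] = np.array(vector, dtype=np.float32)[:embedding_dim]
    return embedding_matrix

embedding_matrix = create_embedding_matrix('glove.6B.50d.txt',
                                            tokenizer.word_index, embedding_dim=50)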
15. Build your previous neural network model again, but this time with the initial weights from the pretrained word embeddings. Set the trainable argument to False so that your embedding layer does not learn the word vectors anymore, and then set it back to True. How does the performance change?
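A sketch of the model with GloVe weights; in older Keras versions the weights argument shown below is the usual way to pass the matrix, while newer versions may require embeddings_initializer=keras.initializers.Constant(embedding_matrix) instead:

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size,
                           output_dim=50,
                           weights=[embedding_matrix],   # initialise with the GloVe vectors
                           input_length=maxlen,
                           trainable=False))             # set to True for the second run
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(5, activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])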