In this practical, we are going to work with BERT! More specifically, we are going to perform sentiment analysis of movie reviews using a transformer model, have a look under its hood, and try to explain the model predictions using SHAP.
Our BERT-variant of choice is DistilBERT, a light-weight transformer whose performance is comparable to Google's BERT base model. From the authors:
[W]e leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster.
In Part 1, we will use an off-the-shelf sentiment analysis pipeline from the Hugging Face transformers module to classify two movie reviews.
In Part 2, we will disassemble the sentiment analysis pipeline by performing the same analysis as in Part 1 step by step.
In Part 3, we will open the black box and explore which tokens were most important for DistilBERT's sentiment classification. We do this using Shapley Additive Explanations (SHAP).
In Part 4, we will fine-tune DistilBERT on the IMDB movie review dataset.
Fine-tuning a transformer model is quite resource-intensive! Switch your runtime type to GPU T4 under Runtime > Change runtime type.
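If you want to confirm from Python that the GPU runtime is active, you can run a quick check (a minimal sketch; torch comes preinstalled in Colab):
import torch # PyTorch is preinstalled in Google Colab
print("GPU available:", torch.cuda.is_available()) # True if a CUDA device is visible
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0)) # e.g. the name of the T4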
Running this practical requires a more recent version of the accelerate package than the one installed by default in Google Colab. Run the code below to upgrade accelerate.
!pip install -q -U accelerate # update accelerate
Now restart your runtime under Runtime > Restart runtime (or press Ctrl + M followed by .) and click Yes in the pop-up message.
All set? 🙂
Since sentiment analysis is a popular application, there are off-the-shelf pipelines which we can use to quickly classify documents by sentiment. One such pipeline is part of the Hugging Face transformers module.
🤗 Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch. [ ... ] The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. See the task summary for examples of use. [ ... ] The `pipeline()` is the most powerful object encapsulating all other pipelines.
We install the transformers module from which we import pipeline.
!pip install -q transformers
!pip install -q Xformers
from transformers import pipeline
#for reproducibility
from transformers import set_seed
import random
import numpy as np
seed = 137
set_seed(seed)
random.seed(seed)
np.random.seed(seed)
Pre-trained BERT models are available for many different natural language processing tasks based on the General Language Understanding Evaluation (GLUE) benchmark resources.
To showcase how to use the sentiment analysis pipeline, we will compare two relatively complex IMDB reviews of Mark Mylod's 2022 movie The Menu. Load the following two reviews:
review1 = "The Menu isn't the first to satirise the rich and their incompetence and isn't saying anything new \
but that definitely doesn't prevent it from being a great satire that pokes fun at everything it can in ways that \
are often consistently funny, playful and extremely stylish. Ralph Fiennes gives a terrific performance full of awkward\
unease that only enhances his commanding screen presence. Anya Taylor-Joy is a perfect audience surrogate amongst a sea\
of deliberately unlikeable characters of which the best is Nicholas Hoult whose almost too good at making his character\
hilariously pathetic. Mark Mylod's direction is excellent, the film has more than enough visual style to match the \
pretentiousness of its characters and is really good at building tension. The music by Colin Stetson is fantastic, \
striking a unusual balance between beautiful and unnerving."
review2 = "This looked like an interesting film based on the trailer and the first half of it was just that. \
The tension and suspense was building nicely. There were little dribs and drabs and hints of what might be coming \
without being too obvious. The acting from everyone in the film was good. Even supporting characters with only a few \
lines. Were well realized I remember thinking that I couldn't wait to see where it was all going. Sadly it didn't \
really go anywhere. It all unwound in the second half. The acting was still on but the writing failed. That's the most \
i can say without giving up any spoilers. And that was extra disappointing because the first half was so good. This \
Menu did not deliver the meal as advertised."
You can skim the reviews.
print(review1)
The Menu isn't the first to satirise the rich and their incompetence and isn't saying anything new but that definitely doesn't prevent it from being a great satire that pokes fun at everything it can in ways that are often consistently funny, playful and extremely stylish. Ralph Fiennes gives a terrific performance full of awkwardunease that only enhances his commanding screen presence. Anya Taylor-Joy is a perfect audience surrogate amongst a seaof deliberately unlikeable characters of which the best is Nicholas Hoult whose almost too good at making his characterhilariously pathetic. Mark Mylod's direction is excellent, the film has more than enough visual style to match the pretentiousness of its characters and is really good at building tension. The music by Colin Stetson is fantastic, striking a unusual balance between beautiful and unnerving.
print(review2)
This looked like an interesting film based on the trailer and the first half of it was just that. The tension and suspense was building nicely. There were little dribs and drabs and hints of what might be coming without being too obvious. The acting from everyone in the film was good. Even supporting characters with only a few lines. Were well realized I remember thinking that I couldn't wait to see where it was all going. Sadly it didn't really go anywhere. It all unwound in the second half. The acting was still on but the writing failed. That's the most i can say without giving up any spoilers. And that was extra disappointing because the first half was so good. This Menu did not deliver the meal as advertised.
What is your guess of the sentiment of these reviews? On a scale of 1 to 10, what rating do you think the respective authors gave the movie?
1. Set up and fit a sentiment analysis pipeline to predict the sentiment of the two reviews. Define the model as 'distilbert-base-uncased-finetuned-sst-2-english'.
Our BERT model will be DistilBERT base uncased. Uncased means that the model disregards casing (upper or lower case). In particular, we use a DistilBERT version which has been fine-tuned for binary sentiment classification using the Stanford Sentiment Treebank (SST-2; Socher et al., 2013) corpus.
sentiment_pipeline = pipeline("sentiment-analysis", model = 'distilbert-base-uncased-finetuned-sst-2-english')
sentiment_pipeline(review1) # predict sentiment
[{'label': 'POSITIVE', 'score': 0.9983012080192566}]
sentiment_pipeline(review2) # predict sentiment
[{'label': 'NEGATIVE', 'score': 0.9622442722320557}]
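Note that the pipeline also accepts a list of texts, so both reviews can be scored in a single call (a small sketch; the results match the two outputs above):
sentiment_pipeline([review1, review2]) # returns one {'label', 'score'} dict per review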
Now we are going to show you how to build your own sentiment analysis pipeline from scratch. In practice, you can use the existing one as we just did above, but it is helpful to understand the steps involved in setting up a transformer-based pipeline for other applications you might work on.
We perform the same sentiment analysis on the same two reviews - this time step-by-step.
2. Define the tokenizer and model. For the tokenizer, use the pretrained DistilBERT tokenizer, and for the model use distilbert-base-uncased-finetuned-sst-2-english.
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
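As a quick illustration of what "uncased" means, the tokenizer lowercases text before splitting it, so casing does not change the tokens (a minimal sketch using the tokenizer we just defined):
print(tokenizer.tokenize("The Menu")) # ['the', 'menu']
print(tokenizer.tokenize("THE MENU")) # ['the', 'menu'] -- identical after lowercasing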
3. Tokenize the review1 and review2 objects. Pad and truncate the sequences, and return PyTorch (pt) tensors. Save the output object as encoding.
encoding = tokenizer([review1, review2], padding = True, truncation = True, return_tensors = 'pt') # tokenize the reviews
BERT and several other transformer models use tokenizers based on WordPiece, a subword tokenization algorithm. The main advantage of a subword tokenizer is that it interpolates between word-based and character-based tokenization. Common words get a slot in the vocabulary, but the tokenizer can fall back to word pieces and individual characters for unknown words.
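For example, 'satirise' from the first review is not in the vocabulary as a single unit and gets split into word pieces (continuation pieces are prefixed with ##), while a common word like 'funny' keeps its own slot (a small sketch using the tokenizer defined above):
print(tokenizer.tokenize("satirise")) # ['sat', '##iri', '##se'] -- split into subword pieces
print(tokenizer.tokenize("funny"))    # ['funny'] -- common word with its own vocabulary slot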
Since batched inputs (our reviews) are of different lengths, they cannot be converted to fixed-size tensors to be fed to the model.
There are two main strategies for solving this problem -- padding and truncation.
In order to create rectangular tensors from batches of varying lengths, padding adds a special padding token to ensure shorter sequences will have the same length as either the longest sequence in a batch or the maximum length accepted by the model. Truncation works in the other direction by truncating long sequences.
`padding = True`: pad to the longest sequence in the batch (no padding is applied if you only provide a single sequence). `truncation = True`: truncate to a maximum length specified by the max_length argument or the maximum length accepted by the model if no max_length is provided (max_length=None).
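You can verify the effect of padding on our batch: the input_ids tensor is rectangular, with both rows padded to the length of the longer review (a small sketch using the encoding object defined above):
print(encoding['input_ids'].shape)   # (2, length in tokens of the longest review)
print(encoding['input_ids'][1][-5:]) # trailing [PAD] tokens (id 0) if the second review is shorter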
4. Inspect the encoding object by printing the first review's input ids.
print(encoding['input_ids'][0]) # first review's input_ids
tensor([ 101, 1996, 12183, 3475, 1005, 1056, 1996, 2034, 2000, 2938, 15735, 3366, 1996, 4138, 1998, 2037, 4297, 25377, 12870, 5897, 1998, 3475, 1005, 1056, 3038, 2505, 2047, 2021, 2008, 5791, 2987, 1005, 1056, 4652, 2009, 2013, 2108, 1037, 2307, 18312, 2008, 26202, 2015, 4569, 2012, 2673, 2009, 2064, 1999, 3971, 2008, 2024, 2411, 10862, 6057, 1010, 18378, 1998, 5186, 2358, 8516, 4509, 1012, 6798, 10882, 24336, 2015, 3957, 1037, 27547, 2836, 2440, 1997, 9596, 9816, 11022, 2008, 2069, 11598, 2015, 2010, 7991, 3898, 3739, 1012, 21728, 4202, 1011, 6569, 2003, 1037, 3819, 4378, 7505, 21799, 5921, 1037, 2712, 11253, 9969, 4406, 3085, 3494, 1997, 2029, 1996, 2190, 2003, 6141, 7570, 11314, 3005, 2471, 2205, 2204, 2012, 2437, 2010, 2839, 26415, 9488, 27191, 17203, 1012, 2928, 2026, 4135, 2094, 1005, 1055, 3257, 2003, 6581, 1010, 1996, 2143, 2038, 2062, 2084, 2438, 5107, 2806, 2000, 2674, 1996, 3653, 6528, 20771, 2791, 1997, 2049, 3494, 1998, 2003, 2428, 2204, 2012, 2311, 6980, 1012, 1996, 2189, 2011, 6972, 26261, 25656, 2003, 10392, 1010, 8478, 1037, 5866, 5703, 2090, 3376, 1998, 4895, 3678, 6455, 1012, 102])
We see that BERT assigns a unique id to each token (input_ids).
5. Convert the first review's input ids to tokens using convert_ids_to_tokens to see how the text got tokenized.
print(tokenizer.convert_ids_to_tokens(encoding['input_ids'][0])) # first review's tokens
['[CLS]', 'the', 'menu', 'isn', "'", 't', 'the', 'first', 'to', 'sat', '##iri', '##se', 'the', 'rich', 'and', 'their', 'inc', '##omp', '##ete', '##nce', 'and', 'isn', "'", 't', 'saying', 'anything', 'new', 'but', 'that', 'definitely', 'doesn', "'", 't', 'prevent', 'it', 'from', 'being', 'a', 'great', 'satire', 'that', 'poke', '##s', 'fun', 'at', 'everything', 'it', 'can', 'in', 'ways', 'that', 'are', 'often', 'consistently', 'funny', ',', 'playful', 'and', 'extremely', 'st', '##yl', '##ish', '.', 'ralph', 'fi', '##enne', '##s', 'gives', 'a', 'terrific', 'performance', 'full', 'of', 'awkward', '##une', '##ase', 'that', 'only', 'enhance', '##s', 'his', 'commanding', 'screen', 'presence', '.', 'anya', 'taylor', '-', 'joy', 'is', 'a', 'perfect', 'audience', 'sur', '##rogate', 'amongst', 'a', 'sea', '##of', 'deliberately', 'unlike', '##able', 'characters', 'of', 'which', 'the', 'best', 'is', 'nicholas', 'ho', '##ult', 'whose', 'almost', 'too', 'good', 'at', 'making', 'his', 'character', '##hila', '##rio', '##usly', 'pathetic', '.', 'mark', 'my', '##lo', '##d', "'", 's', 'direction', 'is', 'excellent', ',', 'the', 'film', 'has', 'more', 'than', 'enough', 'visual', 'style', 'to', 'match', 'the', 'pre', '##ten', '##tious', '##ness', 'of', 'its', 'characters', 'and', 'is', 'really', 'good', 'at', 'building', 'tension', '.', 'the', 'music', 'by', 'colin', 'ste', '##tson', 'is', 'fantastic', ',', 'striking', 'a', 'unusual', 'balance', 'between', 'beautiful', 'and', 'un', '##ner', '##ving', '.', '[SEP]']
Note that BERT-based models also operate with special tokens:
| Token | Token ID | Meaning |
|---|---|---|
| [CLS] | 101 | Beginning of input |
| [SEP] | 102 | End of input or sentence |
| [MASK] | 103 | Masked tokens the model should predict |
| [PAD] | 0 | Padding |
| [UNK] | 100 | Unknown token not in training data |
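You can look these special tokens up directly on the tokenizer (a small sketch; the ids match the table above):
print(tokenizer.special_tokens_map) # special tokens used by this tokenizer
print(tokenizer.convert_tokens_to_ids(['[CLS]', '[SEP]', '[MASK]', '[PAD]', '[UNK]'])) # [101, 102, 103, 0, 100]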
6. Predict the sentiment of the two reviews. In order to do this, import torch, and define the output object using the model, input ids and attention mask.
Now we are ready to do some sentiment prediction. We import torch and call our model with input_ids and attention_mask. The attention mask is a binary tensor indicating the positions of the padded indices so that the model does not attend to them.
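You can inspect the mask directly: positions holding real tokens are 1 and padded positions are 0 (a small sketch using the encoding from above):
print(encoding['attention_mask'][0]) # first (longest) review: no padding, all ones
print(encoding['attention_mask'][1]) # second review: trailing zeros where [PAD] tokens were added (if any)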
# prediction of sentiment
import torch
output = model(input_ids = encoding['input_ids'], attention_mask = encoding['attention_mask'])
print("Predicted logits:\n\n", output['logits']) # logits
Predicted logits: tensor([[-3.1107, 3.2654], [ 1.8161, -1.4221]], grad_fn=<AddmmBackward0>)
print("Predicted probabilities:\n\n", torch.nn.functional.softmax(output['logits'], dim=-1)) # from logits to probabilities
Predicted probabilities: tensor([[0.0017, 0.9983], [0.9622, 0.0378]], grad_fn=<SoftmaxBackward0>)
prediction = torch.argmax(output['logits'], 1) # from logits to binary class
print("Predicted classes:\n", prediction)
Predicted classes: tensor([1, 0])
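The mapping from class index to label is stored in the model configuration, so we can recover the same labels the pipeline reported (a small sketch):
print(model.config.id2label) # {0: 'NEGATIVE', 1: 'POSITIVE'}
print([model.config.id2label[int(i)] for i in prediction]) # ['POSITIVE', 'NEGATIVE']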
How do the output sentiments and probabilities compare to the off-the-shelf sentiment classification pipeline we used in Part 1?
Now that we have classified our two reviews, we might want to explain DistilBERT's predictions using Shapley Additive Explanations (SHAP).
7. Install the shap module, import shap.Explainer, and feed it the sentiment_pipeline model. Pass the two movie reviews as input for the explainer.
Note. Computing the Shapley values for DistilBERT on our two reviews should take about 5 minutes, but it can be very computationally intensive in most real-life applications.
!pip install -q shap
import shap
explainer = shap.Explainer(sentiment_pipeline)
shap_values = explainer([review1, review2])
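Before plotting, you can peek at the returned Explanation object: per review it stores the tokens (.data), one SHAP value per token and output class (.values), and the expected model output used as a baseline (.base_values). A small sketch:
print(shap_values[0].data[:10])    # first few tokens of the first review
print(shap_values[0].values.shape) # (number of tokens, number of output classes)
print(shap_values[0].base_values)  # baseline (expected) model output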
8. A nice thing about the shap module is that it comes with a built-in visualizer. Use shap.plots.text to visualize the SHAP values for the first and the second movie review.
shap.plots.text(shap_values[0]) # first review