
Practical 10: Transformers for Sentiment Analysis¶

Daniel Anadria¶


Applied Text Mining - Utrecht Summer School¶

In this practical, we are going to work with BERT! More specifically, we are going to perform sentiment analysis of movie reviews using a transformer model, have a look under its hood, and try to explain the model predictions using SHAP.

Our BERT-variant of choice is DistilBERT, a light-weight transformer whose performance is comparable to Google's BERT base model. From the authors:

[W]e leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster.

Overview¶

In Part 1, we will use an off-the-shelf sentiment analysis pipeline from the Hugging Face transformers module to classify two movie reviews.

In Part 2, we will disassemble the sentiment analysis pipeline by performing the same analysis as in Part 1 step-by-step.

In Part 3, we will open the black box and explore which tokens were most important for DistilBERT's sentiment classification. We do this using Shapley Additive Explanations (SHAP).

In Part 4, we will fine-tune DistilBERT on the IMDB movie review dataset.

Prepare the Colab Environment¶

Fine-tuning a transformer model is quite resource-intensive! Switch your runtime type to GPU T4 under Runtime > Change runtime type.

Running this practical requires a more recent version of the accelerate package than installed by default in Google Colab. Run the code below to upgrade accelerate.

In [1]:
!pip install -q -U accelerate # update accelerate

Now restart your runtime under Runtime > Restart runtime or by pressing ctrl + M . and clicking Yes in the pop-up message.

All set? 🙂

Part 1: Off-the-shelf sentiment analysis pipeline¶

Since sentiment analysis is a popular application, there are off-the-shelf pipelines which we can use to quickly classify documents by sentiment. One such pipeline is part of the Hugging Face transformers module.

🤗 Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch. [ ... ] The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. See the task summary for examples of use. [ ... ] The `pipeline()` is the most powerful object encapsulating all other pipelines.

We install the transformers module from which we import pipeline.

In [1]:
!pip install -q transformers
!pip install -q Xformers
from transformers import pipeline
In [2]:
#for reproducibility

from transformers import set_seed
import random
import numpy as np
seed = 137
set_seed(seed)
random.seed(seed)
np.random.seed(seed)

Fine-tuned BERT models are available for many natural language processing tasks, including the tasks in the General Language Understanding Evaluation (GLUE) benchmark.

To showcase how to use the sentiment analysis pipeline, we will compare two relatively complex IMDB reviews of Mark Mylod's movie The Menu (2022). Load the following two reviews:

In [3]:
review1 = "The Menu isn't the first to satirise the rich and their incompetence and isn't saying anything new \
but that definitely doesn't prevent it from being a great satire that pokes fun at everything it can in ways that \
are often consistently funny, playful and extremely stylish. Ralph Fiennes gives a terrific performance full of awkward\
unease that only enhances his commanding screen presence. Anya Taylor-Joy is a perfect audience surrogate amongst a sea\
of deliberately unlikeable characters of which the best is Nicholas Hoult whose almost too good at making his character\
hilariously pathetic. Mark Mylod's direction is excellent, the film has more than enough visual style to match the \
pretentiousness of its characters and is really good at building tension. The music by Colin Stetson is fantastic, \
striking a unusual balance between beautiful and unnerving."
In [4]:
review2 = "This looked like an interesting film based on the trailer and the first half of it was just that. \
The tension and suspense was building nicely. There were little dribs and drabs and hints of what might be coming \
without being too obvious. The acting from everyone in the film was good. Even supporting characters with only a few \
lines. Were well realized I remember thinking that I couldn't wait to see where it was all going. Sadly it didn't \
really go anywhere. It all unwound in the second half. The acting was still on but the writing failed. That's the most \
i can say without giving up any spoilers. And that was extra disappointing because the first half was so good. This \
Menu did not deliver the meal as advertised."

You can skim the reviews.

In [5]:
print(review1)
The Menu isn't the first to satirise the rich and their incompetence and isn't saying anything new but that definitely doesn't prevent it from being a great satire that pokes fun at everything it can in ways that are often consistently funny, playful and extremely stylish. Ralph Fiennes gives a terrific performance full of awkwardunease that only enhances his commanding screen presence. Anya Taylor-Joy is a perfect audience surrogate amongst a seaof deliberately unlikeable characters of which the best is Nicholas Hoult whose almost too good at making his characterhilariously pathetic. Mark Mylod's direction is excellent, the film has more than enough visual style to match the pretentiousness of its characters and is really good at building tension. The music by Colin Stetson is fantastic, striking a unusual balance between beautiful and unnerving.
In [6]:
print(review2)
This looked like an interesting film based on the trailer and the first half of it was just that. The tension and suspense was building nicely. There were little dribs and drabs and hints of what might be coming without being too obvious. The acting from everyone in the film was good. Even supporting characters with only a few lines. Were well realized I remember thinking that I couldn't wait to see where it was all going. Sadly it didn't really go anywhere. It all unwound in the second half. The acting was still on but the writing failed. That's the most i can say without giving up any spoilers. And that was extra disappointing because the first half was so good. This Menu did not deliver the meal as advertised.

What is your guess about the sentiment of these two reviews? On a scale of 1-10, what rating do you think the respective authors gave the movie?

1. Set up a sentiment analysis pipeline and use it to predict the sentiment of the two reviews. Define the model as 'distilbert-base-uncased-finetuned-sst-2-english'.

Our BERT model will be DistilBERT base uncased. Uncased means that the model disregards casing (upper or lower case): the tokenizer lowercases the text before splitting it into tokens. In particular, we use a DistilBERT version which has been fine-tuned for binary sentiment classification on the Stanford Sentiment Treebank (SST-2; Socher et al., 2013) corpus, which builds on the movie review data of Pang and Lee (2005).

In [7]:
sentiment_pipeline = pipeline("sentiment-analysis", model = 'distilbert-base-uncased-finetuned-sst-2-english')
In [8]:
sentiment_pipeline(review1) # predict sentiment
Out[8]:
[{'label': 'POSITIVE', 'score': 0.9983012080192566}]
In [9]:
sentiment_pipeline(review2) # predict sentiment
Out[9]:
[{'label': 'NEGATIVE', 'score': 0.9622442722320557}]

For each review, we see the predicted label and the probability the model assigns to that label (score).

Do you agree with the model's predictions?

Here is the ground truth:

Review 1 is a positive review with a rating of 8/10.

Review 2 is a negative review with a rating of 4/10.

Does this match your human prediction?

Now we are going to show you how to build your own sentiment analysis pipeline from scratch. In practice, you can use the existing pipeline as we just did, but it helps to understand the steps involved in setting up a transformer-based pipeline for other applications you might work on.

Part 2: Sentiment Analysis Pipeline - Deconstructed¶

We perform the same sentiment analysis on the same two reviews - this time step-by-step.

2. Define the tokenizer and model. For the tokenizer, use the pretrained DistilBERT tokenizer and for the model use distilbert-base-uncased-finetuned-sst-2-english.

In [10]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

3. Tokenize the review1 and review2 objects. Pad and truncate the sequences, and return PyTorch (pt) tensors. Save the output object as encoding.

In [11]:
encoding = tokenizer([review1, review2], padding = True, truncation = True, return_tensors = 'pt') # tokenize the reviews

BERT and several other transformer models use tokenizers based on WordPiece, a subword tokenization algorithm. The main advantage of a subword tokenizer is that it interpolates between word-based and character-based tokenization. Common words get a slot in the vocabulary, but the tokenizer can fall back to word pieces and individual characters for unknown words.
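The fallback to word pieces can be sketched with a toy greedy longest-match-first tokenizer. This is only an illustration: `TOY_VOCAB` and the `wordpiece` function are hypothetical, and the real DistilBERT vocabulary is far larger and also contains individual characters, so it rarely needs `[UNK]`.

```python
# Toy greedy longest-match-first WordPiece tokenizer (illustrative only).
# TOY_VOCAB is a hypothetical, hand-picked vocabulary, not DistilBERT's real one.
TOY_VOCAB = {"the", "menu", "sat", "##iri", "##se", "satire", "[UNK]"}


def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Split one lowercased word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces


print(wordpiece("satire"))    # found in the toy vocabulary as a whole word
print(wordpiece("satirise"))  # falls back to word pieces
print(wordpiece("xyz"))       # cannot be built from the toy vocabulary
```

With this toy vocabulary, "satirise" is split into a word-initial piece followed by `##`-prefixed continuation pieces, which mirrors how uncommon words are handled by the real tokenizer.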

Since batched inputs (our reviews) are of different lengths, they cannot be converted directly to fixed-size tensors to be fed to the model.

There are two main strategies for solving this problem -- padding and truncation.

In order to create rectangular tensors from batches of varying lengths, padding adds a special padding token to ensure shorter sequences will have the same length as either the longest sequence in a batch or the maximum length accepted by the model. Truncation works in the other direction by truncating long sequences.

`padding = True`: pad to the longest sequence in the batch (no padding is applied if you only provide a single sequence). `truncation = True`: truncate to a maximum length specified by the max_length argument or the maximum length accepted by the model if no max_length is provided (max_length=None).
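To make the two strategies concrete, here is a minimal pure-Python sketch of padding and truncation on lists of token ids. The function `pad_and_truncate` and the short sequences are made up for illustration. Unlike the real tokenizer, this sketch truncates by simple slicing and does not preserve the closing special token.

```python
PAD_ID = 0  # id of the [PAD] token in the BERT vocabulary


def pad_and_truncate(batch, max_length=None, pad_id=PAD_ID):
    """Return rectangular input_ids and a matching attention_mask."""
    if max_length is not None:
        batch = [ids[:max_length] for ids in batch]  # truncation
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[101, 1996, 12183, 102], [101, 2307, 102]]  # made-up short sequences
encoded = pad_and_truncate(toy_batch)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The shorter sequence is extended with the padding id until both rows have the same length, and the attention mask records which positions hold real tokens.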

4. Inspect the encoding object by printing the first review's input ids.

In [12]:
print(encoding['input_ids'][0]) # first review's input_ids
tensor([  101,  1996, 12183,  3475,  1005,  1056,  1996,  2034,  2000,  2938,
        15735,  3366,  1996,  4138,  1998,  2037,  4297, 25377, 12870,  5897,
         1998,  3475,  1005,  1056,  3038,  2505,  2047,  2021,  2008,  5791,
         2987,  1005,  1056,  4652,  2009,  2013,  2108,  1037,  2307, 18312,
         2008, 26202,  2015,  4569,  2012,  2673,  2009,  2064,  1999,  3971,
         2008,  2024,  2411, 10862,  6057,  1010, 18378,  1998,  5186,  2358,
         8516,  4509,  1012,  6798, 10882, 24336,  2015,  3957,  1037, 27547,
         2836,  2440,  1997,  9596,  9816, 11022,  2008,  2069, 11598,  2015,
         2010,  7991,  3898,  3739,  1012, 21728,  4202,  1011,  6569,  2003,
         1037,  3819,  4378,  7505, 21799,  5921,  1037,  2712, 11253,  9969,
         4406,  3085,  3494,  1997,  2029,  1996,  2190,  2003,  6141,  7570,
        11314,  3005,  2471,  2205,  2204,  2012,  2437,  2010,  2839, 26415,
         9488, 27191, 17203,  1012,  2928,  2026,  4135,  2094,  1005,  1055,
         3257,  2003,  6581,  1010,  1996,  2143,  2038,  2062,  2084,  2438,
         5107,  2806,  2000,  2674,  1996,  3653,  6528, 20771,  2791,  1997,
         2049,  3494,  1998,  2003,  2428,  2204,  2012,  2311,  6980,  1012,
         1996,  2189,  2011,  6972, 26261, 25656,  2003, 10392,  1010,  8478,
         1037,  5866,  5703,  2090,  3376,  1998,  4895,  3678,  6455,  1012,
          102])

We see that each token is represented by an integer id from the tokenizer's vocabulary (input_ids).

5. Convert the first review's input ids to tokens using convert_ids_to_tokens to see how the text got tokenized.

In [13]:
print(tokenizer.convert_ids_to_tokens(encoding['input_ids'][0])) # first review's tokens
['[CLS]', 'the', 'menu', 'isn', "'", 't', 'the', 'first', 'to', 'sat', '##iri', '##se', 'the', 'rich', 'and', 'their', 'inc', '##omp', '##ete', '##nce', 'and', 'isn', "'", 't', 'saying', 'anything', 'new', 'but', 'that', 'definitely', 'doesn', "'", 't', 'prevent', 'it', 'from', 'being', 'a', 'great', 'satire', 'that', 'poke', '##s', 'fun', 'at', 'everything', 'it', 'can', 'in', 'ways', 'that', 'are', 'often', 'consistently', 'funny', ',', 'playful', 'and', 'extremely', 'st', '##yl', '##ish', '.', 'ralph', 'fi', '##enne', '##s', 'gives', 'a', 'terrific', 'performance', 'full', 'of', 'awkward', '##une', '##ase', 'that', 'only', 'enhance', '##s', 'his', 'commanding', 'screen', 'presence', '.', 'anya', 'taylor', '-', 'joy', 'is', 'a', 'perfect', 'audience', 'sur', '##rogate', 'amongst', 'a', 'sea', '##of', 'deliberately', 'unlike', '##able', 'characters', 'of', 'which', 'the', 'best', 'is', 'nicholas', 'ho', '##ult', 'whose', 'almost', 'too', 'good', 'at', 'making', 'his', 'character', '##hila', '##rio', '##usly', 'pathetic', '.', 'mark', 'my', '##lo', '##d', "'", 's', 'direction', 'is', 'excellent', ',', 'the', 'film', 'has', 'more', 'than', 'enough', 'visual', 'style', 'to', 'match', 'the', 'pre', '##ten', '##tious', '##ness', 'of', 'its', 'characters', 'and', 'is', 'really', 'good', 'at', 'building', 'tension', '.', 'the', 'music', 'by', 'colin', 'ste', '##tson', 'is', 'fantastic', ',', 'striking', 'a', 'unusual', 'balance', 'between', 'beautiful', 'and', 'un', '##ner', '##ving', '.', '[SEP]']

Note that BERT-based models also operate with special tokens:


| Token | Token ID | Meaning |
|---|---|---|
| [CLS] | 101 | Beginning of input |
| [SEP] | 102 | End of input or sentence |
| [MASK] | 103 | Masked token the model should predict |
| [PAD] | 0 | Padding |
| [UNK] | 100 | Unknown token not in the vocabulary |
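As a rough sketch of how tokens, ids and special tokens fit together, the snippet below maps a token list to ids using a tiny slice of the vocabulary copied from the ids and tokens printed above. The helper `toy_convert_tokens_to_ids` is hypothetical and, for illustration, adds `[CLS]` and `[SEP]` itself.

```python
# A tiny slice of the vocabulary, copied from the ids and tokens printed above.
TOY_VOCAB_IDS = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
                 "the": 1996, "menu": 12183}


def toy_convert_tokens_to_ids(tokens, vocab=TOY_VOCAB_IDS):
    """Map tokens to ids and wrap the sequence in [CLS] ... [SEP]."""
    unk_id = vocab["[UNK]"]
    body = [vocab.get(token, unk_id) for token in tokens]  # unseen tokens -> [UNK]
    return [vocab["[CLS]"]] + body + [vocab["[SEP]"]]


print(toy_convert_tokens_to_ids(["the", "menu"]))
print(toy_convert_tokens_to_ids(["the", "zzz"]))  # "zzz" is not in the toy vocabulary
```

The first call reproduces the start and end of the pattern seen in the first review's input ids: 101 at the beginning and 102 at the end.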


6. Predict the sentiment of the two reviews. In order to do this, import torch, and define the output object by calling the model on the input ids and the attention mask.

Now we are ready to do some sentiment prediction. We import torch and run the model on input_ids and attention_mask. The attention mask is a binary tensor that marks real tokens with 1 and padded positions with 0, so that the model does not attend to the padding.

In [14]:
# prediction of sentiment
import torch
output = model(input_ids = encoding['input_ids'], attention_mask = encoding['attention_mask'])
In [15]:
print("Predicted logits:\n\n", output['logits']) # logits
Predicted logits:

 tensor([[-3.1107,  3.2654],
        [ 1.8161, -1.4221]], grad_fn=<AddmmBackward0>)
In [16]:
print("Predicted probabilities:\n\n", torch.nn.functional.softmax(output['logits'], dim=-1)) # from logits to probabilities
Predicted probabilities:

 tensor([[0.0017, 0.9983],
        [0.9622, 0.0378]], grad_fn=<SoftmaxBackward0>)
In [17]:
prediction = torch.argmax(output['logits'], 1) # from logits to binary class
print("Predicted classes:\n", prediction)
Predicted classes:
 tensor([1, 0])

How do the output sentiments and probabilities compare to the off-the-shelf sentiment classification pipeline we used in Part 1?
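As a check, the step from logits to probabilities and labels can be reproduced by hand. The sketch below uses only the standard library and the logits printed above (rounded to four decimals). The `ID2LABEL` mapping is an assumption inferred from the outputs, where index 1 corresponds to the positive review.

```python
import math

ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}  # class order inferred from the outputs above


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [[-3.1107, 3.2654], [1.8161, -1.4221]]  # copied from the printed logits
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
    print(ID2LABEL[best], round(probs[best], 4))
```

Up to rounding, the printed labels and probabilities should agree with the scores returned by the off-the-shelf pipeline in Part 1.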

Part 3: Feature importance with SHAP¶

Now that we have classified our two reviews, we might want to explain DistilBERT's predictions using SHapley Additive exPlanations (SHAP).

7. Install the shap module, import shap.Explainer and feed it the sentiment_pipeline model. Pass the two movie reviews as input for the explainer.

Note. Computing Shapley values for DistilBERT on our two reviews should take about 5 minutes. For longer texts or larger datasets, the computation can become very expensive.
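To see where the cost comes from, here is a minimal sketch of exact Shapley values for a toy scoring function over three tokens. Everything in it (`toy_score`, `shapley_values`, the weights) is hypothetical and only illustrates the definition. The shap library approximates these values for real models rather than enumerating every coalition, since the number of coalitions doubles with each additional token.

```python
from itertools import combinations
from math import factorial


def toy_score(present):
    """Hypothetical 'positive sentiment' score for a set of visible tokens."""
    score = 0.0
    if "good" in present:
        score += 0.4
    if "good" in present and "not" in present:
        score -= 0.7  # negation flips the contribution of "good"
    return score


def shapley_values(tokens, value_fn):
    """Exact Shapley values by enumerating every coalition (tokens must be distinct)."""
    n = len(tokens)
    values = {}
    for token in tokens:
        others = [t for t in tokens if t != token]
        phi = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                with_token = value_fn(set(coalition) | {token})
                without_token = value_fn(set(coalition))
                phi += weight * (with_token - without_token)
        values[token] = phi
    return values


tokens = ["not", "good", "film"]
phi = shapley_values(tokens, toy_score)
print(phi)
# The attributions add up to the gap between the full input and the empty input.
print(sum(phi.values()), toy_score(set(tokens)) - toy_score(set()))
```

In this toy setting the negative interaction is shared between "not" and "good", while "film" receives no credit because it never changes the score.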

In [18]:
!pip install -q shap
import shap
explainer = shap.Explainer(sentiment_pipeline)
shap_values = explainer([review1, review2])
Partition explainer: 3it [05:11, 155.63s/it]

8. A nice thing about the shap module is that it comes with a built-in visualizer. Use shap.plots.text to visualize the shap values for the first and the second movie review.

In [19]:
shap.plots.text(shap_values[0]) # first review
[Output: interactive SHAP text plot for review 1, with one panel per output class.]

NEGATIVE: base value 0.582892, f_NEGATIVE(inputs) = 0.
Largest contributions toward NEGATIVE: "too good" (0.052), "t" (0.045), "pathetic" (0.041), "only" (0.024), "that" (0.024).
Largest contributions away from NEGATIVE: "terrific" (-0.06), "between beautiful and" (-0.06), "is really good at" (-0.058), "but" (-0.055), "extremely" (-0.045).

POSITIVE: base value 0, f_POSITIVE(inputs) = 0.998301.
Largest contributions toward POSITIVE: "between beautiful and" (0.084), "but" (0.061), "is really good at" (0.058), "extremely" (0.054), "terrific" (0.05).
Largest contributions away from POSITIVE: "too good" (-0.039), "t" (-0.037), "pathetic" (-0.032), "only" (-0.031), "that" (-0.027).
In [20]:
shap.plots.text(shap_values[1]) # second review
[Output: interactive SHAP text plot for review 2, with one panel per output class.]

NEGATIVE: base value 0.586136, f_NEGATIVE(inputs) = 0.962244.
Largest contributions toward NEGATIVE: "not" (0.128), "disappointing" (0.101), "failed" (0.074), "This" (0.064), "did" (0.06).
Largest contributions away from NEGATIVE: "nicely" (-0.099), "good" (-0.073), "building" (-0.066), "good" (-0.065), "film was" (-0.052).

POSITIVE: base value 0, f_POSITIVE(inputs) = 0.
Largest contributions toward POSITIVE: "nicely" (0.123), "good" (0.099), "film was" (0.078), "building" (0.077), "good" (0.05).
Largest contributions away from POSITIVE: "not" (-0.124), "disappointing" (-0.11), "This" (-0.076), "failed" (-0.06), "did" (-0.054).
inputs
0.002 / 21
This looked like an interesting film based on the trailer and the first half of it was just that.
0.002 / 3
The tension and
0.033 / 2
suspense was
0.077
building
0.123
nicely
0.012
.
0.002 / 22
There were little dribs and drabs and hints of what might be coming without being too obvious.
0.008
The
0.008
acting
0.006
from
0.006
everyone
0.012 / 2
in the
0.078 / 2
film was
0.099
good
0.026
.
0.003 / 9
Even supporting characters with only a few lines.
0.002 / 20
Were well realized I remember thinking that I couldn't wait to see where it was all going.
-0.01 / 5
Sadly it didn't
-0.011 / 4
really go anywhere.
-0.01 / 5
It all unwound
-0.007 / 5
in the second half.
0.002 / 5
The acting was still on
0.001
but
-0.042 / 2
the writing
-0.06
failed
-0.003
.
0.004 / 2
That'
0.012
s
0.008 / 2
the most
0.008 / 2
i can
0.033 / 2
say without
0.033 / 3
giving up any
0.006 / 2
spoilers
0.014
.
-0.001
And
-0.009
that
-0.007
was
-0.003
extra
-0.11
disappointing
-0.049
because
0.004
the
0.004
first
0.004 / 2
half was
0.016
so
0.05
good
-0.022
.
-0.076
This
-0.032
Menu
-0.054
did
-0.124
not
-0.033 / 2
deliver the
-0.006 / 3
meal as advertised
-0.028
.
0.0
0-0.2-0.4-0.60.20.40.600base value00fPOSITIVE(inputs)0.123 nicely 0.099 good 0.078 film was 0.077 building 0.05 good 0.033 suspense was 0.033 giving up any 0.033 say without 0.026 . 0.016 so 0.014 . 0.012 s 0.012 . 0.012 in the 0.008 i can 0.008 the most 0.008 The 0.008 acting 0.006 from 0.006 everyone 0.006 spoilers 0.004 That' 0.004 the 0.004 first 0.004 half was 0.003 Even supporting characters with only a few lines. 0.002 The acting was still on 0.002 Were well realized I remember thinking that I couldn't wait to see where it was all going. 0.002 The tension and 0.002 This looked like an interesting film based on the trailer and the first half of it was just that. 0.002 There were little dribs and drabs and hints of what might be coming without being too obvious. 0.001 but -0.124 not -0.11 disappointing -0.076 This -0.06 failed -0.054 did -0.049 because -0.042 the writing -0.033 deliver the -0.032 Menu -0.028 . -0.022 . -0.011 really go anywhere. -0.01 Sadly it didn't -0.01 It all unwound -0.009 that -0.007 was -0.007 in the second half. -0.006 meal as advertised -0.003 . -0.003 extra -0.001 And
inputs
0.002 / 21
This looked like an interesting film based on the trailer and the first half of it was just that.
0.002 / 3
The tension and
0.033 / 2
suspense was
0.077
building
0.123
nicely
0.012
.
0.002 / 22
There were little dribs and drabs and hints of what might be coming without being too obvious.
0.008
The
0.008
acting
0.006
from
0.006
everyone
0.012 / 2
in the
0.078 / 2
film was
0.099
good
0.026
.
0.003 / 9
Even supporting characters with only a few lines.
0.002 / 20
Were well realized I remember thinking that I couldn't wait to see where it was all going.
-0.01 / 5
Sadly it didn't
-0.011 / 4
really go anywhere.
-0.01 / 5
It all unwound
-0.007 / 5
in the second half.
0.002 / 5
The acting was still on
0.001
but
-0.042 / 2
the writing
-0.06
failed
-0.003
.
0.004 / 2
That'
0.012
s
0.008 / 2
the most
0.008 / 2
i can
0.033 / 2
say without
0.033 / 3
giving up any
0.006 / 2
spoilers
0.014
.
-0.001
And
-0.009
that
-0.007
was
-0.003
extra
-0.11
disappointing
-0.049
because
0.004
the
0.004
first
0.004 / 2
half was
0.016
so
0.05
good
-0.022
.
-0.076
This
-0.032
Menu
-0.054
did
-0.124
not
-0.033 / 2
deliver the
-0.006 / 3
meal as advertised
-0.028
.
0.0

Features highlighted in red increase the predicted probability of the selected output class, while features highlighted in blue lower it.
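The colors follow the sign of each SHAP value. As a minimal sketch, the snippet below takes a handful of contributions copied from the NEGATIVE plot output and labels them the way the plot colors them. The helper `color_for` is hypothetical and not part of the shap package. The last lines rely on the additivity property of SHAP values (base value plus all contributions equals the model output) to show how much the tokens must contribute in total.

```python
# Contributions towards the NEGATIVE class, copied from the plot output.
contributions = {
    "not": 0.128,
    "disappointing": 0.101,
    "failed": 0.074,
    "nicely": -0.099,
    "good": -0.073,
}


def color_for(shap_value):
    """Hypothetical helper: mimic the red/blue coloring of the text plot."""
    if shap_value > 0:
        return "red"   # pushes the prediction towards the selected class
    if shap_value < 0:
        return "blue"  # pushes the prediction away from the selected class
    return "neutral"


colors = {token: color_for(value) for token, value in contributions.items()}

# Additivity: base value + sum of all SHAP values = model output.
base_value = 0.586136
model_output = 0.962244
total_contribution = model_output - base_value

for token, value in sorted(contributions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{token:>14}  {value:+.3f}  {colors[token]}")
print(f"All SHAP values together should sum to about {total_contribution:.6f}")
```

Only five tokens are listed here, so their sum does not reach the total; the remaining tokens of the review account for the rest.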

Part 4: Fine-tuning BERT using the IMDb dataset¶

Now let's perform sentiment analysis on the IMDb dataset, starting from the DistilBERT model behind the off-the-shelf sentiment analysis pipeline.

Since the DistilBERT model we are using was trained on the Stanford Sentiment Treebank (SST) dataset, we also fine-tune it for IMDb movie reviews. In practice, this might not be necessary for this particular application, but it is useful to see how it can be done.

In [21]:
!pip install -q datasets
!pip install -q transformers
!pip install -q evaluate
In [22]:
from datasets import load_dataset

from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

import evaluate
import numpy as np

9. Load the IMDb dataset and sample 10% of the train and test sets.

In [23]:
imdb = load_dataset("imdb")
del imdb['unsupervised']
In [24]:
imdb["test"][0] # examine the first instance in test
Out[24]:
{'text': 'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as they have to always say "Gene Roddenberry\'s Earth..." otherwise people would not continue watching. Roddenberry\'s ashes must be turning in their orbit as this dull, cheap, poorly edited (watching it without advert breaks really brings this home) trudging Trabant of a show lumbers into space. Spoiler. So, kill off a main character. And then bring him back as another actor. Jeeez! Dallas all over again.',
 'label': 0}
In [25]:
imdb.shape # inspect dimensions full data
Out[25]:
{'train': (25000, 2), 'test': (25000, 2)}

Because fine-tuning on the entire IMDb dataset would be too resource-intensive to run in this practical, we will work with a random 10% sample of the original train and test sets.

In [26]:
from datasets import DatasetDict
# Build a new DatasetDict so that the full `imdb` object is left unchanged
imdb_sample = DatasetDict({
    split: imdb[split].shuffle(seed=42).select(range(int(0.1 * len(imdb[split]))))
    for split in ("train", "test")})
In [27]:
imdb_sample.shape
Out[27]:
{'train': (2500, 2), 'test': (2500, 2)}
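Because the sample is drawn at random, the two classes are no longer guaranteed to be perfectly balanced. A quick check along the following lines may be useful. `label_counts` is a hypothetical helper, and the toy list only stands in for `imdb_sample["train"]["label"]`.

```python
from collections import Counter


def label_counts(labels):
    """Return {label: (count, share)} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (count, count / total) for label, count in sorted(counts.items())}


# Toy labels for illustration (0 = negative, 1 = positive).
toy_labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
summary = label_counts(toy_labels)
print(summary)

# In the notebook, the same helper could be applied to the sampled data:
# label_counts(imdb_sample["train"]["label"])
```

If the shares are far from 50/50, accuracy becomes harder to interpret, and a different seed or a stratified sample may be worth considering.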

10. Create a preprocessing function to tokenize text and truncate sequences to be no longer than DistilBERT’s maximum input length. To apply the preprocessing function over the entire dataset, use the map function of the Datasets library. You can speed up map by setting batched=True to process multiple elements of the dataset at once.

In [28]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
In [29]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
In [30]:
tokenized_imdb = imdb_sample.map(preprocess_function, batched=True)
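To make the effect of truncation=True more concrete, here is a simplified stand-in for what happens to an over-long review. It is not the tokenizer's actual implementation: `truncate_ids` is a hypothetical helper, the token ids of the reviews are made up, and the sketch assumes DistilBERT's maximum input length of 512 tokens and the [CLS] and [SEP] ids (101 and 102) of the uncased BERT vocabulary.

```python
MAX_LENGTH = 512          # assumed maximum input length of DistilBERT
CLS_ID, SEP_ID = 101, 102  # assumed ids of [CLS] and [SEP]


def truncate_ids(content_ids, max_length=MAX_LENGTH):
    """Keep as many content tokens as fit once [CLS] and [SEP] are added."""
    room = max_length - 2  # two positions are reserved for the special tokens
    return [CLS_ID] + list(content_ids[:room]) + [SEP_ID]


short_review = list(range(1000, 1020))  # 20 made-up token ids
long_review = list(range(1000, 1800))   # 800 made-up token ids

short_encoded = truncate_ids(short_review)
long_encoded = truncate_ids(long_review)

print(len(short_encoded), len(long_encoded))
```

The short review is kept whole, while everything beyond the first 510 content tokens of the long review is dropped. For long IMDb reviews this means the model never sees the end of the text.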

11. Load the accuracy metric from the evaluate library to evaluate the model performance. Define a function that takes the output predictions and true labels from a machine learning model, converts the predictions into class indices, and then calculates and returns the accuracy score.

In [31]:
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
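To see what compute_metrics does with the model output, the following sketch walks through the same two steps on a handful of made-up logits. It uses plain NumPy instead of the evaluate library, so the numbers are for illustration only.

```python
import numpy as np

# Made-up logits for four reviews; columns are [NEGATIVE, POSITIVE].
logits = np.array([
    [2.1, -1.3],
    [-0.4, 0.9],
    [0.2, 0.1],
    [-1.5, 1.7],
])
labels = np.array([0, 1, 1, 1])  # made-up true labels

# Step 1: pick the class with the highest logit for each review.
predicted = np.argmax(logits, axis=1)

# Step 2: accuracy is the share of predictions that match the labels.
accuracy_value = float((predicted == labels).mean())

print(predicted, accuracy_value)
```

The third review is the only one classified incorrectly, so three out of four predictions are correct.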

12. Load the DistilBERT model we used earlier with AutoModelForSequenceClassification and fine-tune it on the IMDb dataset using the Trainer class. You can use the following training arguments:

In [32]:
training_args = TrainingArguments(
    output_dir="tuned_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    logging_steps = 100,
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False)

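We create two dictionaries, id2label and label2id, which map class indices to label names and back, so that the model reports its predictions with readable labels.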

In [33]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
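As a rough sketch of how such a mapping is used, the snippet below turns a pair of made-up logits into a label and a score, similar in shape to the pipeline output seen earlier. The function `to_prediction` is hypothetical and only imitates this behavior with NumPy.

```python
import numpy as np

id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def to_prediction(logits):
    """Hypothetical helper: softmax over the logits, then look up the label."""
    logits = np.asarray(logits, dtype=float)
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    probs = exp / exp.sum()
    index = int(probs.argmax())
    return {"label": id2label[index], "score": float(probs[index])}


prediction = to_prediction([-1.0, 3.0])  # made-up logits
print(prediction)
```

Without id2label, the same prediction would be reported with a generic name such as LABEL_1.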

Fine-tuning the model should take around 5 minutes.

In [45]:
from transformers import set_seed
set_seed(137)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", num_labels=2, id2label=id2label, label2id=label2id)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics)

trainer.train()
trainer.save_model()
This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
[314/314 05:44, Epoch 2/2]
Epoch  Training Loss  Validation Loss  Accuracy
1      0.323600       0.233133         0.912000
2      0.145300       0.294243         0.908400
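The 314 steps in the progress output follow directly from the training arguments. The arithmetic below is a small sketch that assumes training on a single device without gradient accumulation.

```python
import math

train_examples = 2500  # size of the sampled training set
batch_size = 16        # per_device_train_batch_size
epochs = 2             # num_train_epochs

# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```

With logging_steps=100, the training loss is therefore logged only a few times over the whole run.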

How do the results compare to the IMDb sentiment classification we performed using different neural network architectures?

13. Load the model you just fine-tuned into the pipeline and classify a sentence of your choice.

In [48]:
classifier = pipeline("sentiment-analysis", model="tuned_model")
classifier("The movie was an experience.")
Out[48]:
[{'label': 'POSITIVE', 'score': 0.9790797233581543}]

Let's compare the output to our initial model.

In [49]:
sentiment_pipeline = pipeline("sentiment-analysis", model = 'distilbert-base-uncased-finetuned-sst-2-english')
sentiment_pipeline("The movie was an experience.")
Out[49]:
[{'label': 'POSITIVE', 'score': 0.9962561130523682}]

We see that our fine-tuned model gives a slightly lower POSITIVE score (0.979) than the original SST-2 model (0.996). Which one do you agree with?

Remember: Be on the lookout for bias and other limitations!¶

Pre-trained transformer models have been made available for many different tasks and by many different people. It is important to be aware that there may be bias and other limitations in the models that could affect your results.

DistilBERT is known to produce biased predictions that target underrepresented populations. For instance, for sentences like "This film was filmed in COUNTRY", DistilBERT for binary classification will give radically different probabilities for the positive label depending on the country (0.89 if the country is France, but 0.08 if the country is Afghanistan), even though nothing in the input indicates such a strong semantic shift.

See:

Risks, Limitations and Biases

Aurélien Géron's Sentiment Bias Map

In [50]:
sentiment_pipeline("French movie")
Out[50]:
[{'label': 'POSITIVE', 'score': 0.9987333416938782}]
In [51]:
sentiment_pipeline("Iraqi movie")
Out[51]:
[{'label': 'NEGATIVE', 'score': 0.6413735747337341}]
In [46]:
classifier("French movie")
Out[46]:
[{'label': 'POSITIVE', 'score': 0.9880437254905701}]
In [47]:
classifier("Iraqi movie")
Out[47]:
[{'label': 'POSITIVE', 'score': 0.6644718050956726}]
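To compare the four outputs on one scale, it helps to express each of them as the probability of the POSITIVE label. The sketch below does this with the outputs copied from the cells above. `positive_probability` is a hypothetical helper and assumes a binary classifier whose two probabilities sum to 1.

```python
def positive_probability(prediction):
    """Hypothetical helper: P(POSITIVE) from a binary pipeline output."""
    score = prediction["score"]
    return score if prediction["label"] == "POSITIVE" else 1.0 - score


# Outputs copied from the cells above.
original = {
    "French movie": {"label": "POSITIVE", "score": 0.9987333416938782},
    "Iraqi movie": {"label": "NEGATIVE", "score": 0.6413735747337341},
}
fine_tuned = {
    "French movie": {"label": "POSITIVE", "score": 0.9880437254905701},
    "Iraqi movie": {"label": "POSITIVE", "score": 0.6644718050956726},
}


def gap(outputs):
    """Difference in P(POSITIVE) between the two prompts."""
    return (positive_probability(outputs["French movie"])
            - positive_probability(outputs["Iraqi movie"]))


gap_original = gap(original)
gap_fine_tuned = gap(fine_tuned)
print(f"original: {gap_original:.3f}, fine-tuned: {gap_fine_tuned:.3f}")
```

For this single pair of prompts the gap is smaller after fine-tuning, but it has not disappeared. Two prompts are far too few to draw a general conclusion about bias in either model.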

When in doubt, fine-tune and use feature importance measures!

Further reading / materials¶

  • How to fine-tune a model: https://huggingface.co/docs/transformers/training

Credits¶

Many code and quote blocks are adapted from the Hugging Face documentation website, which contains a lot of additional information and is a great resource for learners.