Practical 9: LLMs pre-training, prompting, & learning from human feedback¶
Dong Nguyen
Applied Text Mining - Utrecht Summer School
Settings¶
To run this notebook, use a GPU or TPU. In Google Colab, select the T4 GPU runtime ('Change runtime type').
We're going to use the Hugging Face Transformers library, which is a very popular Python library/platform for working with language models. See more at https://huggingface.co/docs/transformers/en/index
import os
os.environ["TRANSFORMERS_VERBOSITY"] = "error"
!pip install transformers
!pip install datasets==2.15.0
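Optionally, check that a GPU runtime is active before loading the model (a quick sanity check using torch, which comes pre-installed on Colab):
import torch
# True means a CUDA GPU is available; loading the model with device_map="cuda" will fail otherwise
print(torch.cuda.is_available())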
Phi-3-mini-4k-instruct¶
The code below loads a pre-trained LLM (Phi-3-mini-4k-instruct; 3.8B parameters).
Take a look at https://huggingface.co/microsoft/Phi-3-mini-4k-instruct to read more about Phi-3-mini-4k-instruct.
Tip: Run the code below (which can take anywhere from a few minutes up to 10 minutes), and read the webpage while it loads.
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
"microsoft/Phi-3-mini-4k-instruct",
device_map="cuda", # store the model on GPU
torch_dtype="auto", # automatically determines the best data type
trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
from transformers import pipeline
generator = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
return_full_text=True,
max_new_tokens=500,
do_sample=False
)
Let's prompt the model:
messages = [
{"role": "user", "content": "Where is Utrecht?"}
]
output = generator(messages)
print(output)
[{'generated_text': [{'role': 'user', 'content': 'Where is Utrecht?'}, {'role': 'assistant', 'content': ' Utrecht is a city in the Netherlands, located in the central part of the country. It is the capital of the province of Utrecht and serves as an important transportation hub, with the Utrecht Centraal railway station being one of the busiest in Europe. The city is also known for its historical significance, as it was the site of the signing of the Treaty of Utrecht in 1713, which ended the War of the Spanish Succession.'}]}]
Experiment with the following:
- return_full_text controls whether the input prompt is returned as well. Experiment with True and False.
- max_new_tokens: the maximum number of tokens to generate. Experiment with different values.
- Different prompts. Experiment with both factual and more subjective questions.
- Deterministic generation (do_sample=False) versus non-deterministic generation (do_sample=True). When you sample, you can also set the temperature parameter; try out different values (see the temperature sweep sketch after the example below).
## Subjective prompts, for example: "How are you feeling?", "What is the most beautiful name in the world?"
## Factual prompts: "What is 20 * 5?", "How many people live in the Netherlands?"
generator = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
return_full_text=False,
max_new_tokens=500,
do_sample=True, ## apply sampling
temperature = 0.8,
)
messages = [
{"role": "user", "content": "How many people live in the Netherlands?"}
]
print(generator(messages))
[{'generated_text': " As of my knowledge cutoff in 2023, the population of the Netherlands is approximately 17.4 million people. This figure is based on estimates provided by the Central Bureau of Statistics (CBS) in the Netherlands, which regularly updates population counts and projections. It's important to note that population figures can change due to various factors including birth rates, death rates, and migration. For the most current data, one should refer to the latest reports from the CBS or other authoritative sources."}]
System message¶
With the system message we can set the overall behavior of the model.
messages = [
{"role": "system", "content": "Respond as if you're a 15-year old girl named Lisa, who loves thrillers."},
{"role": "user", "content": "What is your favorite movie?"}
]
print(generator(messages)[0]['generated_text'])
My favorite movie is "Inception" directed by Christopher Nolan. It's a mind-bending thriller that keeps you guessing until the very end. The concept of dream sharing and the idea of a heist within a dream layer is absolutely fascinating, and the visual effects are stunning. Plus, the performances by the cast, especially Leonardo DiCaprio and Ellen Page, are top-notch. The soundtrack by Hans Zimmer also adds to the intense atmosphere. It's the kind of movie that makes me want to dissect every scene to understand the intricate plot and the clever twists that keep challenging my perception of reality.
messages = [
{"role": "system", "content": "You're a 50-year-old man named Dave, who has a dry sense of humor and loves sci-fi movies."},
{"role": "user", "content": "What is your favorite movie?"}
]
print(generator(messages)[0]['generated_text'])
As an AI, I don't have personal feelings or tastes, but if I were to simulate a response based on popular opinion and my programming, I might say: One of the universally acclaimed sci-fi movies is "Blade Runner," directed by Ridley Scott. The film's blend of neo-noir aesthetics and thought-provoking narrative about artificial intelligence and identity makes it a favorite among fans of the genre.
messages = [
{"role": "system", "content": "You are a high school teacher."},
{"role": "user", "content": "Explain photosynthesis to 13 year old. "}
]
print(generator(messages)[0]['generated_text'])
Photosynthesis is like a magic recipe plants use to make their food. Imagine you're a plant with leaves. Instead of going to the grocery store, you make your own snacks using sunlight. Here's how it works: 1. **Sunlight: The Solar Power** - Just like we use electricity to power our gadgets, plants use sunlight to get started. They catch the sun's rays using their leaves, which act like solar panels. 2. **Water: The Ingredient** - Plants drink water through their roots, just like you drink water with a straw. This water travels all the way up their stems to reach the leaves. 3. **Carbon Dioxide: The Additional Ingredient** - Although we can't see it, we're always breathing out carbon dioxide (CO2), which plants love! They take it in through tiny holes in their leaves called stomata. 4. **The Big Reaction** - In the leaves, sunlight goes to work and changes water and CO2 into sugar (a sweet food source) and oxygen. This happens in tiny structures called chloroplasts, which contain a green pigment called chlorophyll that captures sunlight. 5. **Oxygen: The Byproduct** - Oxygen is what we breathe. It's made during photosynthesis and released into the air. So, when you breathe out, you're actually helping plants! That's photosynthesis: a super cool way plants make food using sunlight, water, and carbon dioxide and give us oxygen in return. It's like they're making energy for themselves and sharing with us!
Experiment with the following:
- Experiment with different prompts and system messages, to simulate certain personas or to steer the behavior of the model.
Simulate a chat history¶
We can input a list of system/user/assistant messages to simulate a longer conversation history.
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who wrote 'Pride and Prejudice'?"},
{"role": "assistant", "content": "Jane Austen wrote 'Pride and Prejudice'."},
{"role": "user", "content": "What else did she write?"}
]
print(generator(messages)[0]['generated_text'])
Jane Austen, apart from her most famous work 'Pride and Prejudice', also authored 'Sense and Sensibility', 'Mansfield Park', 'Emma', 'Northanger Abbey', and 'Persuasion'. These novels contribute to her reputation as a literary figure who explored the dynamics of class, gender, and marital relations in the context of British society during the late 18th and early 19th centuries.
Exercise: Experiment with a few more examples where context can make a difference
messages = [
{"role": "system", "content": "You are a helpful assistant that explains Python programming."},
{"role": "user", "content": "What is a list comprehension in Python?"},
{"role": "assistant", "content": "A list comprehension is a concise way to create lists using a single line of code. For example: [x for x in range(5)] creates [0, 1, 2, 3, 4]."},
{"role": "user", "content": "Can you give me one that filters even numbers?"}
]
print(generator(messages)[0]['generated_text'])
Sure! [x for x in range(10) if x % 2 == 0] creates [0, 2, 4, 6, 8].
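To see how much the history matters, you can also send the same follow-up question without the earlier turns and compare the answers (a small sketch; without context the model has to guess who "she" refers to):
# Same follow-up question, but without the preceding conversation
messages_no_context = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What else did she write?"}
]
print(generator(messages_no_context)[0]['generated_text'])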
Classification¶
We're going to experiment with sentiment classification and load the SST-2 dataset, which contains sentences from movie reviews (negative=0, positive=1).
from datasets import load_dataset
# Load a sentiment dataset, only the first 10 instances
dataset = load_dataset("glue", "sst2", split="validation[:10]")
# Pipeline for zero-shot prompting
classification_generator = pipeline(
"text-generation",
model= model,
tokenizer= tokenizer,
max_new_tokens= 50,
do_sample= False,
return_full_text = False
)
Print the first two instances
dataset[:2]
{'sentence': ["it 's a charming and often affecting journey . ", 'unflinchingly bleak and desperate '], 'label': [1, 0], 'idx': [0, 1]}
# Format and run examples
for example in dataset:
text = example["sentence"]
prompt = f"""### Instruction:
Is the sentence below Positive or Negative? Only answer with Positive or Negative.
### Text:
"{text}"
### Sentiment:"""
messages = [
{"role": "user", "content": prompt}
]
output = classification_generator(messages)[0]['generated_text']
print(f"Text: {text}")
print(f"Predicted Sentiment: {output}")
print("---" * 10)
Text: it 's a charming and often affecting journey .
Predicted Sentiment: Positive
------------------------------
Text: unflinchingly bleak and desperate
Predicted Sentiment: Negative
------------------------------
Text: allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker .
Predicted Sentiment: Positive
------------------------------
Text: the acting , costumes , music , cinematography and sound are all astounding given the production 's austere locales .
Predicted Sentiment: Positive
------------------------------
Text: it 's slow -- very , very slow .
Predicted Sentiment: Negative
------------------------------
Text: although laced with humor and a few fanciful touches , the film is a refreshingly serious look at young women .
Predicted Sentiment: Positive
------------------------------
Text: a sometimes tedious film .
Predicted Sentiment: Negative
------------------------------
Text: or doing last year 's taxes with your ex-wife .
Predicted Sentiment: Negative
------------------------------
Text: you do n't have to know about music to appreciate the film 's easygoing blend of comedy and romance .
Predicted Sentiment: Positive
------------------------------
Text: in exactly 89 minutes , most of which passed as slowly as if i 'd been sitting naked on an igloo , formula 51 sank from quirky to jerky to utter turkey .
Predicted Sentiment: Negative
------------------------------
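Since the dataset also contains the gold labels, you can score the zero-shot predictions. A minimal sketch, assuming the model answers with "Positive" or "Negative" as instructed (any other answer is simply mapped to the negative class):
# Score the zero-shot predictions against the SST-2 gold labels (positive=1, negative=0)
correct = 0
for example in dataset:
    text = example["sentence"]
    prompt = f"""### Instruction:
Is the sentence below Positive or Negative? Only answer with Positive or Negative.
### Text:
"{text}"
### Sentiment:"""
    messages = [{"role": "user", "content": prompt}]
    output = classification_generator(messages)[0]['generated_text']
    predicted = 1 if "positive" in output.lower() else 0
    correct += int(predicted == example["label"])
print(f"Accuracy on {len(dataset)} instances: {correct / len(dataset):.2f}")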
Exercise: Experiment with different prompts; for example, you can ask for an explanation.
# Pipeline for zero-shot prompting
classification_generator_expl = pipeline(
"text-generation",
model= model,
tokenizer= tokenizer,
max_new_tokens= 200, #increase number of tokens
do_sample= False,
return_full_text = False
)
# Format and run examples
for example in dataset:
text = example["sentence"]
prompt = f"""### Instruction:
Is the sentence below Positive or Negative? Explain your answer
### Text:
"{text}"
### Sentiment:"""
messages = [
{"role": "user", "content": prompt}
]
output = classification_generator_expl(messages)[0]['generated_text']
print(f"Text: {text}")
print(f"Predicted Sentiment: {output}")
print("---" * 10)
Text: it 's a charming and often affecting journey .
Predicted Sentiment: Positive. The sentiment of the given text is positive because it uses words like "charming" and "affecting," which have positive connotations. "Charming" implies that the journey is pleasant and enjoyable, while "affecting" suggests that it has a strong emotional impact, which can be seen as a positive experience.
------------------------------
Text: unflinchingly bleak and desperate
Predicted Sentiment: Negative The sentiment of the given text "unflinchingly bleak and desperate" is negative. This is because the words "bleak" and "desperate" both carry negative connotations. "Bleak" implies a lack of hope or optimism, while "desperate" suggests a sense of urgency or extreme need. The adverb "unflinchingly" further emphasizes the intensity of these negative emotions, indicating that they are felt strongly and without hesitation. Overall, the combination of these words creates a negative sentiment.
------------------------------
Text: allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker .
Predicted Sentiment: Positive. The sentiment of the given sentence is positive because it expresses hope and optimism about Nolan's potential to have a successful career as a filmmaker. The use of words like "hope," "major career," and "inventive" contribute to the positive sentiment.
------------------------------
Text: the acting , costumes , music , cinematography and sound are all astounding given the production 's austere locales .
Predicted Sentiment: Positive. The sentiment of the given text is positive because it praises various aspects of the production, such as acting, costumes, music, cinematography, and sound. The use of the word "astounding" indicates that the author is impressed and appreciates the quality of these elements, despite the production's austere locales.
------------------------------
Text: it 's slow -- very , very slow .
Predicted Sentiment: The sentiment of the given text is Negative. The text expresses dissatisfaction or disappointment with the speed, using the words "slow" and "very, very slow." These words indicate a negative sentiment towards the subject being discussed.
------------------------------
Text: although laced with humor and a few fanciful touches , the film is a refreshingly serious look at young women .
Predicted Sentiment: The sentiment of the given text is Positive. The text acknowledges that the film has humor and fanciful touches, which are generally considered positive aspects. Moreover, it describes the film as a "refreshingly serious look at young women," which implies that the film offers a unique and commendable perspective on its subject matter. Overall, the text presents the film in a favorable light.
------------------------------
Text: a sometimes tedious film .
Predicted Sentiment: Negative The sentiment of the given text, "a sometimes tedious film," is negative. The word "tedious" implies that the film can be boring or monotonous at times, which is generally considered a negative aspect when evaluating a film.
------------------------------
Text: or doing last year 's taxes with your ex-wife .
Predicted Sentiment: Negative The sentiment of the given text is negative because it implies a potentially uncomfortable or awkward situation where someone is having to deal with their ex-wife while doing their taxes. This situation can be seen as negative due to the emotional discomfort or tension that might arise from having to interact with an ex-spouse, especially in a professional or financial context.
------------------------------
Text: you do n't have to know about music to appreciate the film 's easygoing blend of comedy and romance .
Predicted Sentiment: The sentiment of the given sentence is Positive. The text suggests that even if someone does not have knowledge about music, they can still enjoy the film due to its easygoing blend of comedy and romance. The use of the word "easygoing" implies a relaxed and enjoyable experience, which contributes to the positive sentiment.
------------------------------
Text: in exactly 89 minutes , most of which passed as slowly as if i 'd been sitting naked on an igloo , formula 51 sank from quirky to jerky to utter turkey .
Predicted Sentiment: The sentiment of the given text is Negative. The text describes a situation where the speaker feels that time is passing very slowly, comparing it to sitting naked on an igloo. Additionally, the speaker uses negative terms like "jerky" and "utter turkey" to describe Formula 51, indicating dissatisfaction or disappointment with the subject.
------------------------------
Tokenizer¶
To get a sense of the tokenizer used, you can print the tokens
tokens = tokenizer("Where is Utrecht?")
print(tokens)
print(tokenizer.convert_ids_to_tokens(tokens['input_ids']))
{'input_ids': [6804, 338, 501, 2484, 2570, 29973], 'attention_mask': [1, 1, 1, 1, 1, 1]} ['▁Where', '▁is', '▁U', 'tre', 'cht', '?']
Exercise: Experiment with uncommon words, misspelled words, dialect words, or words that don't exist (see the sketch after these examples).
For example:
- I like this so much vs I like this so muhc
- This is so coooooool
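A small sketch to compare how such variants are split into tokens (the example strings, including the made-up word, are just illustrations):
# Tokenize a few variants and print the resulting tokens
examples = [
    "I like this so much",   # standard spelling
    "I like this so muhc",   # misspelled
    "This is so coooooool",  # elongated word
    "flurbination",          # made-up word (hypothetical example)
]
for sentence in examples:
    ids = tokenizer(sentence)["input_ids"]
    print(sentence)
    print(tokenizer.convert_ids_to_tokens(ids))
    print("---" * 10)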
Print a subset of the tokens in the vocabulary
vocab = tokenizer.get_vocab()
# Sort the vocabulary by token ID to get the "first" tokens
sorted_vocab = sorted(vocab.items(), key=lambda item: item[1])
# Print some tokens
for token, token_id in sorted_vocab[1000:1050]:
print(f"{token_id:>3}: {token}")
1000: ied 1001: ER 1002: ▁stat 1003: fig 1004: me 1005: ▁von 1006: ▁inter 1007: roid 1008: ater 1009: ▁their 1010: ▁bet 1011: ▁ein 1012: }\ 1013: "> 1014: ▁sub 1015: ▁op 1016: ▁don 1017: ty 1018: ▁try 1019: ▁Pro 1020: ▁tra 1021: ▁same 1022: ep 1023: ▁two 1024: ▁name 1025: old 1026: let 1027: ▁sim 1028: sp 1029: ▁av 1030: bre 1031: blem 1032: ey 1033: ▁could 1034: ▁cor 1035: ▁acc 1036: ays 1037: cre 1038: urr 1039: si 1040: ▁const 1041: ues 1042: }$ 1043: View 1044: ▁act 1045: ▁bo 1046: ▁ко 1047: ▁som 1048: ▁about 1049: land
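You can also check how large the vocabulary is:
# Total number of entries in the tokenizer's vocabulary
print(len(vocab))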
If you have the time: experiment with another model¶
You can experiment with the HuggingFaceTB/SmolLM3-3B model,
which was recently released: https://huggingface.co/HuggingFaceTB/SmolLM3-3B.
Note that extended thinking is enabled by default, which generates the output with a reasoning trace.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import gc
# If you run into an out-of-memory error, you can either restart the notebook
# and load only this model, or explicitly delete the previous model
# from memory:
# del model
# del tokenizer
# del generator
# gc.collect()
# torch.cuda.empty_cache()
# print(torch.cuda.memory_allocated())
# Load model and tokenizer
smol_model = AutoModelForCausalLM.from_pretrained(
"HuggingFaceTB/SmolLM3-3B",
device_map="cuda", # store the model on GPU
torch_dtype="auto", # automatically determines the best data type
trust_remote_code=False,
)
smol_tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
smol_generator = pipeline(
"text-generation",
model=smol_model,
tokenizer=smol_tokenizer,
return_full_text=False,
max_new_tokens=500,
do_sample=True, ## apply sampling
)
messages = [
{"role": "user", "content": "How are you?"}
]
print(smol_generator(messages))
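If you want to compare answers with and without the reasoning trace, the SmolLM3 model card describes how to toggle extended thinking; the sketch below assumes the /no_think system-prompt mechanism from that card (check the model card for the exact options):
# A sketch, assuming the /no_think system prompt documented on the SmolLM3 model card
# disables the reasoning trace; consult the model card if this behaves differently.
messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": "How are you?"}
]
print(smol_generator(messages)[0]['generated_text'])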