In this practical, we are first going to get acquainted with Python in Google Colab, and then we will do some text preprocessing! Are you looking for Python documentation to refresh your knowledge of programming? If so, you can check https://docs.python.org/3/reference/
Google Colaboratory, or "Colab" for short, allows you to write and execute Python in your browser, with zero configuration required, free access to GPUs, and easy sharing.
Colab notebooks are Jupyter notebooks that are hosted by Colab. You can find more detailed introductions to Colab here, but we will also cover the basics.
Here we are going to introduce Python and Google Colab a bit. If you are already familiar with Python, you can start with Question 11.
1. Open Colab and create a new empty notebook to work with Python 3!
Go to https://colab.research.google.com/ and log in with your account. Then click on "File $\rightarrow$ New notebook".
If you want to insert a new code chunk below the cell you are currently in, press Alt + Enter.
If you want to stop your code from running in Colab, press Ctrl + M I or simply click the stop button. To clear a cell, press Ctrl + A to select all the code of that particular cell and Ctrl + X to cut it. Now the cell is empty and can be deleted by using Ctrl + M D or by pressing the delete button. You can paste your code in a new code chunk and adjust it. NB: On MacBooks, use Cmd instead of Ctrl in shortcuts.
2. Text is also known as a string variable, or as an array of characters. Create a variable a with the text value of "Hello @Text Mining World! I'm here to learn everything, right?", and then print it!
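A minimal sketch of a possible answer:

a = "Hello @Text Mining World! I'm here to learn everything, right?"
print(a)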
3. Since this is an array, print the first and last character of your variable.
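Strings support indexing, so one possible answer is:

print(a[0])   # first character: 'H'
print(a[-1])  # last character: '?'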
4. Use the !pip install command and install the packages: numpy, nltk, gensim, and spacy.
Generally, you only need to install each package once on your computer and then simply load it; in Colab, however, you may need to reinstall a package whenever you reconnect to a runtime.
!pip install -q numpy
!pip install -q nltk
!pip install -q gensim
!pip install -q spacy
5. Import (load) the nltk package and use the function lower() to convert the characters in string a to their lowercase form and save the result into a new variable b.
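One way to do this (lower() is a built-in string method; nltk is imported here because the following questions rely on it):

import nltk

b = a.lower()
print(b)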
NB: nltk comes with many corpora, toy grammars, trained models, etc. A complete list is posted at http://nltk.org/nltk_data/. To install the data, after installing nltk, you can use the nltk.download() data downloader. We will make use of this in Question 8.
6. Use the string package to print the list of punctuation marks.
Punctuation marks can separate characters, words, phrases, or sentences. In some applications they are very important to the task at hand; in others they are redundant and should be removed!
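A possible solution:

import string

# string.punctuation holds: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
print(string.punctuation)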
7. Use the punctuation list to remove the punctuation marks from the lowercase form of our example string a. Name your variable c.
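One way to do it, filtering out every character that appears in string.punctuation:

c = "".join(ch for ch in b if ch not in string.punctuation)
print(c)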
8. Use the word_tokenize() function from nltk and tokenize string b. Compare that with the tokenization of string c.
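A sketch; word_tokenize() needs the punkt tokenizer models, so they are downloaded first (on recent nltk versions you may need 'punkt_tab' instead):

nltk.download('punkt')
from nltk.tokenize import word_tokenize

print(word_tokenize(b))
print(word_tokenize(c))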
We see that the main difference is in the punctuation marks; however, we also see that some words are now combined together in the tokenization of string c.
9. Use the function RegexpTokenizer() from nltk to tokenize the string b whilst removing punctuation marks. This way you will avoid unnecessary concatenations.
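For example, with a pattern that keeps only runs of word characters:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+")
print(tokenizer.tokenize(b))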
With this tokenizer, you get similar output as with tokenizing the string c.
10. Use the function sent_tokenize() from the nltk package and split the string b into sentences. Compare that with the sentence tokenization of string c.
An obvious question in your mind would be why sentence tokenization is needed when we have the option of word tokenization. Imagine you need to count the average number of words per sentence. How would you calculate that? For such a task, you need both the NLTK sentence tokenizer and the NLTK word tokenizer to compute the ratio. Such output serves as an important feature for machine training, as the answer would be numeric.
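A sketch of the sentence split and the ratio described above:

from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(b)
print(sentences)
print(sent_tokenize(c))

# average number of words per sentence
print(len(word_tokenize(b)) / len(sentences))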
Pre-processing a dataset is similar to pre-processing simple text strings. First, we need to get some data. For this, we can use our own dataset, scrape data from the web, or use social media APIs. There are also various websites with publicly available datasets.
Here, we want to analyze and pre-process the Taylor Swift song lyrics data from all her albums. The dataset can be downloaded from the course website or alternatively from Kaggle.
Upload taylor_swift_lyrics.csv to Google Colab. You can do this by clicking on the Files button on the very left side of Colab and dragging and dropping the data there, or by clicking the upload button. Alternatively, you can mount Google Drive and upload the dataset there.
11. Read the taylor_swift_lyrics.csv dataset. Check the dataframe using the head() and tail() functions.
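A minimal sketch with pandas, assuming the file sits in the Colab working directory:

import pandas as pd

# pass e.g. encoding="latin-1" if the default utf-8 read fails
df = pd.read_csv("taylor_swift_lyrics.csv")
print(df.head())
print(df.tail())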
12. Add a new column to the dataframe and name it Preprocessed Lyrics, then fill the column with the preprocessed text, including the steps in this and the following questions. First, replace the \n sequences with a space character.
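A possible start; the column name Lyrics is an assumption, so check the real name with df.columns first:

# use "\\n" instead if the file stores literal backslash-n sequences
df["Preprocessed Lyrics"] = df["Lyrics"].str.replace("\n", " ", regex=False)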
13. Write another custom function to remove the punctuation marks. You can use the previous method or make use of the function maketrans() from the string package.
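A sketch using str.maketrans():

def remove_punct(text):
    # translation table that maps every punctuation mark to None, i.e. deletes it
    return text.translate(str.maketrans("", "", string.punctuation))

df["Preprocessed Lyrics"] = df["Preprocessed Lyrics"].apply(remove_punct)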
14. Change all the characters to their lowercase forms. Think about why and when we need this step in our analysis.
15. List the 20 most frequent terms in this dataframe.
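One way to count the terms, using Counter from the standard library:

from collections import Counter

tokens = " ".join(df["Preprocessed Lyrics"]).split()
print(Counter(tokens).most_common(20))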
You see that these are mainly stop words. Before removing them, let's plot a wordcloud of our data.
16. Plot a wordcloud with at most 50 words using the WordCloud() function from the wordcloud package. Use the command ?WordCloud to check the help for this function.
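A minimal sketch, using matplotlib for display:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(max_words=50).generate(" ".join(df["Preprocessed Lyrics"]))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()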
17. Use the English stop word list from the nltk package to remove the stop words. Check the stop words and update them with your own optional list of words, for example: "im", "youre", "id", "dont", "cant", "didnt", "ive", "ill", "hasnt". Show the 20 most frequent terms and plot the wordcloud of 50 words again.
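A sketch of the stop word removal; the stopwords corpus has to be downloaded once:

nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
stop_words.update(["im", "youre", "id", "dont", "cant", "didnt", "ive", "ill", "hasnt"])

df["Preprocessed Lyrics"] = df["Preprocessed Lyrics"].apply(
    lambda text: " ".join(w for w in text.split() if w not in stop_words)
)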
18. We can apply stemming or lemmatization to our text data. Apply a lemmatizer from nltk and save the results.
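One option is the WordNetLemmatizer, which needs the wordnet data:

nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
df["Preprocessed Lyrics"] = df["Preprocessed Lyrics"].apply(
    lambda text: " ".join(lemmatizer.lemmatize(w) for w in text.split())
)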
And here is the code for stemming:
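from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# applied to a separate variable here so the lemmatized column stays intact
stemmed = df["Preprocessed Lyrics"].apply(
    lambda text: " ".join(stemmer.stem(w) for w in text.split())
)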
The PorterStemmer() is for the English language. If we are working with other languages, we can use other stemmers, such as the SnowballStemmer(), which supports:
from nltk.stem.snowball import SnowballStemmer
SnowballStemmer.languages
('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
19. Use CountVectorizer() from the sklearn package and build a bag-of-words model on Preprocessed Lyrics based on term frequency. Check the shape of the output matrix.
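A minimal sketch; dtm is short for document-term matrix:

from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
dtm = count_vectorizer.fit_transform(df["Preprocessed Lyrics"])
print(dtm.shape)  # (number of songs, vocabulary size)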
20. Inspect the first 100 terms in the vocabulary.
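For example:

# on scikit-learn versions before 1.0 this method is called get_feature_names()
print(count_vectorizer.get_feature_names_out()[:100])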
21. Using TfidfVectorizer(), you can create a model based on tf-idf. Apply this vectorizer to your text data. Does the shape of the output matrix differ from dtm?
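A sketch:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_dtm = tfidf_vectorizer.fit_transform(df["Preprocessed Lyrics"])
print(tfidf_dtm.shape)  # same shape as dtm; only the cell weights differ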
22. Use the TfidfVectorizer() to create an n-gram based model with n = 1 and 2. Use the ngram_range argument to determine the lower and upper boundary of the range of n-values for different n-grams to be extracted. (Tip: use ?TfidfVectorizer)
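For example:

ngram_vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
ngram_dtm = ngram_vectorizer.fit_transform(df["Preprocessed Lyrics"])
print(ngram_dtm.shape)  # the vocabulary dimension grows, since bigrams are added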
23. **We want to compare the lyrics of the Friends theme song with the lyrics of Taylor Swift's songs and find the most similar one. Use the string below. First, apply the pre-processing steps and then transform the text into count and tfidf vectors. Do the bag-of-words models agree on the most similar song to the Friends theme song?**
friends_theme_lyrics = "So no one told you life was going to be this way. Your job's a joke, you're broke, your love life's DOA. It's like you're always stuck in second gear, When it hasn't been your day, your week, your month, or even your year. But, I'll be there for you, when the rain starts to pour. I'll be there for you, like I've been there before. I'll be there for you, cause you're there for me too."