sklearn
1. Upload the "book_reviews.csv" from your machine, following the Colab documentation. This file contains 10,000 English-language book reviews from Goodreads, with genre, age and star-rating labels. Uploading may take a minute or so.
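A minimal sketch of the upload step. It assumes you are inside a Colab runtime (where google.colab is available) and falls back to a local path otherwise; the file name is the one from the exercise.

```python
# Sketch: upload in Colab, or use a local copy outside Colab.
try:
    from google.colab import files  # only importable inside Colab
    uploaded = files.upload()       # opens a browser file picker
    csv_path = "book_reviews.csv"
except ImportError:
    # Running outside Colab: assume the file is already on disk.
    csv_path = "book_reviews.csv"
```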
2. Load the .csv file into a Pandas dataframe. This makes it easy to access and filter data.
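A minimal sketch of the loading step. pd.read_csv works the same on a path or a file-like object, so a small inline sample (with made-up rows, but the column names used in the later steps) stands in here for the real file; with the real file you would write pd.read_csv("book_reviews.csv").

```python
import io

import pandas as pd

# Tiny stand-in for book_reviews.csv; the rows are invented, the column
# names ('tokenised_text', 'rating_no') match the steps below.
sample = io.StringIO(
    "tokenised_text,genre,age,rating_no\n"
    "a gripping mystery,crime,adult,5\n"
    "slow start but lovely prose,literary,adult,4\n"
)
df = pd.read_csv(sample)
print(df.columns.tolist())
print(df.shape)
```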
3. Now you can construct the document-term matrix. The CountVectorizer class counts how often each word occurs in each document. Optionally, you can also pass ngram_range as a parameter, to see if combinations of multiple words are better predictors for ratings. Define the output of the fit_transform function on 'tokenised_text' as your feature matrix X, and the star ratings ('rating_no') as the variable y you're trying to predict. To inspect the words in the document-term matrix, you can call get_feature_names_out() on the vectorizer.
Alternatively, you could use a TfidfVectorizer: this class counts how often a word occurs in a document and weighs it against how often the word occurs in the whole corpus. This is a way to downweight words that are frequent but not very meaningful. You can experiment with different vectorizers to see how they affect your results.
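Swapping in the alternative vectorizer is a one-line change, since TfidfVectorizer has the same fit_transform interface; the toy texts here are invented stand-ins for the review column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for df['tokenised_text'].
texts = ["a gripping mystery", "a slow mystery", "gripping and lovely"]

# Same interface as CountVectorizer, but entries are tf-idf weights:
# words that appear in many documents (like "mystery" here) get
# downweighted relative to rarer, more distinctive words.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(texts)
print(X_tfidf.shape)
```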
4. Now we can define a baseline model: use the DummyClassifier to always predict the most frequent class in the dataset (here, the most frequent star rating).
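A minimal sketch of the baseline, with toy labels standing in for the star ratings. DummyClassifier ignores the features entirely, so any placeholder X works.

```python
from sklearn.dummy import DummyClassifier

# Toy labels standing in for the star ratings; 5 is the majority class.
y_toy = [5, 5, 4, 3, 5, 4]
X_toy = [[0]] * len(y_toy)  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
print(baseline.predict([[0]]))        # always the majority class
print(baseline.score(X_toy, y_toy))   # fraction held by that class
```

Any real classifier in step 6 should beat this score; if it doesn't, it has learned nothing useful from the text.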
5. After defining your document-term matrix, you can split the data into train and test sets. Note that random_state is used so that the split is the same for everyone in the group, so that different random selections don't cause slightly different results.
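The split step, sketched on toy data; X_demo and y_demo stand in for the document-term matrix X and the ratings y.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the document-term matrix X and the labels y.
X_demo = [[i] for i in range(10)]
y_demo = [0, 1] * 5

# A fixed random_state makes the split identical on every run and
# every machine, so the whole group sees the same train/test data.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 8 train rows, 2 test rows
```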
6. Now pick one of the following classifiers:
7. Find the parameters which lead to the best results. You can also automate this with GridSearchCV, as shown below.
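A hedged sketch of the grid search on synthetic data; LogisticRegression and the C grid are stand-ins for whichever classifier and parameters you picked in step 6.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data for the train split from step 5.
X_demo, y_demo = make_classification(n_samples=200, random_state=0)

# GridSearchCV tries every parameter combination with cross-validation
# and keeps the best one; substitute your own classifier and grid.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
)
grid.fit(X_demo, y_demo)
print(grid.best_params_, grid.best_score_)
```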
8. Try combining multiple classifiers, for instance with a VotingClassifier. Can you get a better result?
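One way to sketch this final step, again on synthetic stand-in data; the three member classifiers are example choices, not a prescribed combination.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; made non-negative because MultinomialNB
# (a natural fit for count features) requires non-negative inputs.
X_demo, y_demo = make_classification(n_samples=200, random_state=0)
X_demo = abs(X_demo)

# Hard voting: each member predicts a class, majority wins.
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", MultinomialNB()),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",
)
voter.fit(X_demo, y_demo)
print(voter.score(X_demo, y_demo))
```

Combining models with different error patterns is what makes voting pay off; three near-identical classifiers will rarely beat the best single one.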