If you're using Google Colab, then you should be able to run these commands directly. Otherwise, make sure you have sklearn, matplotlib and numpy installed.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from numpy import linalg as LA
gensim library
In this practical session we're going to use the gensim library. This library offers a variety of methods to read in pre-trained word embeddings as well as to train your own. The website contains a lot of documentation, for example here: https://radimrehurek.com/gensim/auto_examples/index.html#documentation
If gensim isn't installed yet, you can use the following command:
#!pip install gensim
from gensim.test.utils import datapath
First we load in a pre-trained GloVe model. Note: this can take around five minutes.
See https://github.com/RaRe-Technologies/gensim-data for an overview of the models you can try. We're going to start with glove-wiki-gigaword-300, which is 376.1MB to download. These embeddings are trained on Wikipedia (2014) and the Gigaword corpus, a large collection of newswire text.
import gensim.downloader as api
wv = api.load('glove-wiki-gigaword-300')
[=================================================-] 99.7% 374.8/376.1MB downloaded
How many words does the vocabulary contain?
Is 'utrecht' in the vocabulary?
Print a word embedding.
How many dimensions does this embedding have?
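One possible way to check these (a sketch, assuming a gensim 4.x KeyedVectors object; older gensim versions store the vocabulary under a different attribute):
len(wv.key_to_index)            # vocabulary size
'utrecht' in wv.key_to_index    # membership check
wv['university']                # print a word embedding
wv['university'].shape          # number of dimensions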
Question: Explore the embeddings for a few other words. Can you find words that are not in the vocabulary?
(For example, think of uncommon words, misspellings, etc.)
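For instance, you could test candidate strings like these (a sketch; whether they are actually out of vocabulary depends on the model):
'universiteit' in wv.key_to_index    # Dutch word, possibly out of vocabulary
'embeddingz' in wv.key_to_index      # made-up misspelling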
We can calculate the cosine similarity between two words in this way:
wv.similarity('university', 'student')
0.5970514
Optional: cosine similarity is the same as the dot product between the normalized word embeddings:
wv_university_norm = wv['university'] / LA.norm(wv['university'], 2)
wv_student_norm = wv['student'] / LA.norm(wv['student'], 2)
wv_university_norm.dot(wv_student_norm)
0.5970514
A normalized embedding has an L2 norm (length) of 1:
LA.norm(wv_student_norm)
1.0
Print the top 5 most similar words to car
Question: What are the top 5 most similar words to cat? And to king? And to fast? What kind of words often appear in the top?
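A sketch using gensim's most_similar method (topn controls how many neighbours are returned):
wv.most_similar('car', topn=5)
wv.most_similar('cat', topn=5)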
Now calculate the similarities between two words:
We can calculate the cosine similarity between a list of word pairs and correlate these with human ratings. One such dataset with human ratings is called WordSim353.
Go to https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/wordsim353.tsv to get a sense of the data.
Gensim already implements a method to evaluate a word embedding model using this data.
See https://radimrehurek.com/gensim/models/keyedvectors.html for a description of the methods.
wv.evaluate_word_pairs(datapath('wordsim353.tsv'))
(PearsonRResult(statistic=0.6040760940127656, pvalue=1.752303459427883e-36), SignificanceResult(statistic=0.6085349998820805, pvalue=3.879629536780527e-37), 0.0)
Man is to woman as king is to ...?
This can be converted into vector arithmetic:
king - ? = man - woman.
king - man + woman = ?
france - paris + amsterdam = ?
Note that if we just retrieved the most similar words to 'amsterdam', we would get a different result.
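In gensim such analogies can be expressed with most_similar, passing positive and negative word lists (a sketch; the top answer is not guaranteed to be the expected one):
wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=5)
wv.most_similar(positive=['france', 'amsterdam'], negative=['paris'], topn=5)
wv.most_similar('amsterdam', topn=5)   # for comparison: plain nearest neighbours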
cat is to cats as girl is to ?
girl - ? = cat - cats
girl - cat + cats = ?
Compare against a baseline: what if we had just retrieved the most similar words to 'girl'?
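The same pattern works here (a sketch):
wv.most_similar(positive=['girl', 'cats'], negative=['cat'], topn=5)
wv.most_similar('girl', topn=5)   # baseline: nearest neighbours of 'girl' alone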
Question: Try a few of your own analogies, do you get the expected answer?
We can't visualize embeddings in their raw format, because of their high dimensionality. However, we can use dimensionality reduction techniques such as PCA to project them onto a 2D space.
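A minimal sketch of such a projection; the word list below is just an illustrative choice:
words = ['king', 'queen', 'man', 'woman', 'car', 'bus', 'cat', 'dog']
vectors = np.array([wv[w] for w in words])
projected = PCA(n_components=2).fit_transform(vectors)   # 300 dimensions -> 2
plt.scatter(projected[:, 0], projected[:, 1])
for word, (x, y) in zip(words, projected):
    plt.annotate(word, (x, y))
plt.show()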
Question: What do you notice in this plot? Do the distances between the words make sense? Any surprises? Feel free to add your own words!
Is math more associated with male or female words?
Compute the average cosine similarity between the target word and the set of attribute words.
What about poetry?
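A sketch of this comparison; the attribute word lists below are only an illustrative choice, not a fixed standard:
male_words = ['he', 'him', 'man', 'boy', 'male']
female_words = ['she', 'her', 'woman', 'girl', 'female']

def mean_similarity(target, attributes):
    # average cosine similarity between the target word and each attribute word
    return np.mean([wv.similarity(target, a) for a in attributes])

mean_similarity('math', male_words), mean_similarity('math', female_words)
mean_similarity('poetry', male_words), mean_similarity('poetry', female_words)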
Repeat this analysis, but now with the GloVe model trained on Twitter data, e.g. glove-twitter-50 (only if you have time; this can take a while, around 11 minutes).
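Loading it works the same way as before (a sketch; note these vectors have only 50 dimensions):
wv_twitter = api.load('glove-twitter-50')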
Load in a fastText model
wv_f = api.load('fasttext-wiki-news-subwords-300')
[=================================================-] 100.0% 958.0/958.4MB downloaded
Usage is very similar to before. For example, we can calculate the similarity between two words:
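A sketch, reusing the word pair from earlier:
wv_f.similarity('university', 'student')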
Question: How does this compare to the similarity scores you obtained with the other models you tried?