Practical 5: Word Embeddings¶

Dong Nguyen¶


Applied Text Mining - Utrecht Summer School¶

Installation¶

If you're using Google Colab, then you should be able to run these commands directly. Otherwise, make sure you have sklearn, matplotlib and numpy installed.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

from sklearn.decomposition import PCA
from numpy import linalg as LA

First install the gensim library¶

In this practical session we're going to use the gensim library. This library offers a variety of methods to read in pre-trained word embeddings as well as train your own.

The website contains a lot of documentation, for example here: https://radimrehurek.com/gensim/auto_examples/index.html#documentation

If gensim isn't installed yet, you can use the following command:

In [ ]:
#!pip install gensim
In [2]:
from gensim.test.utils import datapath

Reading in a pre-trained model¶

First we load in a pre-trained GloVe model. Note: this can take around five minutes.

See https://github.com/RaRe-Technologies/gensim-data for an overview of the models you can try. For example

  • word2vec-google-news-300: word2vec trained on Google News. 1662 MB.
  • glove-twitter-200: GloVe trained on Twitter. 758 MB.

We're going to start with glove-wiki-gigaword-300 which is 376.1MB to download. These embeddings are trained on Wikipedia (2014) and the Gigaword corpus, a large collection of newswire text.

In [3]:
import gensim.downloader as api
wv = api.load('glove-wiki-gigaword-300')
[=================================================-] 99.7% 374.8/376.1MB downloaded

Exploring the vocabulary¶

How many words does the vocabulary contain?

Is 'utrecht' in the vocabulary?

Print a word embedding.

How many dimensions does this embedding have?
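One way to answer these questions (a minimal sketch, assuming gensim 4.x, where the vocabulary is exposed as wv.key_to_index):

In [ ]:
# Vocabulary size
print(len(wv.key_to_index))

# Is 'utrecht' in the vocabulary?
print('utrecht' in wv.key_to_index)

# Print a word embedding and its number of dimensions
print(wv['utrecht'])
print(wv['utrecht'].shape)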

Question: Explore the embeddings for a few other words. Can you find words that are not in the vocabulary?

(For example, think of uncommon words, misspellings, etc.)

Vector arithmetic¶

We can calculate the cosine similarity between two words in this way:

In [10]:
wv.similarity('university', 'student')
Out[10]:
0.5970514

Optional: cosine similarity is the same as the dot product between the normalized word embeddings

In [11]:
wv_university_norm = wv['university'] / LA.norm(wv['university'], 2)
wv_student_norm = wv['student'] / LA.norm(wv['student'], 2)

wv_university_norm.dot(wv_student_norm)
Out[11]:
0.5970514

A normalized embedding has an L2 norm (length) of 1.

In [12]:
LA.norm(wv_student_norm)
Out[12]:
1.0

Similarity analysis¶

Print the top 5 most similar words to 'car'.
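For example, a possible sketch using gensim's most_similar method:

In [ ]:
# Top 5 most similar words to 'car', ranked by cosine similarity
wv.most_similar('car', topn=5)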

Question: What are the top 5 most similar words to cat? And to king? And to fast? What kind of words often appear at the top?

Now calculate the similarities between the following word pairs:

  • buy, purchase
  • cat, dog
  • car, green
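A small sketch that loops over these pairs:

In [ ]:
pairs = [('buy', 'purchase'), ('cat', 'dog'), ('car', 'green')]
for w1, w2 in pairs:
    print(w1, w2, wv.similarity(w1, w2))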

We can calculate the cosine similarity for a list of word pairs and correlate these similarities with human ratings. One such dataset with human ratings is called WordSim353.

Go to https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/wordsim353.tsv to get a sense of the data.

Gensim already implements a method to evaluate a word embedding model using this data.

  • It calculates the cosine similarity between each word pair
  • It calculates both the Spearman and Pearson correlation coefficients between the cosine similarities and the human judgements

See https://radimrehurek.com/gensim/models/keyedvectors.html for a description of the methods.

In [17]:
wv.evaluate_word_pairs(datapath('wordsim353.tsv'))
Out[17]:
(PearsonRResult(statistic=0.6040760940127656, pvalue=1.752303459427883e-36),
 SignificanceResult(statistic=0.6085349998820805, pvalue=3.879629536780527e-37),
 0.0)

Analogies¶

Man is to woman as king is to ...?

This can be converted into vector arithmetic:

king - ? = man - woman.

king - man + woman = ?

france - paris + amsterdam = ?

Note that if we just retrieved the most similar words to 'amsterdam', we would get a different result.
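One way to compute such analogies is gensim's most_similar method with positive and negative word lists; the sketch below also prints the plain nearest neighbours of 'amsterdam' as a baseline:

In [ ]:
# king - man + woman = ?
print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=5))

# france - paris + amsterdam = ?
print(wv.most_similar(positive=['france', 'amsterdam'], negative=['paris'], topn=5))

# Baseline: the most similar words to 'amsterdam' on its own
print(wv.most_similar('amsterdam', topn=5))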

cat is to cats as girl is to ?

girl - ? = cat - cats
girl - cat + cats = ?

Compare against a baseline: what if we had just retrieved the most similar words to 'girl'?

Question: Try a few of your own analogies, do you get the expected answer?

Visualization¶

We can't visualize embeddings in their raw form because of their high dimensionality. However, we can use dimensionality reduction techniques such as PCA to project them onto a 2D space.
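A possible sketch, using the PCA and matplotlib imports from the start of the notebook (the word list is just an example):

In [ ]:
words = ['king', 'queen', 'man', 'woman', 'car', 'bus',
         'cat', 'dog', 'university', 'student']  # example words; add your own
vectors = np.array([wv[w] for w in words])

# Project the 300-dimensional embeddings onto 2 dimensions
pca = PCA(n_components=2)
points = pca.fit_transform(vectors)

plt.figure(figsize=(8, 6))
plt.scatter(points[:, 0], points[:, 1])
for word, (x, y) in zip(words, points):
    plt.annotate(word, (x, y))
plt.show()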

Question: What do you notice in this plot? Do the distances between the words make sense? Any surprises? Feel free to add your own words!

Biases¶

Is math more associated with male or female words?

Compute the average cosine similarity between the target word and the set of attribute words.
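One possible sketch; note that the male and female attribute word lists below are only illustrative choices, not a fixed test set:

In [ ]:
male_words = ['he', 'man', 'his', 'male', 'boy']          # example attribute set
female_words = ['she', 'woman', 'her', 'female', 'girl']  # example attribute set

def mean_similarity(target, attribute_words):
    # Average cosine similarity between the target word and the attribute words
    return np.mean([wv.similarity(target, w) for w in attribute_words])

print(mean_similarity('math', male_words))
print(mean_similarity('math', female_words))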

What about poetry?

Next¶

Repeat this analysis but now with the GloVe model trained on Twitter data, e.g. glove-twitter-50.

  • Which model obtains better performance on the word similarity task (WordSim353)?
  • What other differences do you observe? (e.g. think of the vocabulary, biases, etc.)
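To get started, you could load the Twitter model and rerun the WordSim353 evaluation like this (a sketch; the model is a separate download):

In [ ]:
wv_twitter = api.load('glove-twitter-50')
wv_twitter.evaluate_word_pairs(datapath('wordsim353.tsv'))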

FastText¶

(Only if you have time: downloading this model can take a while, around 11 minutes.)

Load in a fastText model

In [30]:
wv_f = api.load('fasttext-wiki-news-subwords-300')
[=================================================-] 100.0% 958.0/958.4MB downloaded

Usage is very similar to before. For example, we can calculate the similarity between two words:

  • dog, dogs
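For instance:

In [ ]:
wv_f.similarity('dog', 'dogs')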

Question: How does this compare to the similarity scores you obtained with the other models you tried?