A Comprehensive Python Implementation of GloVe

Train the full GloVe model on a single machine

Peng Yan
Towards Data Science


As an NLP data scientist, I frequently read papers on topics ranging from word vectors and RNNs to transformers. Reading papers is fun and gives me the illusion that I have mastered a wide range of techniques. But when it comes to reproducing them, difficulties emerge. As far as I know, many NLP learners have run into the same situation as me. Thus I decided to start a series of posts focusing on implementing classical NLP papers. I also created a GitHub repository for this effort.

This post is the first of this series, which reproduces the GloVe model based on the original paper. As stated before, the focus is purely on the implementation. For more information about the underlying theory, please refer to the original paper.

According to the paper, the GloVe model was trained on a single machine. The released code was written in C, which many NLP learners may find unfamiliar. So I carried out a comprehensive Python implementation of the model that keeps the original goal of training a huge vocabulary on a single machine. The following sections walk through the implementation details step by step. The full code is here.

Step 0: Preparation

Training Data

For this project, I use the Text8 Dataset as the training data. To get it, we can use the gensim downloader:

import gensim.downloader as api

dataset = api.load("text8")

The dataset is a list of lists, where each sublist is a list of words representing a sentence. We only want a list of all the words, so flatten it with itertools:

import itertools

corpus = list(itertools.chain.from_iterable(dataset))

Okay, now we have the training corpus.

Storing Parameters

When working on a machine learning model, there is always a wide range of parameters to configure, such as the data file path, batch size, word embedding size, etc. These parameters can incur lots of overhead if you don’t manage them well. Based on my experience, the best way is to store all of them in a single YAML file named config.yaml. In the code, also add a function that loads the configuration from this YAML file, like this:

Code snippet for loading configuration from config.yaml
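For reference, here is a minimal sketch of what such a loader might look like, assuming PyYAML is installed and config.yaml sits in the working directory (the actual snippet in the repo may differ):

import yaml
from types import SimpleNamespace

def load_config(filepath="config.yaml"):
    # Read the YAML file into a dict, then expose its keys as attributes
    # so they can be accessed as config.batch_size, config.learning_rate, etc.
    with open(filepath) as f:
        params = yaml.safe_load(f)
    return SimpleNamespace(**params)

config = load_config()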

Then for the rest of the code, we can refer to the parameters as config.batch_size and config.learning_rate instead of hard-coding values, which also makes the code cleaner.

And that’s all the preparation work needed. Let’s proceed with the actual two-step training of the GloVe model!

Step 1: Counting Cooccurring Pairs

Creating Vocabulary

For counting cooccurring pairs, we first need to determine the vocabulary. Here are some requirements for the vocabulary:

  • It is a set of tokens appearing in the corpus.
  • Each token is mapped to an integer.
  • If a token doesn’t belong to the vocabulary, it should be mapped to an unknown token, or “unk”.
  • For counting cooccurring pairs, only a subset of the tokens is needed, such as the top k most frequent tokens.

To fulfill these requirements in a structured manner, a Vocabulary class is created. This class has four fields:

  • token2index: A dict that maps a token to an index. The index starts from 0, and increments by 1 each time a previously unseen token is added.
  • index2token: A dict that maps an index to a token.
  • token_counts: A list where the ith value is the count of the token with index i.
  • _unk_token: An integer used as the index for unknown tokens. The default value is -1.

It also defines the following methods:

  • add(token): Add a new token into the vocabulary. If previously unseen, a new index is generated. The token’s count is updated as well.
  • get_index(token): Return the index of the token.
  • get_token(index): Return the token corresponding to the index.
  • get_topk_subset(k): Create a new vocabulary with the top k most frequent tokens.
  • shuffle(): Randomly shuffle all the tokens so that the mapping between tokens and indices is randomized. The reason why this method is needed will be uncovered later, when we actually count cooccurring pairs.

With this understanding in mind, we can now look at the code:

The Vocabulary class

For the class implementation, I make use of Python’s dataclass feature. With this feature, I only need to define the fields with type annotations, and the __init__() method is automatically generated for me. I can also set default values for fields when defining them. For example, token2index defaults to an empty dict by setting default_factory=dict. For more information about dataclass, please refer to the official documentation.
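To make this concrete, here is a simplified sketch of how the fields and a couple of the methods might be declared (the full version in the repo differs):

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Vocabulary:
    token2index: Dict[str, int] = field(default_factory=dict)
    index2token: Dict[int, str] = field(default_factory=dict)
    token_counts: List[int] = field(default_factory=list)
    _unk_token: int = -1

    def add(self, token):
        # Assign a new index to a previously unseen token, then update its count.
        if token not in self.token2index:
            index = len(self.token2index)
            self.token2index[token] = index
            self.index2token[index] = token
            self.token_counts.append(0)
        self.token_counts[self.token2index[token]] += 1

    def get_index(self, token):
        # Fall back to the unknown-token index for out-of-vocabulary tokens.
        return self.token2index.get(token, self._unk_token)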

Now that we have the Vocabulary class, the remaining question is: how do we use it? There are basically two use cases:

  • Create a vocabulary from the corpus, which consists of the top k most frequent tokens.
  • When counting cooccurring pairs, use the created vocabulary to convert the corpus, which is a list of tokens, into integer indices.

I create another class, Vectorizer, to coordinate these two use cases. It only has one field, vocab, which refers to the vocabulary created from the corpus. It has two methods:

  • from_corpus(corpus, vocab_size): This is a class method. First, a vocabulary is created by adding all the tokens in the corpus. Then the top vocab_size most frequent tokens are selected to create a new vocabulary. This vocabulary is shuffled and used to instantiate the Vectorizer instance. The reason for shuffling will be explained later.
  • vectorize(corpus): Convert the given corpus, which is a list of tokens, into a list of indices.

The full code is shown below:

The Vectorizer class
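As a rough sketch, the two methods could be implemented like this, assuming the Vocabulary API described above (the repo version includes more detail):

from dataclasses import dataclass

@dataclass
class Vectorizer:
    vocab: Vocabulary

    @classmethod
    def from_corpus(cls, corpus, vocab_size):
        # Count every token, keep the top vocab_size most frequent ones,
        # then shuffle so that indices are not ordered by frequency.
        vocab = Vocabulary()
        for token in corpus:
            vocab.add(token)
        top_vocab = vocab.get_topk_subset(vocab_size)
        top_vocab.shuffle()
        return cls(vocab=top_vocab)

    def vectorize(self, corpus):
        # Map each token to its index (unknown tokens get the unk index).
        return [self.vocab.get_index(token) for token in corpus]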

Scanning Context Windows

Now that we have the vectorizer to convert all words into indices, the remaining task is to scan all the context windows and count all possible cooccurring pairs. Since the cooccurrence matrix is sparse, it is reasonable to use a Counter to count the pairs. The key is (word i’s index, word j’s index), where word j appears in the context of word i. The value is a floating-point number representing the count. However, this strategy raises two issues.

Issue 1: If we count all cooccurring pairs in one scan, we will likely run out of memory, as the number of distinct (word i’s index, word j’s index) can be enormous.

Solution: Instead, we can count cooccurring pairs in multiple scans. In each scan, we limit word i’s index to a small range, so that the number of distinct pairs is greatly reduced. Let’s say the vocabulary has 100,000 distinct tokens. If we count all pairs in one scan, the number of distinct pairs can be as large as 10¹⁰. Instead, we can count all pairs in 10 scans. In the first scan, we limit word i’s index to be between 0 and 9999; in the second scan, between 10000 and 19999; in the third scan, between 20000 and 29999, and so on. After each scan finishes, we save the counts to disk. Now in each scan, the number of distinct pairs can be at most 10⁹, which is one tenth of the original number.

The idea behind this approach is that instead of calculating the whole cooccurrence matrix in one scan, we divide the matrix into 10 smaller rectangles and calculate them sequentially. The picture below visualizes the idea.

Left: count in one scan | Right: count in multiple scans

This approach is scalable in the sense that as the vocabulary size increases, we can always increase the number of scans to reduce memory usage. The main drawback is that the runtime also increases on a single machine. However, since there is no dependency between scans, they can be easily parallelized with Spark. But this is out of our scope.

Also, at this point, the reason for shuffling the vocabulary can be uncovered. When we create the vocabulary with the most frequent tokens, the indices of these tokens are ordered by frequency: index 0 corresponds to the most frequent token, index 1 to the second most frequent, and so on. If we continue with the example of 100,000 tokens, the first scan would count pairs for the 10,000 most frequent tokens, and the number of distinct pairs would be huge, while in the remaining scans the number of distinct pairs would be much smaller. This leads to memory usage imbalance between scans. By shuffling the vocabulary, the distinct pairs are distributed evenly across scans and the memory usage is balanced.
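Putting these ideas together, one scan of the partitioned counting could look roughly like this (a one-directional window and simplified bookkeeping for brevity, not the exact repo code):

from collections import Counter

def count_partition(vectorized_corpus, index_range, window_size):
    # Count (center, context) pairs, but only for center words whose index
    # falls inside this scan's partition range.
    counts = Counter()
    for pos, center in enumerate(vectorized_corpus):
        if center not in index_range:
            continue
        start = max(0, pos - window_size)
        for ctx_pos in range(start, pos):
            context = vectorized_corpus[ctx_pos]
            # As in the paper, a pair d words apart contributes 1/d to the count.
            counts[(center, context)] += 1.0 / (pos - ctx_pos)
    return counts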

Issue 2: Continuing from the solution to Issue 1, how do we save the counts of each scan to disk? The most obvious way is to write the (word i’s index, word j’s index, count) triplets into a text file shared between scans. But using this file later for training involves too much overhead.

Solution: There is a Python library, h5py, that provides a Pythonic interface to the HDF5 binary format. It enables you to store huge amounts of numerical data and easily manipulate them as if they were real NumPy arrays. For more detail about the library, check out its documentation.
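As an illustration, appending one scan’s counts to a resizable HDF5 dataset could look roughly like this (the dataset name and layout here are my assumptions, not necessarily the format used in the repo):

import h5py
import numpy as np

def append_counts(filepath, counts, chunk_size):
    # Flatten the Counter into an (N, 3) array of (index_i, index_j, count) rows
    # and append it to a resizable, chunked HDF5 dataset.
    rows = np.array([[i, j, c] for (i, j), c in counts.items()], dtype=np.float32)
    with h5py.File(filepath, "a") as f:
        if "cooccurrence" not in f:
            f.create_dataset(
                "cooccurrence", data=rows, maxshape=(None, 3), chunks=(chunk_size, 3)
            )
        else:
            dataset = f["cooccurrence"]
            dataset.resize(dataset.shape[0] + rows.shape[0], axis=0)
            dataset[-rows.shape[0]:] = rows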

As before, I create a CooccurrenceEntries class that does the counting and saves the result to disk using the proposed solutions. The class has two fields:

  • vectorizer: The Vectorizer instance created from the corpus.
  • vectorized_corpus: A list of word indices. This is the result of vectorizing the original corpus, which is a list of words, with the vectorizer.

It has two main methods:

  • setup(corpus, vectorizer): This is a class method used to create the CooccurrenceEntries instance. The vectorized_corpus is generated by calling vectorizer’s vectorize method on the corpus.
  • build(window_size, num_partitions, chunk_size, output_directory="."): This method counts the cooccurring pairs in num_partitions scans and writes the result to the output directory. The chunk_size argument is used to save the data in chunks in the HDF5 format. The reason for saving in chunks will be discussed in the model training section; in short, it makes generating training batches faster.

The implementation is as below:

The CooccurrenceEntries class

With the Vocabulary, Vectorizer, and CooccurrenceEntries abstractions, the code for counting cooccurring pairs and saving them to disk is simple:

Code snippet for creating training data
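In rough outline, the driver code chains the pieces together like this (the config keys are placeholders for values stored in config.yaml, not necessarily the names used in the repo):

vectorizer = Vectorizer.from_corpus(corpus=corpus, vocab_size=config.vocab_size)
cooccurrence = CooccurrenceEntries.setup(corpus=corpus, vectorizer=vectorizer)
cooccurrence.build(
    window_size=config.window_size,
    num_partitions=config.num_partitions,
    chunk_size=config.chunk_size,
    output_directory=config.cooccurrence_dir,
)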

Step 2: Training GloVe Model

Loading Batches from HDF5 Dataset

We first need to load data from the HDF5 dataset in batches. Since the data can be retrieved as if it were stored in a NumPy matrix, the easiest way is to use a PyTorch DataLoader directly. But then loading each batch involves many calls of the form dataset[i], where dataset is an h5py.Dataset instance. Each such call triggers an IO operation, so this can be extremely slow.

The workaround is to load the h5py.Dataset into memory chunk by chunk. Each loaded chunk is a pure NumPy ndarray in memory, so we can use PyTorch’s DataLoader to iterate over batches within it. Now the number of IO calls is equal to the number of chunks, which is significantly smaller.

One drawback of this approach is that a completely random shuffle is not possible, as batches containing data from different chunks will never be generated. To get more randomness, we can load the chunks in random order and set the DataLoader’s shuffle argument to True.

An HDF5DataLoader class is created for loading batches. It has five fields:

  • filepath: The path to the HDF5 file.
  • dataset_name: The name of the h5py.Dataset in the file.
  • batch_size: The training batch size.
  • device: The training device, either CPU or GPU.
  • dataset: The h5py.Dataset instance in the file.

And it has two methods:

  • open(): This method opens the HDF5 file and locates the dataset. The actual reading doesn’t happen here.
  • iter_batches(): This method loads the chunks in random order and creates PyTorch DataLoaders to iterate over batches within them.

The code is shown below. One thing to notice is that CooccurrenceDataset is simply a subclass of PyTorch’s Dataset used for indexing the data. It is omitted here as there is nothing special about it.

The HDF5DataLoader class
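To make the chunked iteration concrete, here is a condensed sketch of the core of iter_batches(), written as a free function and skipping CooccurrenceDataset for brevity (the actual class in the repo differs):

import numpy as np
import torch
from torch.utils.data import DataLoader

def iter_batches(dataset, batch_size, device):
    # dataset is an h5py.Dataset; visit its chunks in random order,
    # read each chunk into memory with a single IO call, then let a
    # regular DataLoader shuffle and batch the rows within the chunk.
    chunk_rows = dataset.chunks[0]
    chunk_starts = np.arange(0, dataset.shape[0], chunk_rows)
    np.random.shuffle(chunk_starts)
    for start in chunk_starts:
        chunk = torch.from_numpy(dataset[start:start + chunk_rows])
        loader = DataLoader(chunk, batch_size=batch_size, shuffle=True)
        for batch in loader:
            yield batch.to(device)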

Coding GloVe Model

Implementing the GloVe model with PyTorch is straightforward. We define the two weight matrices and the two bias vectors in __init__(). Notice that we set sparse=True when creating the embeddings, as each batch only updates a small fraction of the embedding rows, so the gradients are sparse by nature. In forward(), the average batch loss is returned.

The GloVe class
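For reference, here is a minimal sketch of such a module, using the weighted least-squares loss from the paper (the hyperparameters x_max and alpha follow the paper’s notation; the repo version may differ in detail):

import torch
import torch.nn as nn

class GloVe(nn.Module):
    def __init__(self, vocab_size, embedding_size, x_max=100, alpha=0.75):
        super().__init__()
        # Center-word and context-word embeddings, plus their bias terms.
        self.weight = nn.Embedding(vocab_size, embedding_size, sparse=True)
        self.weight_tilde = nn.Embedding(vocab_size, embedding_size, sparse=True)
        self.bias = nn.Embedding(vocab_size, 1, sparse=True)
        self.bias_tilde = nn.Embedding(vocab_size, 1, sparse=True)
        self.x_max = x_max
        self.alpha = alpha

    def forward(self, i, j, x_ij):
        # Weighted least-squares loss from the paper:
        # f(x_ij) * (w_i . w~_j + b_i + b~_j - log x_ij)^2, averaged over the batch.
        weighting = torch.clamp((x_ij / self.x_max) ** self.alpha, max=1.0)
        distance = (
            (self.weight(i) * self.weight_tilde(j)).sum(dim=1)
            + self.bias(i).squeeze(1)
            + self.bias_tilde(j).squeeze(1)
            - torch.log(x_ij)
        )
        return (weighting * distance ** 2).mean()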

Training GloVe Model

The model training follows the standard PyTorch training routine. The only difference is that instead of PyTorch’s DataLoader, we use the customized HDF5DataLoader to generate batches. Here is the training code:

Code snippet for model training
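A rough sketch of the loop is below. The constructor arguments, config keys, and optimizer choice are my assumptions (AdaGrad, as used in the paper, also happens to accept the sparse gradients produced by sparse=True embeddings):

import torch

device = torch.device(config.device)
data_loader = HDF5DataLoader(
    filepath=config.cooccurrence_file,
    dataset_name="cooccurrence",
    batch_size=config.batch_size,
    device=device,
)
data_loader.open()

model = GloVe(config.vocab_size, config.embedding_size).to(device)
optimizer = torch.optim.Adagrad(model.parameters(), lr=config.learning_rate)

for epoch in range(config.num_epochs):
    for batch in data_loader.iter_batches():
        # Each row of the batch holds (index_i, index_j, cooccurrence_count).
        i, j, x_ij = batch[:, 0].long(), batch[:, 1].long(), batch[:, 2]
        loss = model(i, j, x_ij)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()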

Phew, we have walked through the full implementation. Congratulations!

Next, let’s train the model and look at the results!

Step 3: Results

For the Text8 Dataset, training one epoch takes roughly 80 minutes. I trained the model for 20 epochs, which took more than one day to finish. The learning curve looks promising, and it seems the loss would decrease further if training continued.

A plot of the learning curve

We can also run some word similarity tasks to see how the word vectors behave. Here I make use of the KeyedVectors class from gensim, which lets you do this without writing your own nearest-neighbor or cosine-similarity code. The similarity evaluation code is here. For details about KeyedVectors, please refer to the documentation.
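As a rough sketch (assuming gensim 4.x, whose KeyedVectors exposes add_vectors and most_similar; the actual evaluation script linked above may differ), the trained embeddings can be wrapped like this:

from gensim.models import KeyedVectors

# As suggested in the paper, sum the center and context embeddings.
embeddings = (model.weight.weight + model.weight_tilde.weight).detach().cpu().numpy()
keyed_vectors = KeyedVectors(vector_size=embeddings.shape[1])
keyed_vectors.add_vectors(
    keys=[vectorizer.vocab.get_token(i) for i in range(embeddings.shape[0])],
    weights=embeddings,
)
print(keyed_vectors.most_similar("computer", topn=5))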

Running some simple similarity tasks shows the following results:

Results of some simple similarity tasks

As we can see, some of them make sense, like “computer” and “game” or “united” and “states”, and some of them do not. But that’s enough to make the point. Training on a larger dataset for more epochs should improve the results.

Summary

The GloVe paper is well written and easy to follow. Yet when it comes to implementation, there are plenty of pitfalls and difficulties along the way, especially when you take memory issues into consideration. With a decent amount of effort, we end up with a satisfying solution for training on a single machine.

As I said at the beginning, I will continue implementing more NLP papers and share my first-hand experience with you. I hope you enjoyed the GloVe implementation, and see you in the next post!
