How to speed up Gensim Word2vec model load time?

In recent gensim versions you can load a subset starting from the front of the file using the optional limit parameter to load_word2vec_format(). (The GoogleNews vectors seem to be in roughly most- to least-frequent order, so the first N are usually the subset you’d want. For example, limit=500000 loads the most-frequent 500,000 words’ vectors – still a fairly large vocabulary – and saves about 5/6ths of the memory and load time.)
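
For example (a minimal sketch, assuming the standard GoogleNews-vectors-negative300.bin file is in the working directory):

from gensim.models import KeyedVectors
# load only the first 500,000 (most-frequent) vectors from the word2vec.c-format file
wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, limit=500000)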

So that may help a bit. But if you’re re-loading for every web request, you’ll still be hurt by loading’s IO-bound speed, and by the redundant memory overhead of storing each re-loaded copy.

There are some tricks you can use in combination to help.

Note that after loading such vectors in their original word2vec.c-originated format, you can re-save them using gensim’s native save(). If you save them uncompressed, and the backing array is large enough (and the GoogleNews set is definitely large enough), the backing array gets dumped in a separate file in a raw binary format. That file can later be memory-mapped from disk, using gensim’s native load(filename, mmap='r') option.

Initially, this will make the load seem snappy – rather than reading all the array from disk, the OS will just map virtual address regions to disk data, so that some time later, when code accesses those memory locations, the necessary ranges will be read-from-disk. So far so good!

However, if you are doing typical operations like most_similar(), you’ll still face big lags, just a little later. That’s because this operation requires both an initial scan-and-calculation over all the vectors (on first call, to create unit-length-normalized vectors for every word), and then another scan-and-calculation over all the normed vectors (on every call, to find the N-most-similar vectors). Those full-scan accesses will page-into-RAM the whole array – again costing the couple-of-minutes of disk IO.
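
Roughly, the two scans look like this (a simplified numpy sketch of the idea, not gensim’s actual code; the random array stands in for the real vectors):

import numpy as np
# stand-in for the model's (vocab_size x dims) vector array
vectors = np.random.rand(1000, 300).astype(np.float32)
# scan 1 (first call only): unit-normalize every row
vectors_norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
# scan 2 (every call): dot-product the query against every normed row, then sort
query = vectors_norm[0]
similarities = vectors_norm @ query
top_n = np.argsort(-similarities)[:10]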

What you want is to avoid redundantly doing that unit-normalization, and to pay the IO cost just once. That requires keeping the vectors in memory for re-use by all subsequent web requests (or even multiple parallel web requests). Fortunately memory-mapping can also help here, albeit with a few extra prep steps.

First, load the word2vec.c-format vectors, with load_word2vec_format(). Then, use model.init_sims(replace=True) to force the unit-normalization, destructively in-place (clobbering the non-normalized vectors).

Then, save the model to a new filename-prefix: model.save('GoogleNews-vectors-gensim-normed.bin'). (Note that this actually creates multiple files on disk that need to be kept together for the model to be re-loaded.)
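
Together, the one-time preparation might look like this (a sketch, assuming the original GoogleNews binary file is in the working directory):

from gensim.models import KeyedVectors
# one-time preparation, run once outside any web request
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.init_sims(replace=True)  # destructively replace raw vectors with their unit-normed versions
model.save('GoogleNews-vectors-gensim-normed.bin')  # dumps the big array to a separate raw file alongside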

Now, we’ll make a short Python program that both memory-map-loads the vectors and forces the full array into memory. We also want this program to hang until externally terminated (keeping the mapping alive), and to be careful not to re-calculate the already-normed vectors. This requires another trick, because the loaded KeyedVectors don’t actually know that the vectors are normed. (Usually only the raw vectors are saved, and normed versions are re-calculated whenever needed.)

Roughly the following should work:

from gensim.models import KeyedVectors
from threading import Semaphore
model = KeyedVectors.load('GoogleNews-vectors-gensim-normed.bin', mmap='r')
model.syn0norm = model.syn0  # prevent recalc of normed vectors
model.most_similar('stuff')  # any word will do: just to page all in
Semaphore(0).acquire()  # just hang until process killed

This will still take a while, but only needs to be done once, before/outside any web requests. While the process is alive, the vectors stay mapped into memory. Further, unless/until there’s other virtual-memory pressure, the vectors should stay loaded in memory. That’s important for what’s next.

Finally, in your web request-handling code, you can now just do the following:

from gensim.models import KeyedVectors
model = KeyedVectors.load('GoogleNews-vectors-gensim-normed.bin', mmap='r')
model.syn0norm = model.syn0  # prevent recalc of normed vectors
# … plus whatever else you wanted to do with the model

Multiple processes can share read-only memory-mapped files. (That is, once the OS knows that file X is in RAM at a certain position, every other process that also wants a read-only mapped version of X will be directed to re-use that data, at that position.)

So this web-request load(), and any subsequent accesses, can all re-use the data that the prior process already brought into address-space and active memory. Operations requiring similarity-calcs against every vector will still take the time to access multiple GB of RAM, and do the calculations/sorting, but will no longer require extra disk-IO and redundant re-normalization.

If the system is facing other memory pressure, ranges of the array may fall out of memory until the next read pages them back in. And if the machine lacks the RAM to ever fully load the vectors, then every scan will require a mix of paging-in-and-out, and performance will be frustratingly bad no matter what. (In such a case: get more RAM or work with a smaller vector set.)

But if you do have enough RAM, this winds up making the original/natural load-and-use-directly code “just work” quite quickly, without an extra web-service interface, because the machine’s shared file-mapped memory functions as the service interface.
