Improving performance of very large dictionary in Python

If I know the number of keys and exactly what those keys are, is there
any way in Python to make a dict (or a hashtable) work more
efficiently? I vaguely remember that if you know the keys, you can
design the hash function smartly (a perfect hash?) and allocate the
space beforehand.

Python doesn’t expose a pre-sizing option to speed up the “growth phase” of a dictionary, nor does it provide any direct controls over “placement” in the dictionary.

That said, if the keys are always known in advance, you can store them in a set and build your dictionaries from the set using dict.fromkeys(). That classmethod is optimized to pre-size the dictionary based on the set size, and it can populate the dictionary without any new calls to __hash__():

>>> keys = {'red', 'green', 'blue', 'yellow', 'orange', 'pink', 'black'}
>>> d = dict.fromkeys(keys)  # dict is pre-sized to 32 empty slots
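
The claim about __hash__() is easy to check on CPython. Below is a small experiment with a made-up CountedKey helper class that counts how often its instances are hashed: building the set hashes each key once, while dict.fromkeys() reuses the hashes already cached in the set:

    class CountedKey(str):
        """A str subclass that counts how often its instances are hashed."""
        hash_calls = 0
        def __hash__(self):
            type(self).hash_calls += 1
            return super().__hash__()

    keys = {CountedKey(name) for name in ('red', 'green', 'blue', 'yellow')}
    calls_after_building_set = CountedKey.hash_calls

    d = dict.fromkeys(keys)       # pre-sized; reuses the set's cached hashes
    print(CountedKey.hash_calls - calls_after_building_set)   # 0 on CPython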

If reducing collisions is your goal, you can run experiments on the insertion order in the dictionary to minimize pile-ups. (Take a look at Brent’s variation on Algorithm D in Knuth’s TAOCP to get an idea of how this is done).

By instrumenting a pure Python model for dictionaries (a simplified sketch follows below), it is possible to count the weighted-average number of probes for an alternative insertion order. For example, inserting dict.fromkeys([11100, 22200, 44400, 33300]) averages 1.75 probes per lookup. That beats the 2.25 average probes per lookup for dict.fromkeys([33300, 22200, 11100, 44400]).
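
To give a feel for what such an instrumented model looks like, here is a minimal sketch, assuming a fixed 8-slot table and a probe sequence modeled on older CPython dictionaries (first slot is hash & mask, later probes mix in a progressively shifted "perturb" value); the exact averages depend on those assumptions:

    def average_probes(keys, size=8):
        """Insert keys in the given order into a toy open-addressed table,
        then return the average number of slots probed to find each key."""
        mask = size - 1
        slots = [None] * size                    # the key table

        def probe_sequence(h):
            # First probe is hash & mask; later probes also mix in a
            # progressively right-shifted copy of the hash ("perturb").
            h &= 0xFFFFFFFFFFFFFFFF              # emulate an unsigned hash
            i = h & mask
            perturb = h
            while True:
                yield i & mask
                i = 5 * i + perturb + 1
                perturb >>= 5

        def find_slot(key):
            # Walk the probe sequence until the key or an empty slot turns up.
            for count, slot in enumerate(probe_sequence(hash(key)), 1):
                if slots[slot] is None or slots[slot] == key:
                    return slot, count

        for key in keys:
            slot, _ = find_slot(key)
            slots[slot] = key
        return sum(find_slot(key)[1] for key in keys) / len(keys)

    print(average_probes([11100, 22200, 44400, 33300]))   # 1.75 with this model
    print(average_probes([33300, 22200, 11100, 44400]))   # 2.25 with this model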

Another “trick” is to increase sparseness in a fully populated dictionary by fooling it into increasing its size without adding new keys:

    d = dict.fromkeys(['red', 'green', 'blue', 'yellow', 'orange'])
    d.update(dict(d))     # This makes room for additional keys and spreads
                          # the existing entries across a larger, sparser
                          # table, reducing collisions.
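
Whether that update actually triggers a resize depends on the interpreter's resize heuristics, so it is worth checking on your own Python version; sys.getsizeof() includes the dictionary's allocated table, so a simple before-and-after comparison shows whether the table grew:

    import sys

    d = dict.fromkeys(['red', 'green', 'blue', 'yellow', 'orange'])
    before = sys.getsizeof(d)
    d.update(dict(d))           # try to force a resize without adding keys
    after = sys.getsizeof(d)
    print(before, after)        # a larger "after" means more spare slots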

Lastly, you can give your keys a custom __hash__() method with the goal of eliminating all collisions (perhaps using a perfect-hash generator such as gperf).
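
To illustrate the mechanism only (gperf would emit an arithmetic hash function rather than the index lookup used here), here is a minimal sketch with a hypothetical PerfectKey class whose instances carry precomputed, mutually collision-free hash values for a fixed key set:

    KNOWN_KEYS = ('red', 'green', 'blue', 'yellow', 'orange', 'pink', 'black')

    class PerfectKey(str):
        """A key whose hash is its index in KNOWN_KEYS, so all hashes are
        small and distinct and these keys never collide with each other."""
        def __new__(cls, value):
            self = super().__new__(cls, value)
            self._precomputed_hash = KNOWN_KEYS.index(value)
            return self
        def __hash__(self):
            return self._precomputed_hash

    d = dict.fromkeys(PerfectKey(name) for name in KNOWN_KEYS)
    print(PerfectKey('red') in d)   # True
    print('red' in d)               # False: a plain str hashes differently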
