Python frozenset hashing algorithm / implementation

The problem being solved is that the previous hash algorithm in Lib/sets.py had horrendous performance on datasets that arise in a number of graph algorithms (where nodes are represented as frozensets):

# Old-algorithm with bad performance

def _compute_hash(self):
    result = 0
    for elt in self:
        result ^= hash(elt)
    return result

def __hash__(self):
    if self._hashcode is None:
        self._hashcode = self._compute_hash()
    return self._hashcode

A new algorithm was created because it had much better performance. Here is an overview of the salient parts of the new algorithm:

1) The xor-equal in h ^= (hx ^ (hx << 16) ^ 89869747) * 3644798167 is necessary so that the algorithm is commutative (the hash does not depend on the order that set elements are encountered). Since sets has an unordered equality test, the hash for frozenset([10, 20]) needs to be the same as for frozenset([20, 10]).

2) The xor with89869747 was chosen for its interesting bit pattern 101010110110100110110110011 which is used to break-up sequences of nearby hash values prior to multiplying by 3644798167, a randomly chosen large prime with another interesting bit pattern.

3) The xor with hx << 16 was included so that the lower bits had two chances to affect the outcome (resulting in better dispersion of nearby hash values). In this, I was inspired by how CRC algorithms shuffled bits back on to themselves.

4) If I recall correctly, the only one of the constants that is special is 69069. It had some history from the world of linear congruential random number generators. See https://www.google.com/search?q=69069+rng for some references.

5) The final step of computing hash = hash * 69069U + 907133923UL was added to handle cases with nested frozensets and to make the algorithm disperse in a pattern orthogonal to the hash algorithms for other objects (strings, tuples, ints, etc).

6) Most of the other constants were randomly chosen large prime numbers.

Though I would like to claim divine inspiration for the hash algorithm, the reality was that I took a bunch of badly performing datasets, analyzed why their hashes weren’t dispersing, and then toyed with the algorithm until the collision statistics stopped being so embarrassing.

For example, here is an efficacy test from Lib/test/test_set.py that failed for algorithms with less diffusion:

def test_hash_effectiveness(self):
    n = 13
    hashvalues = set()
    addhashvalue = hashvalues.add
    elemmasks = [(i+1, 1<<i) for i in range(n)]
    for i in xrange(2**n):
        addhashvalue(hash(frozenset([e for e, m in elemmasks if m&i])))
    self.assertEqual(len(hashvalues), 2**n)

Other failing examples included powersets of strings and small integer ranges as well as the graph algorithms in the test suite: See TestGraphs.test_cuboctahedron and TestGraphs.test_cube in Lib/test/test_set.py.

More Related Contents:

Leave a Comment Cancel reply