Checking for NaN presence in a container

Question #1: why is NaN found in a container when it’s an identical object?

From the documentation:

For container types such as list, tuple, set, frozenset, dict, or
collections.deque, the expression x in y is equivalent to any(x is e
or x == e for e in y).

This is precisely what I observe with NaN, so everything is fine. Why this rule? I suspect it’s because a dict/set wants to honestly report that it contains a certain object if that object is actually in it (even if __eq__() for whatever reason chooses to report that the object is not equal to itself).
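To make the quoted rule concrete, here is a minimal sketch that spells it out as a helper and compares it with the built-in in operator. The helper name contains and the variable names are mine, chosen for illustration only.

```python
nan1 = float("nan")
nan2 = float("nan")  # a second, distinct NaN object


def contains(x, y):
    """Spell out the documented rule for built-in containers."""
    return any(x is e or x == e for e in y)


items = [nan1, 1.0]

# nan1 is found because the identity check succeeds, even though nan1 != nan1.
same_object_found = nan1 in items
# nan2 is not found: it is neither the same object nor equal to any element.
other_nan_found = nan2 in items
# The hand-written rule gives the same answers as the in operator.
rule_agrees = (contains(nan1, items), contains(nan2, items))

print(same_object_found, other_nan_found, rule_agrees)
```

The same two answers come back if items is a tuple or a set instead of a list.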

Question #2: why is the hash value for NaN the same as for 0 (as it was before Python 3.10)?

From the documentation:

Called by built-in function hash() and for operations on members of
hashed collections including set, frozenset, and dict. hash()
should return an integer. The only required property is that objects
which compare equal have the same hash value; it is advised to somehow
mix together (e.g. using exclusive or) the hash values for the
components of the object that also play a part in comparison of
objects.

Note that the requirement is only in one direction: objects that have the same hash do not have to be equal! At first I thought it was a typo, but then I realized that it is not. Hash collisions happen anyway, even with the default __hash__() (see an excellent explanation here). The containers handle collisions without any problem. They do, of course, ultimately compare the elements themselves (by identity first, then with the == operator), hence they can easily end up with multiple NaN values, as long as those are not the same object! Try this:

>>> nan1 = float('nan')
>>> nan2 = float('nan')
>>> d = {}
>>> d[nan1] = 1
>>> d[nan2] = 2
>>> d[nan1]
1
>>> d[nan2]
2
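The collision handling is not specific to NaN. As a small sketch, here is a hypothetical Key class (my own, not from the documentation) whose instances all return the same hash value; a dict still keeps unequal keys apart because it falls back on comparing them.

```python
class Key:
    """Hypothetical key type whose instances all share one hash value."""

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return 42  # deliberate collision: every Key hashes alike

    def __eq__(self, other):
        return isinstance(other, Key) and self.name == other.name


table = {Key("a"): 1, Key("b"): 2}

same_hash = hash(Key("a")) == hash(Key("b"))
# Equal hashes, unequal keys: both entries survive and both can be looked up.
lookups = (table[Key("a")], table[Key("b")])

print(same_hash, len(table), lookups)
```

A constant hash like this is legal but slow, since every lookup has to compare against all colliding keys.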

So everything works as documented. But… it’s very, very dangerous! How many people knew that multiple NaN values could live alongside each other in a dict? How many people would find this easy to debug?
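To show how this can bite in practice, here is a sketch that counts values with collections.Counter. The readings list is made-up data, and the result assumes CPython, where each float("nan") call returns a new object.

```python
from collections import Counter
import math

# Made-up sample data: three separate NaN objects and two equal floats.
readings = [float("nan"), 1.5, float("nan"), 1.5, float("nan")]

counts = Counter(readings)

# Each NaN is a distinct object that compares unequal to everything,
# so every NaN becomes its own key, while the two 1.5 values are merged.
nan_keys = [k for k in counts if math.isnan(k)]

# A freshly created NaN matches none of the stored keys.
fresh_nan_found = float("nan") in counts

print(len(counts), len(nan_keys), counts[1.5], fresh_nan_found)
```

One way around this, if NaN keys are unavoidable, is to filter them out with math.isnan() or map them to a single sentinel before they reach the dict.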

I would recommend making NaN an instance of a subclass of float that doesn’t support hashing and hence cannot be accidentally added to a set/dict. I’ll submit this to python-ideas.
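Roughly, what I have in mind could look like the sketch below. The class name UnhashableNaN is hypothetical and nothing like it exists in the standard library; setting __hash__ to None is the documented way to make instances of a class unhashable.

```python
import math


class UnhashableNaN(float):
    """Hypothetical NaN type that refuses to be hashed."""

    __hash__ = None  # instances raise TypeError when hashed

    def __new__(cls):
        return super().__new__(cls, "nan")


nan = UnhashableNaN()

try:
    {nan: 1}  # attempting to use it as a dict key
except TypeError as exc:
    error = exc
else:
    error = None

# It still behaves like a NaN and can live in unhashed containers.
still_nan = math.isnan(nan)
in_list = nan in [nan]

print(type(error).__name__, still_nan, in_list)
```

This only protects this particular object. float('nan') and the results of arithmetic would still be plain floats unless the language itself changed, which is why it would have to be a proposal and not a recipe.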

Finally, I found a mistake in the documentation here:

For user-defined classes which do not define __contains__() but do
define __iter__(), x in y is true if some value z with x == z is
produced while iterating over y. If an exception is raised during the
iteration, it is as if in raised that exception.

Lastly, the old-style iteration protocol is tried: if a class defines
__getitem__(), x in y is true if and only if there is a non-negative
integer index i such that x == y[i], and all lower integer indices do
not raise IndexError exception. (If any other exception is raised, it
is as if in raised that exception).

You may notice that there is no mention of is here, unlike with built-in containers. I was surprised by this, so I tried:

>>> nan1 = float('nan')
>>> nan2 = float('nan')
>>> class Cont:
...   def __iter__(self):
...     yield nan1
...
>>> c = Cont()
>>> nan1 in c
True
>>> nan2 in c
False

As you can see, identity is checked first, before ==, which is consistent with the built-in containers. I’ll submit a report to fix the docs.
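The second quoted paragraph, about the old-style __getitem__() protocol, can be checked the same way. This is a sketch with a hypothetical OldStyle class; as far as I can tell, CPython applies the identity shortcut here as well, but treat that as an observation about CPython and not a documented guarantee.

```python
nan1 = float("nan")
nan2 = float("nan")  # a second, distinct NaN object


class OldStyle:
    """Hypothetical container that defines only __getitem__."""

    def __getitem__(self, index):
        if index == 0:
            return nan1
        raise IndexError(index)  # ends the old-style iteration


c = OldStyle()

# Found through identity, even though nan1 == c[0] is False.
same_object_found = nan1 in c
# Not found: a different object, and NaN never compares equal.
other_nan_found = nan2 in c

print(same_object_found, other_nan_found)
```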
