Using in operator with Pandas series [duplicate]

In the first case, 'Adam' in df['name'] evaluates to False because the in operator is interpreted as a call to df['name'].__contains__('Adam'). If you look at the implementation of __contains__ in pandas.Series, you will find the following (inherited from pandas.core.generic.NDFrame):

def __contains__(self, key):
    """True if the key is in the info axis"""
    return key in self._info_axis

So your first use of in is interpreted as:

'Adam' in df['name']._info_axis

This gives False, as expected, because df['name']._info_axis holds the index labels (here a RangeIndex), not the data itself:

In [37]: df['name']._info_axis 
Out[37]: RangeIndex(start=0, stop=3, step=1)

In [38]: list(df['name']._info_axis) 
Out[38]: [0, 1, 2]
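To make the label-versus-value distinction concrete, here is a minimal sketch. It assumes a frame shaped like the one in the question (a single name column with a default RangeIndex); the by_name Series is a hypothetical addition, used only to show that in succeeds once the names are the index labels.

```python
import pandas as pd

# Assumed reconstruction of the question's data: default RangeIndex 0..2.
df = pd.DataFrame({"name": ["Adam", "Ben", "Chris"]})
names = df["name"]

label_hit = 0 in names        # 0 is an index label
value_miss = "Adam" in names  # "Adam" is a value, not an index label

# Hypothetical Series where the names are the index labels instead.
by_name = pd.Series([1, 2, 3], index=["Adam", "Ben", "Chris"])
label_hit_by_name = "Adam" in by_name

print(label_hit, value_miss, label_hit_by_name)  # True False True
```

In other words, in on a Series behaves like in on a dict: it tests the keys (index labels), not the values.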

In the second case:

'Adam' in list(df['name'])

Calling list converts the pandas.Series to a list of its values, so the actual operation is:

In [42]: list(df['name'])
Out[42]: ['Adam', 'Ben', 'Chris']

In [43]: 'Adam' in ['Adam', 'Ben', 'Chris']
Out[43]: True
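If you want to keep using the in operator, the same idea works with any container of the values rather than the Series itself. The following is a small sketch, again assuming the question's frame; tolist and to_numpy are standard Series methods, and "Zed" is just an illustrative name that is not in the data.

```python
import pandas as pd

# Assumed reconstruction of the question's data.
df = pd.DataFrame({"name": ["Adam", "Ben", "Chris"]})

in_list = "Adam" in df["name"].tolist()     # membership in a plain Python list
in_array = "Adam" in df["name"].to_numpy()  # membership in the underlying array
missing = "Zed" in df["name"].tolist()      # a name that is not present

print(in_list, in_array, missing)
```

All three tests look at the values, so the index plays no part in the result.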

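Here are a few more idiomatic ways to do what you want (with the associated timings). Note that str.contains matches substrings (and treats the pattern as a regular expression by default), whereas isin and eq test for exact equality: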

In [56]: df.name.str.contains('Adam').any()
Out[56]: True

In [57]: timeit df.name.str.contains('Adam').any()
The slowest run took 6.25 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 144 µs per loop

In [58]: df.name.isin(['Adam']).any()
Out[58]: True

In [59]: timeit df.name.isin(['Adam']).any()
The slowest run took 5.13 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 191 µs per loop

In [60]: df.name.eq('Adam').any()
Out[60]: True

In [61]: timeit df.name.eq('Adam').any()
10000 loops, best of 3: 178 µs per loop

Note: the last approach was also suggested by @Wen in a comment above.
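The three approaches agree on the question's data, but they are not interchangeable in general. A minimal sketch of where they differ, using a hypothetical Series that contains "Adamson" but not "Adam":

```python
import pandas as pd

# Hypothetical data: "Adam" appears only as part of a longer name.
names = pd.Series(["Adamson", "Ben", "Chris"])

# Substring test; regex=False treats the pattern as a literal string.
substring_match = bool(names.str.contains("Adam", regex=False).any())

# Exact-equality tests.
exact_isin = bool(names.isin(["Adam"]).any())
exact_eq = bool(names.eq("Adam").any())

print(substring_match, exact_isin, exact_eq)  # True False False
```

So for an exact membership test, isin or eq is probably the safer choice; str.contains is the one to reach for when partial matches are wanted.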
