Computing np.diff in Pandas after using groupby leads to unexpected result

Nice easy to reproduce example!! more questions should be like this!

Just pass a lambda to transform (this is tantamount to passing afuncton object, e.g. np.diff (or Series.diff) directly. So this equivalent to data1/data2

In [32]: data3['diffs'] = data3.groupby('ticker')['value'].transform(Series.diff)

In [34]: data3.sort_index(inplace=True)

In [25]: data3
Out[25]: 
         date    ticker     value     diffs
0  2013-10-03  ticker_2  0.435995  0.015627
1  2013-10-04  ticker_2  0.025926 -0.410069
2  2013-10-02  ticker_1  0.549662       NaN
3  2013-10-01  ticker_0  0.435322       NaN
4  2013-10-02  ticker_2  0.420368  0.120713
5  2013-10-03  ticker_0  0.330335 -0.288936
6  2013-10-04  ticker_1  0.204649 -0.345014
7  2013-10-02  ticker_0  0.619271  0.183949
8  2013-10-01  ticker_2  0.299655       NaN

[9 rows x 4 columns]

I believe that np.diff doesn’t follow numpy’s own unfunc guidelines to process array inputs (whereby it tries various methods to coerce input and send output, e.g. __array__ on input __array_wrap__ on output). I am not really sure why, see a bit more info here. So bottom line is that np.diff is not dealing with the index properly and doing its own calculation (which in this case is wrong).

Pandas has a lot of methods where they don’t just call the numpy function, mainly because they handle different dtypes, handle nans, and in this case, handle ‘special’ diffs. e.g. you can pass a time frequency to a datelike-index where it calculates how many n to actually diff.

Leave a Comment