Getting days since last occurence in Pandas DataFrame?

Here’s a NumPy approach –

def intervaled_cumsum(a, trigger_val=1, start_val = 0, invalid_specifier=-1):
    out = np.ones(a.size,dtype=int)    
    idx = np.flatnonzero(a==trigger_val)
    if len(idx)==0:
        return np.full(a.size,invalid_specifier)
    else:
        out[idx[0]] = -idx[0] + 1
        out[0] = start_val
        out[idx[1:]] = idx[:-1] - idx[1:] + 1
        np.cumsum(out, out=out)
        out[:idx[0]] = invalid_specifier
        return out

Few sample runs on array data to showcase the usage covering various scenarios of trigger and start values :

In [120]: a
Out[120]: array([0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0])

In [121]: p1 = intervaled_cumsum(a, trigger_val=1, start_val=0)
     ...: p2 = intervaled_cumsum(a, trigger_val=1, start_val=1)
     ...: p3 = intervaled_cumsum(a, trigger_val=0, start_val=0)
     ...: p4 = intervaled_cumsum(a, trigger_val=0, start_val=1)
     ...: 

In [122]: np.vstack(( a, p1, p2, p3, p4 ))
Out[122]: 
array([[ 0,  1,  1,  1,  0,  0,  1,  0,  0,  1,  1,  1,  1,  1,  0],
       [-1,  0,  0,  0,  1,  2,  0,  1,  2,  0,  0,  0,  0,  0,  1],
       [-1,  1,  1,  1,  2,  3,  1,  2,  3,  1,  1,  1,  1,  1,  2],
       [ 0,  1,  2,  3,  0,  0,  1,  0,  0,  1,  2,  3,  4,  5,  0],
       [ 1,  2,  3,  4,  1,  1,  2,  1,  1,  2,  3,  4,  5,  6,  1]])

Using it to solve our case :

df['Last_Occurence'] = intervaled_cumsum(df.Value.values)

Sample output –

In [181]: df
Out[181]: 
       Date  Value  Last_Occurence
0  01/01/17      0              -1
1  01/02/17      0              -1
2  01/03/17      1               0
3  01/04/17      0               1
4  01/05/17      0               2
5  01/06/17      0               3
6  01/07/17      1               0
7  01/08/17      0               1
8  01/09/17      0               2

Runtime test

Approaches –

# @Scott Boston's soln
def pandas_groupby(df):
    mask = df.Value.cumsum().replace(0,False).astype(bool)
    return df.assign(Last_Occurance=df.groupby(df.Value.astype(bool).\
                                    cumsum()).cumcount().where(mask))

# Proposed in this post
def numpy_based(df):
    df['Last_Occurence'] = intervaled_cumsum(df.Value.values)

Timings –

In [33]: df = pd.DataFrame((np.random.rand(10000000)>0.7).astype(int), columns=[['Value']])

In [34]: %timeit pandas_groupby(df)
1 loops, best of 3: 1.06 s per loop

In [35]: %timeit numpy_based(df)
10 loops, best of 3: 103 ms per loop

In [36]: df = pd.DataFrame((np.random.rand(100000000)>0.7).astype(int), columns=[['Value']])

In [37]: %timeit pandas_groupby(df)
1 loops, best of 3: 11.1 s per loop

In [38]: %timeit numpy_based(df)
1 loops, best of 3: 1.03 s per loop

More Related Contents:

Leave a Comment Cancel reply