How to do/workaround a conditional join in python Pandas?

Well, I can think of a few ways:

  1. essentially blow up the dataframe by just merging on the exact field (company)… then filter on the 30-day windows after the merge.
  • should be fast but could use lots of memory
  1. Move the merging and filtering on the 30-day window into a groupby().
  • results in a merge for each group, so slower but should use less memory

Option #1

Suppose your data looks like the following (I expanded your sample data):

print df

    company       date  measure
0         0 2010-01-01       10
1         0 2010-01-15       10
2         0 2010-02-01       10
3         0 2010-02-15       10
4         0 2010-03-01       10
5         0 2010-03-15       10
6         0 2010-04-01       10
7         1 2010-03-01        5
8         1 2010-03-15        5
9         1 2010-04-01        5
10        1 2010-04-15        5
11        1 2010-05-01        5
12        1 2010-05-15        5

print windows

   company   end_date
0        0 2010-02-01
1        0 2010-03-15
2        1 2010-04-01
3        1 2010-05-15

Create a beginning date for the 30 day windows:

windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
                       np.timedelta64(30,'D'))
print windows

   company   end_date   beg_date
0        0 2010-02-01 2010-01-02
1        0 2010-03-15 2010-02-13
2        1 2010-04-01 2010-03-02
3        1 2010-05-15 2010-04-15

Now do a merge and then select based on if date falls within beg_date and end_date:

df = df.merge(windows,on='company',how='left')
df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df

    company       date  measure   end_date   beg_date
2         0 2010-01-15       10 2010-02-01 2010-01-02
4         0 2010-02-01       10 2010-02-01 2010-01-02
7         0 2010-02-15       10 2010-03-15 2010-02-13
9         0 2010-03-01       10 2010-03-15 2010-02-13
11        0 2010-03-15       10 2010-03-15 2010-02-13
16        1 2010-03-15        5 2010-04-01 2010-03-02
18        1 2010-04-01        5 2010-04-01 2010-03-02
21        1 2010-04-15        5 2010-05-15 2010-04-15
23        1 2010-05-01        5 2010-05-15 2010-04-15
25        1 2010-05-15        5 2010-05-15 2010-04-15

You can compute the 30 day window sums by grouping on company and end_date:

print df.groupby(['company','end_date']).sum()

                    measure
company end_date           
0       2010-02-01       20
        2010-03-15       30
1       2010-04-01       10
        2010-05-15       15

Option #2 Move all merging into a groupby. This should be better on memory but I would think much slower:

windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
                       np.timedelta64(30,'D'))

def cond_merge(g,windows):
    g = g.merge(windows,on='company',how='left')
    g = g[(g.date >= g.beg_date) & (g.date <= g.end_date)]
    return g.groupby('end_date')['measure'].sum()

print df.groupby('company').apply(cond_merge,windows)

company  end_date  
0        2010-02-01    20
         2010-03-15    30
1        2010-04-01    10
         2010-05-15    15

Another option Now if your windows never overlap (like in the example data), you could do something like the following as an alternative that doesn’t blow up a dataframe but is pretty fast:

windows['date'] = windows['end_date']

df = df.merge(windows,on=['company','date'],how='outer')
print df

    company       date  measure   end_date
0         0 2010-01-01       10        NaT
1         0 2010-01-15       10        NaT
2         0 2010-02-01       10 2010-02-01
3         0 2010-02-15       10        NaT
4         0 2010-03-01       10        NaT
5         0 2010-03-15       10 2010-03-15
6         0 2010-04-01       10        NaT
7         1 2010-03-01        5        NaT
8         1 2010-03-15        5        NaT
9         1 2010-04-01        5 2010-04-01
10        1 2010-04-15        5        NaT
11        1 2010-05-01        5        NaT
12        1 2010-05-15        5 2010-05-15

This merge essentially inserts your window end dates into the dataframe and then backfilling the end dates (by group) will give you a structure to easily create you summation windows:

df['end_date'] = df.groupby('company')['end_date'].apply(lambda x: x.bfill())

print df

    company       date  measure   end_date
0         0 2010-01-01       10 2010-02-01
1         0 2010-01-15       10 2010-02-01
2         0 2010-02-01       10 2010-02-01
3         0 2010-02-15       10 2010-03-15
4         0 2010-03-01       10 2010-03-15
5         0 2010-03-15       10 2010-03-15
6         0 2010-04-01       10        NaT
7         1 2010-03-01        5 2010-04-01
8         1 2010-03-15        5 2010-04-01
9         1 2010-04-01        5 2010-04-01
10        1 2010-04-15        5 2010-05-15
11        1 2010-05-01        5 2010-05-15
12        1 2010-05-15        5 2010-05-15

df = df[df.end_date.notnull()]
df['beg_date'] = (df['end_date'].values.astype('datetime64[D]') -
                   np.timedelta64(30,'D'))

print df

   company       date  measure   end_date   beg_date
0         0 2010-01-01       10 2010-02-01 2010-01-02
1         0 2010-01-15       10 2010-02-01 2010-01-02
2         0 2010-02-01       10 2010-02-01 2010-01-02
3         0 2010-02-15       10 2010-03-15 2010-02-13
4         0 2010-03-01       10 2010-03-15 2010-02-13
5         0 2010-03-15       10 2010-03-15 2010-02-13
7         1 2010-03-01        5 2010-04-01 2010-03-02
8         1 2010-03-15        5 2010-04-01 2010-03-02
9         1 2010-04-01        5 2010-04-01 2010-03-02
10        1 2010-04-15        5 2010-05-15 2010-04-15
11        1 2010-05-01        5 2010-05-15 2010-04-15
12        1 2010-05-15        5 2010-05-15 2010-04-15

df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df.groupby(['company','end_date']).sum()

                    measure
company end_date           
0       2010-02-01       20
        2010-03-15       30
1       2010-04-01       10
        2010-05-15       15

Another alternative is to resample your first dataframe to daily data and then compute rolling_sums with a 30 day window; and select the dates at the end that you are interested in. This could be quite memory intensive too.

Leave a Comment