Pandas custom function to find whether it is the 1st, 2nd etc Monday, Tuesday, etc – all suggestions welcome

Try:

df['downum'] = df['date'].apply(downu)

and change downu to:

def downu(dtime):
  d = dtime.day      # use dtime.day rather than dtime.dt.day
  x = np.ceil(d/7)
  return int(x)      # cast to int since d/7 is float even after np.ceil()

df.apply() works on the whole df, i.e. all columns one by one in turn (column-wise). The index of each processing column Series is still the DataFrame index. Hence, the column label ‘date’ cannot be used as index to this intermediate Series being processed. You have to use apply() on the ‘date’ Series instead unless your downu() function can accept values of all columns and ignore irrelevant columns.

Edit:

Add alternate solutions using apply(... ,axis=1) and list(map(...))

Here are further solutions, some of which I think was OP’s originally attempted target way of coding. I will also discuss their pros and cons with respect to system performance (execution time) for large dataset and also for program readability (clarity).

Alternate solution 1:

%%timeit
df['downum'] = df.apply(lambda x: downu(x['date']), axis=1)

988 µs ± 8.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Alternate solution 2:

%%timeit
df['downum'] = df.apply(lambda x: downu(x.date), axis=1)

1.01 ms ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Alternate solution 3:

%%timeit
df['downum'] = list(map(downu, df['date']))

244 µs ± 3.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Original solution:

%%timeit 
df['downum'] = df['date'].apply(downu)

810 µs ± 484 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Alternate solutions 1 and 2 are supposed to be OP’s originally attempted target format of coding. The only difference is that axis=1 is added to make them workable. Now, with axis=1 added to the DataFrame.apply() function, downu(x['date'] and downu(x.date) can be used within the lambda function. Effectively, axis=1 modifies behavior of DataFrame.apply() function to allow the column index names be used. This can be better understood by thinking the function inside apply() is passed with a Series object, row-wise. The Series object has the original DataFrame column names/indices now become the Series index. Hence, you can access the elements in the same way as you access the Series elements by coding in format like series_obj[‘index’].

Comparison of solutions:

Comparing the execution time of the original solution (using pandas.Series.apply()) with the 2 alternate solutions using pandas.DataFrame.apply(... ,axis=1), the original solution is still a little bit faster. In terms of program readability, the original solution working on the df['date'] pandas Series is perceived to be simple and better.

In consideration of system performance, alternate solution 3 using list(map(...)) is 3x ~ 4x times faster than all other solutions. Note that this performance comparison result of DataFrame.apply(..., axis=1) vs list(map(..)) is generic rather than specific to this question. You can refer to this answer of the post How to apply a function to two columns of Pandas dataframe for a more in-depth discussion of the topic. Some other answers of that same post are also very useful for better understanding of the apply() function.

In summary, if the dataset is not large and system performance not a major consideration, use the original solution using pandas.Series.apply() in favor of program readability and clarity. Otherwise, for system performance consideration, using list(map(...)) is far superior to the pandas.DataFrame.apply(... ,axis=1) approach.

Leave a Comment