Pandas analogue to SQL MINUS / EXCEPT operator, using multiple columns

Execution time comparison for larger data sets:

In [100]: df1 = pd.concat([d1] * 10**5, ignore_index=True)

In [101]: df2 = pd.concat([d2] * 10**5, ignore_index=True)

In [102]: df1.shape
Out[102]: (700000, 3)

In [103]: df2.shape
Out[103]: (300000, 3)

pd.concat().drop_duplicates() approach:

In [10]: %%timeit
    ...: res = pd.concat([d1, pd.concat([d2]*2)]).drop_duplicates(['a', 'b'], keep=False)
    ...:
    ...:
2.59 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

multi-index NOT IS IN approach:

In [11]: %%timeit
    ...: res = df1[~df1.set_index(["a", "b"]).index.isin(df2.set_index(["a","b"]).index)]
    ...:
    ...:
484 ms ± 18.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

multi-index difference approach:

In [12]: %%timeit
    ...: tmp1 = df1.reset_index().set_index(["a", "b"])
    ...: idx = tmp1.index.difference(df2.set_index(["a","b"]).index)
    ...: res = df1.loc[tmp1.loc[idx, "index"]]
    ...:
    ...:
1.04 s ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

merge(how="outer") approach – gives me a MemoryError:

In [106]: %%timeit
     ...: res =  (df1.reset_index()
     ...:         .merge(df2, on=['a','b'], indicator=True, how='outer', suffixes=('','_'))
     ...:         .query('_merge == "left_only"')
     ...:         .set_index('index')
     ...:         .rename_axis(None)
     ...:         .reindex(df1.columns, axis=1))
     ...:
     ...:
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)

compare concatenated strings approach:

In [13]: %%timeit
    ...: res = df1[~df1[['a','b']].astype(str).sum(axis=1).isin(df2[['a','b']].astype(str).sum(axis=1))]
    ...:
    ...:
2.05 s ± 65.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Leave a Comment