Merging two tables with millions of rows in Python

This is somewhat pseudocode-like, but I think it should be quite fast.

It is a straightforward disk-based merge, with all tables on disk. The
key is that you are not doing a selection per se, just indexing
into the table via start/stop, which is quite fast.
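
To make the start/stop idea concrete, here is a minimal sketch of how the row ranges could be generated. The helper name `chunk_bounds` is my own and not part of pandas or PyTables; each (start, stop) pair it yields would be passed to `HDFStore.select(..., start=..., stop=...)`.

```python
def chunk_bounds(nrows, chunk_size):
    """Yield (start, stop) row ranges that cover nrows without overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, nrows, chunk_size):
        # The last chunk is clipped so stop never exceeds nrows.
        yield start, min(start + chunk_size, nrows)


# 2.5 million rows in chunks of 1 million rows
bounds = list(chunk_bounds(2500000, 1000000))
```

Stepping the range by the chunk size avoids the empty trailing chunk you would otherwise get when the row count is an exact multiple of the chunk size.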

Selecting the rows in B that meet a criterion (using A's ids) won't
be very fast, because I think it may bring the data into Python space
rather than doing an in-kernel search. I am not sure, but you might want
to look at the in-kernel optimization section on pytables.org;
there is a way to tell whether a query will run in-kernel or not.

Also, if you are up to it, this is a very parallel problem. Just don't write
the results to the same file from multiple processes; PyTables is not write-safe for that.
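
One way to respect that constraint is to give every worker its own output file. The sketch below only plans the work; `plan_work` and the `result_N.h5` naming scheme are hypothetical, and how you launch the workers (multiprocessing, a job queue, etc.) is left open.

```python
from itertools import product


def plan_work(n_a_chunks, n_b_chunks, n_workers):
    """Assign every (a_chunk, b_chunk) pair to exactly one worker.

    Returns a dict mapping an output file name to a list of chunk pairs.
    The file naming scheme is an assumption made for illustration.
    """
    plan = {"result_%d.h5" % w: [] for w in range(n_workers)}
    names = list(plan)
    pairs = product(range(n_a_chunks), range(n_b_chunks))
    for i, pair in enumerate(pairs):
        # Round-robin assignment keeps the workloads roughly even.
        plan[names[i % n_workers]].append(pair)
    return plan


plan = plan_work(n_a_chunks=3, n_b_chunks=2, n_workers=2)
```

Each process would then open only its own file for writing and read A and B in read-only mode; the per-worker results can be combined afterwards in a single process.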

See this answer for a comment on how doing a join operation will actually be an ‘inner’ join.

For your merge_a_b operation I think you can use a standard pandas join
which is quite efficient (when in-memory).
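
As a rough illustration of what merge_a_b could look like for one pair of chunks, here is a small in-memory sketch. The key column name `id` and the toy data are assumptions; substitute whatever key your tables actually share.

```python
import pandas as pd

# Toy stand-ins for one chunk of A and one chunk of B.
a = pd.DataFrame({"id": [1, 2, 3], "x": [10.0, 20.0, 30.0]})
b = pd.DataFrame({"id": [2, 3, 4], "y": ["b2", "b3", "b4"]})


def merge_a_b(a, b):
    # Inner join on the shared key; rows without a match are dropped.
    return pd.merge(a, b, on="id", how="inner")


m = merge_a_b(a, b)
```

An inner join fits the chunked approach because every matching pair of rows lives in exactly one (A chunk, B chunk) combination, so appending the per-chunk results gives the full inner join. A left or outer join per chunk would instead emit "unmatched" rows that may well match in a different chunk.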

One other option (depending on how 'big' A is) might be to separate A into two pieces that are indexed the same, with a smaller table (maybe a single column) as the first piece. Instead of storing the merge results per se, store the row index; later you can pull out the data you need (kind of like using an indexer and take). See http://pandas.pydata.org/pandas-docs/stable/io.html#multiple-table-queries
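
The row-index idea can be sketched in memory like this. The column names `id`, `payload` and `a_row` are made up for the example; on disk, the narrow key table and the wide table would be separate tables that share the same row order.

```python
import pandas as pd

# Toy stand-ins; "id" is a hypothetical key column.
a = pd.DataFrame({"id": [5, 7, 9, 11],
                  "payload": ["p5", "p7", "p9", "p11"]})
b = pd.DataFrame({"id": [7, 11, 13]})

# Narrow version of A: just the key plus A's row position.
a_keys = pd.DataFrame({"id": a["id"].tolist(),
                       "a_row": list(range(len(a)))})

# Merge only the narrow frames and keep the row positions, not the data.
matches = pd.merge(a_keys, b[["id"]], on="id", how="inner")
a_rows = matches["a_row"].to_numpy()

# Later: pull the wide rows out by position.
selected = a.take(a_rows)
```

Only the integer positions need to be stored, which is much smaller than the merged frames. As far as I know, HDFStore.select also accepts row coordinates as its where argument, so the same pattern should carry over to the on-disk tables.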

# Chunked, disk-based merge of A and B, as described at the top.
import pandas as pd

A = pd.HDFStore('A.h5', mode='r')
B = pd.HDFStore('B.h5', mode='r')
# This is your result store (the file name is a placeholder)
store = pd.HDFStore('result.h5')

nrows_a = A.get_storer('df').nrows
nrows_b = B.get_storer('df').nrows
a_chunk_size = 1000000
b_chunk_size = 1000000

def merge_a_b(a, b):
    # Function that returns an operation on passed
    # frames, a and b.
    # It could be a merge, join, concat, or other operation that
    # results in a single frame.
    # Placeholder: inner join on the columns a and b share.
    return pd.merge(a, b)

for a_start_i in range(0, nrows_a, a_chunk_size):

    a_stop_i = min(a_start_i + a_chunk_size, nrows_a)
    a = A.select('df', start=a_start_i, stop=a_stop_i)

    for b_start_i in range(0, nrows_b, b_chunk_size):

        b_stop_i = min(b_start_i + b_chunk_size, nrows_b)
        b = B.select('df', start=b_start_i, stop=b_stop_i)

        m = merge_a_b(a, b)

        # Only write non-empty results
        if len(m):
            store.append('df_result', m)

A.close()
B.close()
store.close()
