pandas.merge()
is the underlying function used for all merge/join behavior.
DataFrames provide the pandas.DataFrame.merge()
and pandas.DataFrame.join()
methods as a convenient way to access the capabilities of pandas.merge()
. For example, df1.merge(right=df2, ...)
is equivalent to pandas.merge(left=df1, right=df2, ...)
.
These are the main differences between df.join()
and df.merge()
:
- lookup on right table:
df1.join(df2)
always joins via the index ofdf2
, butdf1.merge(df2)
can join to one or more columns ofdf2
(default) or to the index ofdf2
(withright_index=True
). - lookup on left table: by default,
df1.join(df2)
uses the index ofdf1
anddf1.merge(df2)
uses column(s) ofdf1
. That can be overridden by specifyingdf1.join(df2, on=key_or_keys)
ordf1.merge(df2, left_index=True)
. - left vs inner join:
df1.join(df2)
does a left join by default (keeps all rows ofdf1
), butdf.merge
does an inner join by default (returns only matching rows ofdf1
anddf2
).
So, the generic approach is to use pandas.merge(df1, df2)
or df1.merge(df2)
. But for a number of common situations (keeping all rows of df1
and joining to an index in df2
), you can save some typing by using df1.join(df2)
instead.
Some notes on these issues from the documentation at http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging:
merge
is a function in the pandas namespace, and it is also
available as a DataFrame instance method, with the calling DataFrame
being implicitly considered the left object in the join.The related
DataFrame.join
method, usesmerge
internally for the
index-on-index and index-on-column(s) joins, but joins on indexes by
default rather than trying to join on common columns (the default
behavior formerge
). If you are joining on index, you may wish to
useDataFrame.join
to save yourself some typing.
…
These two function calls are completely equivalent:
left.join(right, on=key_or_keys) pd.merge(left, right, left_on=key_or_keys, right_index=True, how='left', sort=False)