To quote from the data.table
FAQ 1.11 What is the difference between X[Y]
and merge(X, Y)
?
X[Y]
is a join, looking up X’s rows using Y (or Y’s key if it has one) as an index.
Y[X]
is a join, looking up Y’s rows using X (or X’s key if it has one)
merge(X,Y)
does both ways at the same time. The number of rows ofX[Y]
andY[X]
usually differ, whereas the number of rows returned bymerge(X,Y)
andmerge(Y,X)
is the same.BUT that misses the main point. Most tasks require something to be done on the
data after a join or merge. Why merge all the columns of data, only to
use a small subset of them afterwards? You may suggest
merge(X[,ColsNeeded1],Y[,ColsNeeded2])
, but that requires the programmer to work out which columns are needed.X[Y,j
] in data.table does all that in one step for
you. When you writeX[Y,sum(foo*bar)]
, data.table automatically inspects thej
expression to see which columns it uses. It will only subset those columns only; the others are ignored. Memory is only created for the columns thej
uses, andY
columns enjoy standard R recycling rules within the context of each group. Let’s sayfoo
is inX
, and bar is inY
(along with 20 other columns inY
). Isn’tX[Y,sum(foo*bar)]
quicker to program and quicker to run than a merge of everything wastefully followed by a subset?
If you want a left outer join of X[Y]
le <- Y[X]
mallx <- merge(X, Y, all.x = T)
# the column order is different so change to be the same as `merge`
setcolorder(le, names(mallx))
identical(le, mallx)
# [1] TRUE
If you want a full outer join
# the unique values for the keys over both data sets
unique_keys <- unique(c(X[,t], Y[,t]))
Y[X[J(unique_keys)]]
## t b a
## 1: 1 NA 1
## 2: 2 NA 4
## 3: 3 9 9
## 4: 4 16 16
## 5: 5 25 NA
## 6: 6 36 NA
# The following will give the same with the column order X,Y
X[Y[J(unique_keys)]]