Which data.table syntax for left join (one column) to prefer

I prefer the “update join” idiom for efficiency and maintainability:**

DT[WHERE, v := FROM[.SD, on=, x.v]]

It’s an extension of what is shown in vignette("datatable-reference-semantics") under “Update some rows of columns by reference – sub-assign by reference”. Once there is a vignette available on joins, that should also be a good reference.

This is efficient since it only uses the rows selected by WHERE and modifies or adds the column in-place, instead of making a new table like the more concise left join FROM[DT, on=].

It makes my code more readable since I can easily see that the point of the join is to add column v; and I don’t have to think through “left”https://stackoverflow.com/”right” jargon from SQL or whether the number of rows is preserved after the join.

It is useful for code maintenance since if I later want to find out how DT got a column named v, I can search my code for v :=, while FROM[DT, on=] obscures which new columns are being added. Also, it allows the WHERE condition, while the left join does not. This may be useful, for example, if using FROM to “fill” NAs in an existing column v.

Compared with the other update join approach DT[FROM, on=, v := i.v], I can think of two advantages. First is the option of using the WHERE clause, and second is transparency through warnings when there are problems with the join, like duplicate matches in FROM conditional on the on= rules. Here’s an illustration extending the OP’s example:

library(data.table)
A <- data.table(id = letters[1:10], amount = rnorm(10)^2)
B2 <- data.table(
  id = c("c", "d", "e", "e"), 
  ord = 1:4, 
  comment = c("big", "slow", "nice", "nooice")
)

# left-joiny update
A[B2, on=.(id), comment := i.comment, verbose=TRUE]
# Calculated ad hoc index in 0.000s elapsed (0.000s cpu) 
# Starting bmerge ...done in 0.000s elapsed (0.000s cpu) 
# Detected that j uses these columns: comment,i.comment 
# Assigning to 4 row subset of 10 rows

# my preferred update
A[, comment2 := B2[A, on=.(id), x.comment]]
# Warning message:
# In `[.data.table`(A, , `:=`(comment2, B2[A, on = .(id), x.comment])) :
#   Supplied 11 items to be assigned to 10 items of column 'comment2' (1 unused)

    id     amount comment comment2
 1:  a 0.20000990    <NA>     <NA>
 2:  b 1.42146573    <NA>     <NA>
 3:  c 0.73047544     big      big
 4:  d 0.04128676    slow     slow
 5:  e 0.82195377  nooice     nice
 6:  f 0.39013550    <NA>   nooice
 7:  g 0.27019768    <NA>     <NA>
 8:  h 0.36017876    <NA>     <NA>
 9:  i 1.81865721    <NA>     <NA>
10:  j 4.86711754    <NA>     <NA>

In the left-join-flavored update, you silently get the final value of comment even though there are two matches for id == "e"; while in the other update, you get a helpful warning message (upgraded to an error in a future release). Even turning on verbose=TRUE with the left-joiny approach is not informative — it says there are four rows being updated but doesn’t say that one row is being updated twice.

I find that this approach works best when my data is arranged into a set of tidy/relational tables. A good reference on that is Hadley Wickham’s paper.

** In this idiom, the on= part should be filled in with the join column names and rules, like on=.(id) or on=.(from_date >= dt_date). Further join rules can be passed with roll=, mult= and nomatch=. See ?data.table for details. Thanks to @RYoda for noting this point in the comments.

Here is a more complicated example from Matt Dowle explaining roll=: Find time to nearest occurrence of particular value for each row

Another related example: Left join using data.table

More Related Contents:

Leave a Comment Cancel reply