Why is allow.cartesian required at times when joining data.tables with duplicate keys?

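You don’t have to avoid duplicate keys. As long as the result does not get bigger than max(nrow(x), nrow(i)), you won’t get this error, even if you have duplicates. (The current data.table documentation for allow.cartesian states the limit as nrow(x)+nrow(i) rather than the maximum.) It is basically a precautionary measure.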
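
As a minimal sketch of a join that has duplicate keys on both sides but stays under the limit, consider the toy tables below. The names small_x and small_i are made up for illustration and are not part of the original question.

```r
library(data.table)

# Hypothetical toy tables: both contain duplicate key values.
small_x <- data.table(k = rep(c("a", "b", "c"), each = 2), v = 1:6, key = "k")
small_i <- data.table(k = c("b", "b"), key = "k")

# Each of the 2 rows in small_i matches the 2 "b" rows in small_x,
# giving 2 * 2 = 4 rows. That is below nrow(small_x) = 6, so the join
# runs without allow.cartesian.
res <- small_x[small_i]
nrow(res)
# [1] 4
```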

When you’ve duplicate keys, the resulting join can sometimes get much bigger. Since data.table knows the total number of rows that’ll result from this join early enough, it provides this error message and asks you to use the argument allow.cartesian=TRUE if you’re really sure.
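
The following small sketch shows the check firing and then being overridden. It assumes the datatable.allow.cartesian option is at its default (FALSE); big_x and dup_i are hypothetical names.

```r
library(data.table)

# Hypothetical tables: 10 rows of key "b" joined against 3 duplicate "b" keys.
big_x <- data.table(k = rep("b", 10), v = 1:10, key = "k")
dup_i <- data.table(k = rep("b", 3), key = "k")

# The join would produce 3 * 10 = 30 rows, more than either table holds
# (and more than both combined), so data.table stops with an error.
attempt <- tryCatch(big_x[dup_i], error = function(e) e)
inherits(attempt, "error")
# [1] TRUE

# Opting in explicitly lets the join proceed.
nrow(big_x[dup_i, allow.cartesian = TRUE])
# [1] 30
```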

Here’s an (exaggerated) example that illustrates the idea behind this error message:

require(data.table)
DT1 <- data.table(x=rep(letters[1:2], c(1e2, 1e7)), 
                  y=1L, key="x")
DT2 <- data.table(x=rep("b", 3), key="x")

# not run
# DT1[DT2] ## error

dim(DT1[DT2, allow.cartesian=TRUE])
# [1] 30000000        2

The duplicates in DT2 resulted in 3 times the total number of “b” rows in DT1 (=1e7), i.e. 3e7 rows. Imagine if you performed the join with 1e4 such values in DT2: the result would explode to 1e11 rows! To avoid this, there’s the allow.cartesian argument, which by default is FALSE.
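
If you want to check the size of a join yourself before running it, one possible approach (a sketch, not how data.table does it internally) is to count rows per key on each side and multiply. The tables d1 and d2 below are scaled-down stand-ins for DT1 and DT2, and all helper names are made up for illustration.

```r
library(data.table)

# Scaled-down stand-ins for DT1 and DT2 (hypothetical sizes).
d1 <- data.table(x = rep(letters[1:2], c(10, 1000)), y = 1L, key = "x")
d2 <- data.table(x = rep("b", 3), key = "x")

# Count rows per key on each side, then pair up the keys both sides share.
cnt1 <- d1[, .(n_left = .N), by = x]
cnt2 <- d2[, .(n_right = .N), by = x]
shared <- merge(cnt1, cnt2, by = "x")

# Every shared key contributes n_left * n_right rows.
# as.numeric() guards against integer overflow on large tables.
matched <- sum(as.numeric(shared$n_left) * shared$n_right)

# Keys found only in d2 still give one NA-filled row each with the default nomatch.
unmatched <- sum(cnt2$n_right[!cnt2$x %in% cnt1$x])

predicted <- matched + unmatched
predicted
# [1] 3000

predicted == nrow(d1[d2, allow.cartesian = TRUE])
# [1] TRUE
```

Counting per key is cheap compared with the join itself, so this can be a reasonable sanity check before setting allow.cartesian=TRUE on large tables.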

That being said, I think Matt once mentioned that it may be possible to provide the error only in case of “large” joins (that is, joins that result in a huge number of rows, with the threshold set somewhat arbitrarily, I guess). This, when/if implemented, would let joins that don’t combinatorially explode run properly without this error message.
