finding close match from data frame 1 in data fame 2

You need a “non-equi” or “range” join. This is implemented in fuzzyjoin and data.table packages for R. Since it is also supported in SQL, one can also use sqldf.

Sadly, dplyr does not support this natively. Since this action is supported in SQL, if your data are in a database then dbplyr would allow it using its sql_on, but not natively.

First, we need to add in the 20% tolerance:

df_1$age_1_start <- df_1$age_1 * 0.8
df_1$age_1_end <- df_1$age_1 * 1.2
df_1
#   shift_1 level_1 age_1 length_1 age_1_start age_1_end
# 1       1       1   4.5      100        3.60      5.40
# 2       1       3   3.2      120        2.56      3.84
# 3       0       5   3.0        5        2.40      3.60
# 4       2       4   2.5       70        2.00      3.00

fuzzyjoin

fuzzyjoin::fuzzy_left_join(
  df_1, df_2,
  by = c("shift_1" = "shift_2", "level_1" = "level_2",
         "age_1_start" = "age_2", "age_1_end" = "age_2"),
  match_fun = list(`==`, `==`, `<=`, `>=`))
#   shift_1 level_1 age_1 length_1 age_1_start age_1_end shift_2 level_2 age_2 length_2
# 1       1       1   4.5      100        3.60      5.40      NA      NA    NA       NA
# 2       1       3   3.2      120        2.56      3.84       1       3   3.1      180
# 3       0       5   3.0        5        2.40      3.60      NA      NA    NA       NA
# 4       2       4   2.5       70        2.00      3.00       2       4   2.2       40

data.table

library(data.table)
DT_1 <- as.data.table(df_1) # must include age_1_start and age_1_end from above
DT_2 <- as.data.table(df_2)

DT_2[DT_1, on = .(shift_2 == shift_1, level_2 == level_1, age_2 >= age_1_start, age_2 <= age_1_end)]
#    shift_2 level_2 age_2 length_2 age_2.1 age_1 length_1
# 1:       1       1  3.60       NA    5.40   4.5      100
# 2:       1       3  2.56      180    3.84   3.2      120
# 3:       0       5  2.40       NA    3.60   3.0        5
# 4:       2       4  2.00       40    3.00   2.5       70

This package tends to rename the left (DT_1) join based on the right’s names, which may be frustrating. For this, you will need to do some cleanup afterwards.

sqldf

sqldf::sqldf(
  "select t1.*, t2.*
   from df_1 t1
     left join df_2 t2 on t1.shift_1 = t2.shift_2 and t1.level_1 = t2.level_2
       and t1.age_1_start <= t2.age_2 and t1.age_1_end >= t2.age_2")
#   shift_1 level_1 age_1 length_1 age_1_start age_1_end shift_2 level_2 age_2 length_2
# 1       1       1   4.5      100        3.60      5.40      NA      NA    NA       NA
# 2       1       3   3.2      120        2.56      3.84       1       3   3.1      180
# 3       0       5   3.0        5        2.40      3.60      NA      NA    NA       NA
# 4       2       4   2.5       70        2.00      3.00       2       4   2.2       40

If you know SQL, then the last might be the most intuitive and easiest to absorb. Keep in mind, though, that for larger frames, it is copying the entire frame into a memory-storage SQLite instance … which is not “free”.

The fuzzyjoin implementation gives you a lot of power, and its arguments seem (to me) to be easy to follow. The results are named as I would expect. However, it is the slowest (with this data) of the three implementations. (This should only be a concern if your real data is “very” large.)

If you don’t already know data.table, despite its blazing speed, its dialect of R can be a bit obscure to the uninformed. I believe it has as much power as fuzzyjoin, though I haven’t tested all corner-cases to see if one supports something the other does not.

bench::mark(
  fuzzyjoin = fuzzyjoin::fuzzy_left_join(
    df_1, df_2,
    by = c("shift_1" = "shift_2", "level_1" = "level_2",
           "age_1_start" = "age_2", "age_1_end" = "age_2"),
    match_fun = list(`==`, `==`, `<=`, `>=`)),
  data.table = DT_2[DT_1, on = .(shift_2 == shift_1, level_2 == level_1, age_2 >= age_1_start, age_2 <= age_1_end)],
  sqldf = sqldf::sqldf(
    "select t1.*, t2.*
     from df_1 t1
       left join df_2 t2 on t1.shift_1 = t2.shift_2 and t1.level_1 = t2.level_2
         and t1.age_1_start <= t2.age_2 and t1.age_1_end >= t2.age_2"),
  check = FALSE
)
# # A tibble: 3 x 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory        time      gc       
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>        <list>    <list>   
# 1 fuzzyjoin  134.12ms 143.24ms      6.98     107KB     6.98     2     2      286ms <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~
# 2 data.table   2.14ms   2.63ms    335.       114KB     2.06   163     1      487ms <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~
# 3 sqldf       21.14ms  22.72ms     42.9      184KB     4.52    19     2      442ms <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~

fuzzyjoin

data.table

sqldf

More Related Contents:

Leave a Comment Cancel reply