Detect at least one match between each data frame row and values in a vector

This should be faster than an apply-based solution (despite its cryptic construction):

as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
[1] 1 0 1 0
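To unpack the trick: `as.matrix(df) %in% vec` returns a flat logical vector, the `dim<-` call restores the original matrix shape, and `rowSums` then counts matches per row. A minimal sketch, with a hypothetical `df` and `vec` assumed for illustration (the question's objects aren't shown here):

```r
# Hypothetical data, chosen to reproduce the 1 0 1 0 output above
df <- data.frame(x = c("a", "b", "c", "d"),
                 y = c("q", "r", "a", "s"),
                 stringsAsFactors = FALSE)
vec <- c("a", "z")

m <- as.matrix(df) %in% vec  # flat logical vector (column-major)
dim(m) <- dim(df)            # reshape back to a 4 x 2 logical matrix
rowSums(m)                   # matches per row: 1 0 1 0
as.numeric(rowSums(m) >= 1)  # 1 0 1 0
```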

Update — Some benchmarks

Here, we can make up some bigger data to test on. These benchmarks use 100,000 rows.

set.seed(1)
nrow <- 100000
ncol <- 10
vec <- c("a", "i", "s", "t", "z")
df <- data.frame(matrix(sample(c(letters, NA), nrow * ncol, TRUE),
                        nrow = nrow, ncol = ncol), stringsAsFactors = FALSE)

Here are the approaches we have so far:

AM <- function() as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
NR1 <- function() {
  apply(df, 1, function(x) {
    if (any(x %in% vec)) {
      1
    } else {
      0
    }
  })
}
NR2 <- function() apply(df, 1, function(x) any(x %in% vec) + 0)
NR3 <- function() apply(df, 1, function(x) as.numeric(any(x %in% vec)))
NR4 <- function() apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
NR5 <- function() apply(df, 1, function(x) cumprod(any(x %in% vec)))
RS1 <- function() as.numeric(grepl(paste(vec, collapse="|"), do.call(paste, df)))
RS2 <- function() as.numeric(seq(nrow(df)) %in% row(df)[unlist(df) %in% vec])
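The indexing in RS2 merits a note: `unlist(df)` flattens the data frame column-major, `row(df)` holds each cell's row index in the same order, so subsetting one by the other yields the row numbers that contain a match. A small sketch under made-up objects `d` and `v` (named differently so the benchmark data above stays untouched):

```r
# Hypothetical small data frame and lookup value for illustration
d <- data.frame(x = c("a", "b", "c"),
                y = c("q", "a", "r"),
                stringsAsFactors = FALSE)
v <- "a"

unlist(d) %in% v          # column-major: TRUE FALSE FALSE FALSE TRUE FALSE
row(d)[unlist(d) %in% v]  # row numbers containing a match: 1 2
as.numeric(seq(nrow(d)) %in% row(d)[unlist(d) %in% v])  # 1 1 0
```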

I suspect the NR functions will be a little slower:

system.time(NR1()) # Other NR functions are about the same
#    user  system elapsed 
#   1.172   0.000   1.196 

And, similarly, Richard’s second approach:

system.time(RS2())
#    user  system elapsed 
#   0.918   0.000   0.932 

That leaves the grepl approach (RS1) and the rowSums approach (AM) for the microbenchmark:

library(microbenchmark)
microbenchmark(AM(), RS1())
# Unit: milliseconds
#   expr       min       lq      mean    median       uq      max neval
#   AM()  65.75296  67.2527  92.03043  84.58111 102.3199 234.6114   100
#  RS1() 253.57360 256.6148 266.89640 260.18038 264.1531 385.6525   100
