Fastest way to replace NAs in a large data.table

Here's a solution using data.table's := operator, building on Andrie and Ramnath's answers.

```r
require(data.table)  # v1.6.6
require(gdata)       # v2.8.2

set.seed(1)
dt1 = create_dt(2e5, 200, 0.1)
dim(dt1)
[1] 200000    200    # more columns than Ramnath's answer which had 5 not 200

f_andrie = function(dt) remove_na(dt)
f_gdata  = function(dt, un = 0) gdata::NAToUnknown(dt, un)
f_dowle  = function(dt)
```
… Read more
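The excerpt above is cut off before the body of f_dowle. As a rough illustration of the by-reference idea the answer builds toward, here is a minimal sketch using data.table::set(); the helper name replace_na_with and the toy table are made up for this example, not taken from the answer.

```r
library(data.table)

# Hypothetical helper: replace NAs with `value` in every column, by reference
replace_na_with <- function(dt, value = 0) {
  for (j in seq_along(dt)) {
    set(dt, i = which(is.na(dt[[j]])), j = j, value = value)
  }
  invisible(dt)
}

dt <- data.table(a = c(1, NA, 3), b = c(NA, 5, 6))
replace_na_with(dt, 0)
# dt is modified in place: the NAs in columns a and b are now 0
```

Because set() updates each column in place, no copy of the (potentially large) table is made, which is what matters at the sizes being benchmarked.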

Getting the top values by group

From dplyr 1.0.0, "slice_min() and slice_max() select the rows with the minimum or maximum values of a variable, taking over from the confusing top_n()."

```r
d %>%
  group_by(grp) %>%
  slice_max(order_by = x, n = 5)
# # A tibble: 15 x 2
# # Groups:   grp [3]
#        x grp
#    <dbl> <fct>
#  1 0.994
```
… Read more
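The excerpt shows the call on the question's d data frame but not its construction. A self-contained version with made-up data (30 values in 3 groups, so taking 5 per group yields the 15-row result shown) might look like this:

```r
library(dplyr)

set.seed(1)
d <- tibble(
  x   = runif(30),
  grp = factor(rep(c("a", "b", "c"), each = 10))
)

# Five largest values of x within each group (ties are kept by default)
d %>%
  group_by(grp) %>%
  slice_max(order_by = x, n = 5)

# The mirror image: five smallest values of x per group
d %>%
  group_by(grp) %>%
  slice_min(order_by = x, n = 5)
```

Setting with_ties = FALSE returns exactly n rows per group even when values tie.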

data.table vs dplyr: can one do something well the other can’t or does poorly?

We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features. My intent is to cover each one of these as clearly as possible from a data.table perspective. Note: unless explicitly mentioned otherwise, by referring to dplyr we refer to dplyr’s data.frame … Read more
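As a small, made-up illustration of the syntax dimension only (this code is not from the answer itself), the same filtered, grouped aggregation in the two packages might look like this:

```r
library(data.table)
library(dplyr)

df <- data.frame(
  grp = rep(c("a", "b"), each = 5),
  val = 1:10
)

# data.table: one expression of the form DT[i, j, by]
DT <- as.data.table(df)
DT[val > 2, .(mean_val = mean(val)), by = grp]

# dplyr: a pipeline of single-purpose verbs
df %>%
  filter(val > 2) %>%
  group_by(grp) %>%
  summarise(mean_val = mean(val))
```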

How to create a lag variable within each group?

You could do this within data.table:

```r
library(data.table)
data[, lag.value := c(NA, value[-.N]), by = groups]
data
#    time groups       value   lag.value
# 1:    1      a  0.02779005          NA
# 2:    2      a  0.88029938  0.02779005
# 3:    3      a -1.69514201  0.88029938
# 4:    1      b -1.27560288          NA
# 5:    2      b -0.65976434 -1.27560288
# 6:    3      b -1.37804943 -0.65976434
# 7:    4      b  0.12041778 -1.37804943
```

For multiple columns: … Read more
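The excerpt stops at the multi-column case. One idiomatic way to lag several columns at once (a sketch, not necessarily the continuation of the quoted answer) is shift() with .SDcols; the value2 column below is made up for the example:

```r
library(data.table)

set.seed(1)
data <- data.table(
  time   = c(1:3, 1:4),
  groups = rep(c("a", "b"), c(3, 4)),
  value  = rnorm(7),
  value2 = rnorm(7)
)

# shift() returns a list of lagged vectors, one per column in .SDcols,
# so both lag columns are created in a single grouped assignment
cols     <- c("value", "value2")
lag_cols <- paste0("lag.", cols)
data[, (lag_cols) := shift(.SD, n = 1, type = "lag"),
     by = groups, .SDcols = cols]
data
```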