Fastest way to replace NAs in a large data.table

Here's a solution using data.table's := operator, building on Andrie and Ramnath's answers.

```r
require(data.table)  # v1.6.6
require(gdata)       # v2.8.2

set.seed(1)
dt1 = create_dt(2e5, 200, 0.1)
dim(dt1)
[1] 200000    200    # more columns than Ramnath's answer which had 5 not 200

f_andrie = function(dt) remove_na(dt)
f_gdata  = function(dt, un = 0) gdata::NAToUnknown(dt, un)
f_dowle  = function(dt)
```
… Read more
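The excerpt above is cut off before the body of f_dowle. As a rough illustration of the by-reference idea the answer builds toward, here is a minimal sketch using data.table::set(); the helper name replace_na_with and the toy table are made up for this example, not taken from the answer.

```r
library(data.table)

# Hypothetical helper: replace NAs with `value` in every column, by reference
replace_na_with <- function(dt, value = 0) {
  for (j in seq_along(dt)) {
    set(dt, i = which(is.na(dt[[j]])), j = j, value = value)
  }
  invisible(dt)
}

dt <- data.table(a = c(1, NA, 3), b = c(NA, 5, 6))
replace_na_with(dt, 0)
# dt is modified in place: the NAs in columns a and b are now 0
```

Because set() updates each column in place, no copy of the (potentially large) table is made, which is what matters at the sizes being benchmarked.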

Getting the top values by group

From dplyr 1.0.0, "slice_min() and slice_max() select the rows with the minimum or maximum values of a variable, taking over from the confusing top_n()."

```r
d %>%
  group_by(grp) %>%
  slice_max(order_by = x, n = 5)
# # A tibble: 15 x 2
# # Groups:   grp [3]
#        x grp
#    <dbl> <fct>
#  1 0.994
```
… Read more
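The excerpt shows the call on the question's d data frame but not its construction. A self-contained version with made-up data (30 values in 3 groups, so taking 5 per group yields the 15-row result shown) might look like this:

```r
library(dplyr)

set.seed(1)
d <- tibble(
  x   = runif(30),
  grp = factor(rep(c("a", "b", "c"), each = 10))
)

# Five largest values of x within each group (ties are kept by default)
d %>%
  group_by(grp) %>%
  slice_max(order_by = x, n = 5)

# The mirror image: five smallest values of x per group
d %>%
  group_by(grp) %>%
  slice_min(order_by = x, n = 5)
```

Setting with_ties = FALSE returns exactly n rows per group even when values tie.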

data.table vs dplyr: can one do something well the other can’t or does poorly?

We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features. My intent is to cover each one of these as clearly as possible from a data.table perspective. Note: unless explicitly mentioned otherwise, by referring to dplyr we refer to dplyr’s data.frame … Read more
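As a small, made-up illustration of the syntax dimension only (this code is not from the answer itself), the same filtered, grouped aggregation in the two packages might look like this:

```r
library(data.table)
library(dplyr)

df <- data.frame(
  grp = rep(c("a", "b"), each = 5),
  val = 1:10
)

# data.table: one expression of the form DT[i, j, by]
DT <- as.data.table(df)
DT[val > 2, .(mean_val = mean(val)), by = grp]

# dplyr: a pipeline of single-purpose verbs
df %>%
  filter(val > 2) %>%
  group_by(grp) %>%
  summarise(mean_val = mean(val))
```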

How to create a lag variable within each group?

You could do this within data.table:

```r
library(data.table)
data[, lag.value := c(NA, value[-.N]), by = groups]
data
#    time groups       value   lag.value
# 1:    1      a  0.02779005          NA
# 2:    2      a  0.88029938  0.02779005
# 3:    3      a -1.69514201  0.88029938
# 4:    1      b -1.27560288          NA
# 5:    2      b -0.65976434 -1.27560288
# 6:    3      b -1.37804943 -0.65976434
# 7:    4      b  0.12041778 -1.37804943
```

For multiple columns: … Read more
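The excerpt stops at the multi-column case. One idiomatic way to lag several columns at once (a sketch, not necessarily the continuation of the quoted answer) is shift() with .SDcols; the value2 column below is made up for the example:

```r
library(data.table)

set.seed(1)
data <- data.table(
  time   = c(1:3, 1:4),
  groups = rep(c("a", "b"), c(3, 4)),
  value  = rnorm(7),
  value2 = rnorm(7)
)

# shift() returns a list of lagged vectors, one per column in .SDcols,
# so both lag columns are created in a single grouped assignment
cols     <- c("value", "value2")
lag_cols <- paste0("lag.", cols)
data[, (lag_cols) := shift(.SD, n = 1, type = "lag"),
     by = groups, .SDcols = cols]
data
```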