Fastest way to add rows for missing time steps?

Following up on the comments with Ben Barnes, and starting from his mydf3:

library(data.table)
DT = as.data.table(mydf3)
setkey(DT,Id,Time)
DT[CJ(unique(Id),seq(min(Time),max(Time)))]
      Id Time        Value Id2
 [1,]  1    1 -0.262482283   2
 [2,]  1    2 -1.423935165   2
 [3,]  1    3  0.500523295   1
 [4,]  1    4 -1.912687398   1
 [5,]  1    5 -1.459766444   2
 [6,]  1    6 -0.691736451   1
 [7,]  1    7           NA  NA
 [8,]  1    8  0.001041489   2
 [9,]  1    9  0.495820559   2
[10,]  1   10 -0.673167744   1
First 10 rows of 12800 printed. 

setkey(DT,Id,Id2,Time)
DT[CJ(unique(Id),unique(Id2),seq(min(Time),max(Time)))]
      Id Id2 Time      Value
 [1,]  1   1    1         NA
 [2,]  1   1    2         NA
 [3,]  1   1    3  0.5005233
 [4,]  1   1    4 -1.9126874
 [5,]  1   1    5         NA
 [6,]  1   1    6 -0.6917365
 [7,]  1   1    7         NA
 [8,]  1   1    8         NA
 [9,]  1   1    9         NA
[10,]  1   1   10 -0.6731677
First 10 rows of 25600 printed. 

CJ stands for Cross Join; see ?CJ. The missing time steps are padded with NA because nomatch defaults to NA. Set nomatch=0 to drop the unmatched rows instead. If you want the prevailing row (the most recent earlier observation) carried forward rather than NA padding, just add roll=TRUE. This can be more efficient than padding with NA and then filling the NAs afterwards. Rows that come before the first observation in a group stay NA, because there is no earlier row to roll from. See the description of roll in ?data.table.

setkey(DT,Id,Time)
DT[CJ(unique(Id),seq(min(Time),max(Time))),roll=TRUE]
      Id Time        Value Id2
 [1,]  1    1 -0.262482283   2
 [2,]  1    2 -1.423935165   2
 [3,]  1    3  0.500523295   1
 [4,]  1    4 -1.912687398   1
 [5,]  1    5 -1.459766444   2
 [6,]  1    6 -0.691736451   1
 [7,]  1    7 -0.691736451   1
 [8,]  1    8  0.001041489   2
 [9,]  1    9  0.495820559   2
[10,]  1   10 -0.673167744   1
First 10 rows of 12800 printed. 

setkey(DT,Id,Id2,Time)
DT[CJ(unique(Id),unique(Id2),seq(min(Time),max(Time))),roll=TRUE]
      Id Id2 Time      Value
 [1,]  1   1    1         NA
 [2,]  1   1    2         NA
 [3,]  1   1    3  0.5005233
 [4,]  1   1    4 -1.9126874
 [5,]  1   1    5 -1.9126874
 [6,]  1   1    6 -0.6917365
 [7,]  1   1    7 -0.6917365
 [8,]  1   1    8 -0.6917365
 [9,]  1   1    9 -0.6917365
[10,]  1   1   10 -0.6731677
First 10 rows of 25600 printed. 
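To compare the three behaviours side by side, here is a minimal sketch on a made-up table. The names toy and grid and all the values are assumptions chosen for illustration; they are not part of mydf3.

```r
library(data.table)

# Hypothetical toy data: Id 1 has no Time 3; Id 2 has no Time 1 or 4.
toy <- data.table(
  Id    = c(1L, 1L, 1L, 2L, 2L),
  Time  = c(1L, 2L, 4L, 2L, 3L),
  Value = c(10, 20, 40, 200, 300)
)
setkey(toy, Id, Time)

# Every Id crossed with every time step in the overall range.
grid <- CJ(Id = unique(toy$Id), Time = seq(min(toy$Time), max(toy$Time)))

padded  <- toy[grid]               # default nomatch: missing steps get NA
matched <- toy[grid, nomatch = 0]  # unmatched grid rows are dropped
rolled  <- toy[grid, roll = TRUE]  # last earlier Value is carried forward

nrow(padded)   # 8 (2 Ids x 4 time steps)
nrow(matched)  # 5 (only the rows that exist in toy)
rolled$Value   # 10 20 20 40 NA 200 300 300
```

Id 2 at Time 1 stays NA even with roll=TRUE, since there is nothing earlier in that group to roll forward.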

Instead of setting keys, you can use the on argument. CJ also takes a unique argument, which reduces each input vector to its unique values before the cross join. A small example with two values of ‘Id’:

d <- data.table(Id = rep(1:2, 4:3), Time = c(1, 2, 4, 5, 2, 3, 4), val = 1:7)

d[CJ(Id, Time = seq(min(Time), max(Time)), unique = TRUE), on = .(Id, Time)]
#     Id Time val
# 1:   1    1   1
# 2:   1    2   2
# 3:   1    3  NA
# 4:   1    4   3
# 5:   1    5   4
# 6:   2    1  NA
# 7:   2    2   5
# 8:   2    3   6
# 9:   2    4   7
# 10:  2    5  NA

In this particular case, where one of the vectors in CJ is generated with seq, that argument needs to be named explicitly (Time = ...) so that its column name matches the one given in on. Bare variables in CJ (like ‘Id’ here), on the other hand, are auto-named, as in data.table() (from data.table 1.12.2).
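If relying on auto-naming is undesirable (for example on a data.table older than 1.12.2), one option is to build the grid outside the join and name every column yourself. This is only a sketch of that variant; grid and res are illustrative names, not something from the original answer.

```r
library(data.table)

d <- data.table(Id = rep(1:2, 4:3), Time = c(1, 2, 4, 5, 2, 3, 4), val = 1:7)

# Name both columns explicitly so the grid never depends on auto-naming.
grid <- CJ(Id = d$Id, Time = seq(min(d$Time), max(d$Time)), unique = TRUE)
names(grid)  # "Id" "Time"

# Same join as above, with the grid passed in as a ready-made table.
res <- d[grid, on = .(Id, Time)]
nrow(res)    # 10
```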
