What methods can we use to reshape VERY large data sets?

If your real data is as regular as your sample data, we can be quite efficient by noticing that reshaping a matrix is really just changing its dim attribute.
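
To make the dim idea concrete, here is a minimal sketch on a toy vector (the values and the names y, m and wide are made up for illustration). It assumes the long data is ordered with id varying fastest within each time point, so setting dim and transposing gives the same result as filling a matrix by row:

```r
# Toy long-format values for 4 time points x 3 ids, id varying fastest.
# Each value encodes its position as 10 * time + id (illustrative only).
y <- c(11, 12, 13, 21, 22, 23, 31, 32, 33, 41, 42, 43)

m <- y
dim(m) <- c(3, 4)   # only the dim attribute changes: rows = ids, columns = time points
wide <- t(m)        # transpose so that rows = time points, columns = ids

# Same result as filling a matrix row by row
identical(wide, matrix(y, ncol = 3, byrow = TRUE))
# TRUE
```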

First, on very small data:

library(data.table)
library(microbenchmark)
library(tidyr)

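matrix_spread <- function(df1, key, value){
  # the one remaining column labels the rows of the wide result ("tms" here)
  row_id <- setdiff(names(df1), c(key, value))
  unique_ids <- unique(df1[[key]])
  # values are ordered by row_id, then key, so filling by row gives one row per row_id
  mat <- matrix(df1[[value]], ncol = length(unique_ids), byrow = TRUE)
  df2 <- data.frame(unique(df1[row_id]), mat)
  names(df2)[-1] <- paste0(value, ".", unique_ids)
  df2
}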

n <- 3      
t1 <- 4
df1 <- expand.grid(id=1:n, tms=as.POSIXct(1:t1, origin="1970-01-01"))
df1$y <- rnorm(nrow(df1))

reshape(df1, idvar="tms", timevar="id", direction="wide")
#                    tms        y.1        y.2       y.3
# 1  1970-01-01 01:00:01  0.3518667  0.6350398 0.1624978
# 4  1970-01-01 01:00:02  0.3404974 -1.1023521 0.5699476
# 7  1970-01-01 01:00:03 -0.4142585  0.8194931 1.3857788
# 10 1970-01-01 01:00:04  0.3651138 -0.9867506 1.0920621

matrix_spread(df1, "id", "y")
#                    tms        y.1        y.2       y.3
# 1  1970-01-01 01:00:01  0.3518667  0.6350398 0.1624978
# 4  1970-01-01 01:00:02  0.3404974 -1.1023521 0.5699476
# 7  1970-01-01 01:00:03 -0.4142585  0.8194931 1.3857788
# 10 1970-01-01 01:00:04  0.3651138 -0.9867506 1.0920621

all.equal(check.attributes = FALSE,
          reshape(df1, idvar="tms", timevar="id", direction="wide"),
          matrix_spread (df1, "id", "y"))
# TRUE

Then on bigger data:

(Sorry, I can't afford to run a huge computation right now.)

n <- 100      
t1 <- 5000

df1 <- expand.grid(id=1:n, tms=as.POSIXct(1:t1, origin="1970-01-01"))
df1$y <- rnorm(nrow(df1))

DT1 <- as.data.table(df1)

microbenchmark(reshape=reshape(df1, idvar="tms", timevar="id", direction="wide"),
               dcast=dcast(df1, tms ~ id, value.var="y"),
               dcast.dt=dcast(DT1, tms ~ id, value.var="y"),
               tidyr=spread(df1, id, y),
               matrix_spread = matrix_spread(df1, "id", "y"),
               times=3L)

# Unit: milliseconds
# expr                 min         lq       mean     median         uq        max neval
# reshape       4197.08012 4240.59316 4260.58806 4284.10620 4292.34203 4300.57786     3
# dcast           57.31247   78.16116   86.93874   99.00986  101.75189  104.49391     3
# dcast.dt       114.66574  120.19246  127.51567  125.71919  133.94064  142.16209     3
# tidyr           55.12626   63.91142   72.52421   72.69658   81.22319   89.74980     3
# matrix_spread   15.00522   15.42655   17.45283   15.84788   18.67664   21.50539     3 

Not too bad!

As for memory usage, I would guess that if reshape can handle your data, my solution can too, provided you can work with my assumptions or preprocess the data to meet them:

  • the data is sorted by tms, then by id (the order expand.grid produces above)
  • there are only 3 columns
  • every id value has a row for every tms value
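
If your data does not already meet these assumptions, something like the following could be run first. This is only a sketch: prepare_for_spread is a hypothetical helper name, not part of any package, and it assumes you know which column plays the role of id (key), tms (time) and y (value). It keeps the three columns, sorts by time then key, and stops if any key/time combination is missing or duplicated.

```r
# Hypothetical helper (not from any package): enforce the three assumptions above.
prepare_for_spread <- function(df, key, value, time) {
  df <- df[, c(key, time, value)]              # keep only the 3 needed columns
  df <- df[order(df[[time]], df[[key]]), ]     # sort by time, then key
  rownames(df) <- NULL
  # a complete grid has exactly (number of keys) x (number of times) distinct rows
  n_expected <- length(unique(df[[key]])) * length(unique(df[[time]]))
  if (nrow(df) != n_expected || anyDuplicated(df[c(key, time)]) > 0) {
    stop("every ", key, " must appear exactly once for every ", time)
  }
  df
}

# Toy check: shuffle rows and columns of a regular grid, then restore the order
set.seed(1)
toy <- expand.grid(id = 1:3, tms = 1:4)
toy$y <- rnorm(nrow(toy))
shuffled <- toy[sample(nrow(toy)), c("y", "tms", "id")]
restored <- prepare_for_spread(shuffled, key = "id", value = "y", time = "tms")
```

The sort costs extra time and memory on very large data, so it is only worth doing when the input is not already ordered.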
