If your real data is as regular as your sample data, we can be quite efficient by noticing that reshaping a matrix really just means changing its dim attribute.
First, on very small data:
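To make the dim remark concrete, here is a minimal sketch. The vector x is made up for illustration; it only shows that a matrix is a vector with a dim attribute, and that filling with byrow = TRUE is the same as filling by column and then transposing.

```r
# A matrix in R is an atomic vector plus a "dim" attribute.
x <- c(0.1, 0.2, 0.3, 1.1, 1.2, 1.3)  # hypothetical values: 3 ids at each of 2 time points

# Setting dim turns the vector into a matrix; values are laid out column by column.
m <- x
dim(m) <- c(3, 2)
identical(m, matrix(x, nrow = 3))

# Filling with byrow = TRUE is equivalent to filling by column and transposing.
identical(matrix(x, ncol = 3, byrow = TRUE), t(m))
```

No values are rearranged per group here, which is why this kind of reshaping can be cheap when the data is already in a regular order.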
library(data.table)
library(microbenchmark)
library(tidyr)
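matrix_spread <- function(df1, key, value){
  # the remaining column identifies the output rows (assumes exactly 3 columns)
  row_id <- setdiff(names(df1), c(key, value))
  unique_ids <- unique(df1[[key]])
  # key varies fastest in the sorted data, so fill row by row: one row per row_id value
  mat <- matrix(df1[[value]], ncol = length(unique_ids), byrow = TRUE)
  df2 <- data.frame(unique(df1[row_id]), mat)
  names(df2)[-1] <- paste0(value, ".", unique_ids)
  df2
}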
n <- 3
t1 <- 4
df1 <- expand.grid(id=1:n, tms=as.POSIXct(1:t1, origin="1970-01-01"))
df1$y <- rnorm(nrow(df1))
reshape(df1, idvar="tms", timevar="id", direction="wide")
#                    tms        y.1        y.2       y.3
# 1  1970-01-01 01:00:01  0.3518667  0.6350398 0.1624978
# 4  1970-01-01 01:00:02  0.3404974 -1.1023521 0.5699476
# 7  1970-01-01 01:00:03 -0.4142585  0.8194931 1.3857788
# 10 1970-01-01 01:00:04  0.3651138 -0.9867506 1.0920621
matrix_spread(df1, "id", "y")
#                    tms        y.1        y.2       y.3
# 1  1970-01-01 01:00:01  0.3518667  0.6350398 0.1624978
# 4  1970-01-01 01:00:02  0.3404974 -1.1023521 0.5699476
# 7  1970-01-01 01:00:03 -0.4142585  0.8194931 1.3857788
# 10 1970-01-01 01:00:04  0.3651138 -0.9867506 1.0920621
all.equal(check.attributes = FALSE,
reshape(df1, idvar="tms", timevar="id", direction="wide"),
matrix_spread (df1, "id", "y"))
# [1] TRUE
Then on bigger data (sorry, I can't afford to run a huge computation right now):
n <- 100
t1 <- 5000
df1 <- expand.grid(id=1:n, tms=as.POSIXct(1:t1, origin="1970-01-01"))
df1$y <- rnorm(nrow(df1))
DT1 <- as.data.table(df1)
microbenchmark(reshape=reshape(df1, idvar="tms", timevar="id", direction="wide"),
dcast=dcast(df1, tms ~ id, value.var="y"),
dcast.dt=dcast(DT1, tms ~ id, value.var="y"),
tidyr=spread(df1, id, y),
matrix_spread = matrix_spread(df1, "id", "y"),
times=3L)
# Unit: milliseconds
#           expr        min         lq       mean     median         uq        max neval
#        reshape 4197.08012 4240.59316 4260.58806 4284.10620 4292.34203 4300.57786     3
#          dcast   57.31247   78.16116   86.93874   99.00986  101.75189  104.49391     3
#       dcast.dt  114.66574  120.19246  127.51567  125.71919  133.94064  142.16209     3
#          tidyr   55.12626   63.91142   72.52421   72.69658   81.22319   89.74980     3
#  matrix_spread   15.00522   15.42655   17.45283   15.84788   18.67664   21.50539     3
Not too bad!
As for memory usage, I'd guess that if reshape handles it, my solution will too, provided you can work with my assumptions or preprocess the data to meet them:
- the data is sorted by tms, then by id (id varies fastest)
- there are only 3 columns
- every id value appears with every tms value (no missing combinations)
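
If the data does not already meet these assumptions, a preprocessing step along the following lines might help. This is only a sketch: prepare_for_matrix_spread is a hypothetical helper name, and it assumes the same three columns (id, tms, y) as the example above.

```r
# Hypothetical helper: check the assumptions, then sort the data for matrix_spread.
prepare_for_matrix_spread <- function(df, key = "id", row_id = "tms") {
  n_keys <- length(unique(df[[key]]))
  n_row_ids <- length(unique(df[[row_id]]))
  # every key value must appear exactly once for every row_id value
  stopifnot(
    nrow(df) == n_keys * n_row_ids,
    !anyDuplicated(df[c(row_id, key)])
  )
  # sort by row_id first, then key, so that key varies fastest
  df <- df[order(df[[row_id]], df[[key]]), ]
  rownames(df) <- NULL
  df
}

# Usage: shuffle a small complete data set, then restore the expected order.
set.seed(1)
toy <- expand.grid(id = 1:3, tms = as.POSIXct(1:4, origin = "1970-01-01"))
toy$y <- rnorm(nrow(toy))
shuffled <- toy[sample(nrow(toy)), ]
restored <- prepare_for_matrix_spread(shuffled)
all.equal(restored, toy, check.attributes = FALSE)
```

The check fails early with an error when a combination is missing or duplicated, rather than letting the matrix fill silently misalign values. Note that sorting itself costs time and memory, so it eats into the speed advantage on large data.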