Why is as.Date slow on a character vector?

As others mentioned, as.Date on a character vector goes through strptime (converting from character to POSIXlt), and that is the bottleneck here. Another simple solution is to use the fast_strptime function from the lubridate package instead.

Here’s what it looks like on my data:

> tables()
     NAME      NROW  MB COLS                                     
[1,] pp   3,718,339 126 session_id,date,user_id,path,num_sessions
     KEY         
[1,] user_id,date
Total: 126MB

> pp[, 2]
               date
      1: 2013-09-25
      2: 2013-09-25
      3: 2013-09-25
      4: 2013-09-25
      5: 2013-09-25
     ---           
3718335: 2013-09-25
3718336: 2013-09-25
3718337: 2013-09-25
3718338: 2013-10-11
3718339: 2013-10-11

> system.time(pp[, date := as.Date(fast_strptime(date, "%Y-%m-%d"))])
   user  system elapsed 
  0.315   0.026   0.344  

For comparison:

> system.time(pp[, date := as.Date(date, "%Y-%m-%d")])
   user  system elapsed 
108.193   0.399 108.844 

That’s ~316 times faster (108.844 s vs. 0.344 s elapsed)!
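If you want to try the comparison yourself without my table, here is a minimal sketch on synthetic data. The vector `x`, its size and the date range are made up for illustration and are not the data shown above, so the timings will differ from mine. It assumes lubridate is installed.

```r
library(lubridate)

# Synthetic input (an assumption, not the original table):
# 100,000 ISO-formatted dates stored as character
set.seed(1)
x <- format(as.Date("2013-01-01") + sample(0:364, 1e5, replace = TRUE))

# Base R route: as.Date() on character parses via strptime()
print(system.time(slow <- as.Date(x, format = "%Y-%m-%d")))

# lubridate route: fast_strptime() returns POSIXlt (UTC by default),
# which as.Date() then converts without re-parsing the strings
print(system.time(fast <- as.Date(fast_strptime(x, format = "%Y-%m-%d"))))

# Both routes should produce the same dates
stopifnot(all(slow == fast))
```

Note that fast_strptime only handles numeric formats with an explicit format string, which is fine for plain "%Y-%m-%d" dates like these.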
