As others mentioned, strptime (converting from character to POSIXlt) is the bottleneck here. Another simple solution is to use the lubridate package and its fast_strptime function instead.
Here’s what it looks like on my data:
> tables()
     NAME      NROW  MB COLS
[1,] pp   3,718,339 126 session_id,date,user_id,path,num_sessions
     KEY
[1,] user_id,date
Total: 126MB
> pp[, 2]
date
1: 2013-09-25
2: 2013-09-25
3: 2013-09-25
4: 2013-09-25
5: 2013-09-25
---
3718335: 2013-09-25
3718336: 2013-09-25
3718337: 2013-09-25
3718338: 2013-10-11
3718339: 2013-10-11
> system.time(pp[, date := as.Date(fast_strptime(date, "%Y-%m-%d"))])
user system elapsed
0.315 0.026 0.344
For comparison:
> system.time(pp[, date := as.Date(date, "%Y-%m-%d")])
user system elapsed
108.193 0.399 108.844
That’s ~316 times faster (108.844 s vs. 0.344 s elapsed)!
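If you want to try this without my data, here is a minimal sketch on a synthetic table. The table name dt, its size, and the random dates are made up for illustration and are not the pp table above; timings will differ by machine and package version, so none are shown. It only checks that the fast_strptime route gives the same dates as plain as.Date. Note that fast_strptime parses in UTC by default and, as far as I know, only handles numeric formats like this one.

```r
library(data.table)
library(lubridate)

# Synthetic stand-in for the real table (hypothetical data, 100k rows).
set.seed(1)
n <- 1e5
dt <- data.table(
  user_id = sample.int(1000L, n, replace = TRUE),
  # Character dates between 2013-09-25 and 2013-10-11.
  date = format(as.Date("2013-09-25") + sample.int(17L, n, replace = TRUE) - 1L)
)

# Reference result from base R, kept outside the table for comparison.
expected <- as.Date(dt$date, format = "%Y-%m-%d")

# Parse with fast_strptime, then drop the time part with as.Date.
dt[, date := as.Date(fast_strptime(date, format = "%Y-%m-%d"))]

# Both routes should produce the same dates.
stopifnot(inherits(dt$date, "Date"), all(dt$date == expected))
```

Wrapping either conversion in system.time(), as above, lets you compare the two on your own machine.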