First off, two comments:
- Your
dput
throws errors. Firstly, you are missing single ticks for the date column names; secondly there seems to be an error forOUT_TIME_date
ofCASEID = 101571(5)
. For future posts, please double-check that sample data (1) doesn’t contain R syntax errors, and (2) is correct & representative. - There are also mistakes in your expected output, where you don’t include the last day for stays that extend past one year.
Here is one option making use of lubridate
functions to deal with Date
s
library(tidyverse)
library(lubridate)
df %>%
mutate_at(vars(ends_with("date")), ~as.Date(.x, format = "%Y-%m-%d")) %>%
mutate(
LOS = difftime(OUT_TIME_date, IN_TIME_date, units = "days"),
grp = map2(
IN_TIME_date,
OUT_TIME_date,
~year(seq.Date(.x, .y - 1, by = "day")))) %>%
unnest() %>%
group_by_all() %>%
tally() %>%
spread(grp, n, fill = 0) %>%
ungroup() %>%
as.data.frame()
# CASEID IN_TIME_date OUT_TIME_date LOS 2012 2013 2014 2015 2016
#1 023252(1) 2017-02-25 2017-03-02 5 days 0 0 0 0 0
#2 07597558(2) 2015-10-23 2015-12-05 43 days 0 0 0 43 0
#3 07597558(3) 2016-07-06 2016-07-15 9 days 0 0 0 0 9
#4 100520(31) 2013-01-23 2013-01-28 5 days 0 5 0 0 0
#5 100520(32) 2013-03-12 2013-03-18 6 days 0 6 0 0 0
#6 100520(33) 2013-06-13 2013-06-18 5 days 0 5 0 0 0
#7 100520(34) 2013-10-08 2013-10-15 7 days 0 7 0 0 0
#8 10056(1) 2016-02-20 2016-02-29 9 days 0 0 0 0 9
#9 101171(4) 2015-09-24 2015-10-19 25 days 0 0 0 25 0
#10 101171(5) 2015-10-19 2015-11-02 14 days 0 0 0 14 0
#11 101455(2) 2014-05-01 2014-05-28 27 days 0 0 27 0 0
#12 101557(2) 2015-12-11 2016-01-15 35 days 0 0 0 21 14
#13 101571(3) 2014-08-26 2015-07-21 329 days 0 0 128 201 0
#14 101571(4) 2015-07-21 2016-01-06 169 days 0 0 0 164 5
#15 101571(5) 2016-01-06 2018-03-20 804 days 0 0 0 0 361
#16 101571(6) 2017-03-20 2017-12-14 269 days 0 0 0 0 0
#17 10160(5) 2014-04-14 2014-06-14 61 days 0 0 61 0 0
#18 101637(2) 2017-04-25 2017-05-09 14 days 0 0 0 0 0
#19 101893(13) 2014-08-10 2014-08-21 11 days 0 0 11 0 0
#20 101893(15) 2017-02-06 2017-02-11 5 days 0 0 0 0 0
#21 101893(16) 2017-04-12 2017-04-20 8 days 0 0 0 0 0
#22 102807(4) 2016-01-19 2016-01-24 5 days 0 0 0 0 5
#23 102807(5) 2016-06-08 2016-06-15 7 days 0 0 0 0 7
#24 102862(12) 2012-10-19 2013-01-25 98 days 74 24 0 0 0
# 2017 2018
#1 5 0
#2 0 0
#3 0 0
#4 0 0
#5 0 0
#6 0 0
#7 0 0
#8 0 0
#9 0 0
#10 0 0
#11 0 0
#12 0 0
#13 0 0
#14 0 0
#15 365 78
#16 269 0
#17 0 0
#18 14 0
#19 0 0
#20 5 0
#21 8 0
#22 0 0
#23 0 0
#24 0 0
The idea is to generate a day-by-day sequence of dates based on IN_TIME_date
and OUT_TIME_date
; we then extract only the years from those sequences and tally the number of years per CASEID
. The rest is basic tidying/reshaping of your data.
df <- structure(list(CASEID = c("023252(1)", "07597558(2)", "07597558(3)",
"100520(31)", "100520(32)", "100520(33)", "100520(34)", "10056(1)",
"101171(4)", "101171(5)", "101455(2)", "101557(2)", "101571(3)",
"101571(4)", "101571(5)", "101571(6)", "10160(5)", "101637(2)",
"101893(13)", "101893(15)", "101893(16)", "102807(4)", "102807(5)",
"102862(12)"), IN_TIME_date = c("2017-02-25", "2015-10-23", "2016-07-06",
"2013-01-23", "2013-03-12", "2013-06-13", "2013-10-08", "2016-02-20",
"2015-09-24", "2015-10-19", "2014-05-01", "2015-12-11", "2014-08-26",
"2015-07-21", "2016-01-06", "2017-03-20", "2014-04-14", "2017-04-25",
"2014-08-10", "2017-02-06", "2017-04-12", "2016-01-19", "2016-06-08",
"2012-10-19"), OUT_TIME_date = c("2017-03-02", "2015-12-05",
"2016-07-15", "2013-01-28", "2013-03-18", "2013-06-18", "2013-10-15",
"2016-02-29", "2015-10-19", "2015-11-02", "2014-05-28", "2016-01-15",
"2015-07-21", "2016-01-06", "2018-03-20", "2017-12-14", "2014-06-14",
"2017-05-09", "2014-08-21", "2017-02-11", "2017-04-20", "2016-01-24",
"2016-06-15", "2013-01-25")), row.names = c(NA, -24L), class = "data.frame")