How can I fix corrupted dates in R?

You need none of the packages you’ve loaded, nor do you need to use Reduce, as functions we’re using here are naturally “vectorized”.

Here’s a sample of your data. (A good question includes data in an easily copied format such as this.)

x <- c("19/9/1997", "22/9/1997", "23/9/1997", "24/9/1997", "25/9/1997",
       "26/9/1997", "29/9/1997", "30/9/1997",
       "35440", "35471", "35499", "35591", "35621",
       "35652", "35683", "35713")

dates <- as.Date(x, format="%d/%m/%Y")
dates
#  [1] "1997-09-19" "1997-09-22" "1997-09-23" "1997-09-24" "1997-09-25"
#  [6] "1997-09-26" "1997-09-29" "1997-09-30" NA           NA          
# [11] NA           NA           NA           NA           NA          
# [16] NA          

Not surprisingly, the second-half of the dates are not recognized given format="%d/%m/%Y". You mentioned the use of "%m/%d/%Y" in your question, so we can (1) do a literal second-pass for this format (un-utilized with this example, but still relevant for your work?):

dates[is.na(dates)] <- as.Date(x[is.na(dates)], format="%m/%d/%Y")

where [is.na(dates)] only works on the un-converted elements.

(2) If we have more than one other format, you can always use a vector of them and loop over them. (For this, I’ll start over, since this loop would replace/augment the first steps above.)

formats <- c("%m/%d/%Y", "%d/%m/%Y", "%Y/%m/%d")
dates <- as.Date(rep(NA, length(x)))
for (fmt in formats) {
  nas <- is.na(dates)
  dates[nas] <- as.Date(x[nas], format=fmt)
}
dates
#  [1] "1997-09-19" "1997-09-22" "1997-09-23" "1997-09-24" "1997-09-25"
#  [6] "1997-09-26" "1997-09-29" "1997-09-30" NA           NA          
# [11] NA           NA           NA           NA           NA          
# [16] NA          

This still leaves us with NAs for the integer-looking ones. For these you need to specify the origin= to be able to figure it out (as well as converting to an integer). R typically works with an origin of "1970-01-01", which you can confirm with

as.integer(Sys.Date())
# [1] 17787
Sys.Date() - 17787
# [1] "1970-01-01"

but it appears that your dates have an origin of "1900-01-01", I think that’s Excel’s default storage of dates (but it doesn’t matter here):

x[9]    # the first integer-looking element
# [1] "35440"
dates[1] - as.integer(x[9])
# [1] "1900-09-08"

(I’m assuming that your dates are from the same relative period of time.)

From here:

nas <- is.na(dates)
dates[nas] <- as.Date(as.integer(x[nas]), origin="1900-01-01")
dates
#  [1] "1997-09-19" "1997-09-22" "1997-09-23" "1997-09-24" "1997-09-25"
#  [6] "1997-09-26" "1997-09-29" "1997-09-30" "1997-01-12" "1997-02-12"
# [11] "1997-03-12" "1997-06-12" "1997-07-12" "1997-08-12" "1997-09-12"
# [16] "1997-10-12"

(Working on the indices of only NA elements is relatively efficient in that it only works on and replaces the not-yet-matched entries. If there is nothing left when it gets to another call to as.Date, it does still call it but with an argument of length 0, with which the function works rather efficiently. I don’t think adding a conditional of if (any(nas)) ... would help, but if there are further methods you need that might be more “expensive”, you can consider it.)

Leave a Comment