As another base R solution, here is a poor man’s na.locf (the last-observation-carried-forward function from the zoo package):
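fill_down <- function(v) {
    if (length(v) > 1) {
        ## keep the first element (even if NA) and every later non-NA value
        keep <- c(TRUE, !is.na(v[-1]))
        ## cumsum(keep) maps each position to the most recent kept value
        v[keep][cumsum(keep)]
    } else v
}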
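To see why the indexing trick works, it can help to trace the intermediate values on a small vector. The sketch below repeats the definition of fill_down() so that it runs on its own; the vector x is made up for illustration.

```r
fill_down <- function(v) {
    if (length(v) > 1) {
        keep <- c(TRUE, !is.na(v[-1]))
        v[keep][cumsum(keep)]
    } else v
}

## made-up example vector
x <- c(NA, 1, NA, NA, 2, NA)

keep <- c(TRUE, !is.na(x[-1]))   # TRUE TRUE FALSE FALSE TRUE FALSE
idx <- cumsum(keep)              # 1 2 2 2 3 3
kept <- x[keep]                  # NA 1 2
filled <- fill_down(x)           # NA 1 1 1 2 2
```

Note that a leading NA stays NA, since there is no earlier value to carry forward.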
To fill down by group, use tapply() to split the vector by group and apply fill_down() to each piece, then split<- to put the filled values back in their original positions. The built-in ave() wraps both steps:
fill_down_by_group <- function(v, grp) {
    ## original 'by hand':
    ##     split(v, grp) <- tapply(v, grp, fill_down)
    ##     v
    ## done by built-in function `ave()`
    ave(v, grp, FUN=fill_down)
}
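As a quick check that ave() gives the same answer as the by-hand version, the following sketch runs both on a small vector with interleaved groups. The definition of fill_down() is repeated so the block is self-contained; v, grp, by_hand, and by_ave are illustrative names and made-up data.

```r
fill_down <- function(v) {
    if (length(v) > 1) {
        keep <- c(TRUE, !is.na(v[-1]))
        v[keep][cumsum(keep)]
    } else v
}

## made-up data: groups "a" and "b" are interleaved
v   <- c(1, NA, NA, 2, NA, NA)
grp <- c("a", "b", "a", "b", "a", "b")

## by hand: split, fill each group, write back into place
by_hand <- v
split(by_hand, grp) <- tapply(v, grp, fill_down)

## the same thing via ave()
by_ave <- ave(v, grp, FUN = fill_down)

by_ave   # 1 NA 1 2 1 2
```

The NA in position 2 remains because it is the first element of group "b".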
To process multiple columns, one might use lapply() over the column names:
elts <- c("age", "birthplace")
df[elts] <- lapply(df[elts], fill_down_by_group, df$name)
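A small worked example may make this concrete. The data frame below is made up: the column names name, age, and birthplace follow the snippet above, but the values are hypothetical. The helper functions are repeated so the block runs on its own.

```r
fill_down <- function(v) {
    if (length(v) > 1) {
        keep <- c(TRUE, !is.na(v[-1]))
        v[keep][cumsum(keep)]
    } else v
}

fill_down_by_group <- function(v, grp) {
    ave(v, grp, FUN = fill_down)
}

## hypothetical data
df <- data.frame(
    name = c("Ann", "Ann", "Bob", "Bob", "Bob"),
    age = c(30, NA, NA, 41, NA),
    birthplace = c("Oslo", NA, NA, "Lima", NA),
    stringsAsFactors = FALSE
)

elts <- c("age", "birthplace")
df[elts] <- lapply(df[elts], fill_down_by_group, df$name)

df$age          # 30 30 NA 41 41
df$birthplace   # "Oslo" "Oslo" NA "Lima" "Lima"
```

Bob's first row stays NA in both columns, because filling never crosses a group boundary.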
Notes
- I would be interested in seeing how a dplyr solution handles many columns without hard-coding each. Answering my own question, I guess this is
    library(dplyr); library(tidyr)
    df %>% group_by(name) %>% fill_(elts)
  (fill_() is deprecated in recent tidyr releases, where the equivalent is fill(all_of(elts)).)
- A more efficient base solution when the groups are already ‘grouped’ (e.g., identical(grp, sort(grp))) is
    fill_down_by_grouped <- function(v, grp) {
        if (length(v) > 1) {
            ## keep the first element of each group and every non-NA value
            keep <- !(duplicated(grp) & is.na(v))
            v[keep][cumsum(keep)]
        } else v
    }
- For me, fill_down() on a vector with about 10M elements takes ~225ms; fill_down_by_grouped() takes ~300ms independent of the number of groups; fill_down_by_group() scales with the number of groups: ~2s for 10000 groups, about 36s for 10M groups.
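These timings will vary by machine. A rough way to reproduce the comparison is sketched below. It assumes the grouped variant builds its keep mask from duplicated(grp), so that the first element of each group is always kept. It also uses a smaller vector (1e6 elements) than the 10M quoted above; the sizes n and n_grp, the seed, and the NA fraction are arbitrary choices.

```r
fill_down <- function(v) {
    if (length(v) > 1) {
        keep <- c(TRUE, !is.na(v[-1]))
        v[keep][cumsum(keep)]
    } else v
}

fill_down_by_group <- function(v, grp) {
    ave(v, grp, FUN = fill_down)
}

## assumption: the mask uses duplicated(grp); groups must be contiguous
fill_down_by_grouped <- function(v, grp) {
    if (length(v) > 1) {
        keep <- !(duplicated(grp) & is.na(v))
        v[keep][cumsum(keep)]
    } else v
}

set.seed(1)
n <- 1e6       # vector length (arbitrary)
n_grp <- 1e4   # upper bound on the number of groups (arbitrary)
grp <- sort(sample(n_grp, n, replace = TRUE))   # already 'grouped'
v <- rnorm(n)
v[sample(n, n / 2)] <- NA                       # half the values missing

## both group-aware versions should agree on sorted groups
stopifnot(identical(fill_down_by_group(v, grp),
                    fill_down_by_grouped(v, grp)))

system.time(fill_down(v))
system.time(fill_down_by_grouped(v, grp))
system.time(fill_down_by_group(v, grp))
```

Raising n_grp toward n should show the ave()-based version slowing down while the grouped version stays roughly flat.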