Read Json file into a data.frame without nested lists

Update: 21 February 2016

col_fixer updated to include a vec2col argument that lets you flatten a list column into either a single string or a set of columns.


In the data.frame you’ve downloaded, I see several different column types. There are normal columns comprising vectors of the same type. There are list columns where the items may be NULL or may themselves be a flat vector. There are list columns where there are data.frames as the list elements. There are list columns that contain a data.frame of the same number of rows as the main data.frame.

Here’s a sample dataset that recreates those conditions:

mydf <- data.frame(id = 1:3, type = c("A", "A", "B"), 
                   facility = I(list(c("x", "y"), NULL, "x")),
  address = I(list(data.frame(v1 = 1, v2 = 2, v4 = 3), 
                   data.frame(v1 = 1:2, v2 = 3:4, v3 = 5), 
                   data.frame(v1 = 1, v2 = NA, v3 = 3))))

mydf$person <- data.frame(name = c("AA", "BB", "CC"), age = c(20, 32, 23),
                          preference = c(TRUE, FALSE, TRUE))

The str of this sample data.frame looks like:

str(mydf)
## 'data.frame':    3 obs. of  5 variables:
##  $ id      : int  1 2 3
##  $ type    : Factor w/ 2 levels "A","B": 1 1 2
##  $ facility:List of 3
##   ..$ : chr  "x" "y"
##   ..$ : NULL
##   ..$ : chr "x"
##   ..- attr(*, "class")= chr "AsIs"
##  $ address :List of 3
##   ..$ :'data.frame': 1 obs. of  3 variables:
##   .. ..$ v1: num 1
##   .. ..$ v2: num 2
##   .. ..$ v4: num 3
##   ..$ :'data.frame': 2 obs. of  3 variables:
##   .. ..$ v1: int  1 2
##   .. ..$ v2: int  3 4
##   .. ..$ v3: num  5 5
##   ..$ :'data.frame': 1 obs. of  3 variables:
##   .. ..$ v1: num 1
##   .. ..$ v2: logi NA
##   .. ..$ v3: num 3
##   ..- attr(*, "class")= chr "AsIs"
##  $ person  :'data.frame':    3 obs. of  3 variables:
##   ..$ name      : Factor w/ 3 levels "AA","BB","CC": 1 2 3
##   ..$ age       : num  20 32 23
##   ..$ preference: logi  TRUE FALSE TRUE
## NULL

One way you can “flatten” this is to “fix” the list columns. There are three fixes.

  1. flatten (from “jsonlite”) will take care of columns like the “person” column.
  2. Columns like the “facility” column can be fixed using toString, which would convert each element to a comma separated item or which can be converted into multiple columns.
  3. Columns where there are data.frames, some with multiple rows, first need to be flattened into a single row (by transforming to a “wide” format) and then need to be bound together as a single data.table. (I’m using “data.table” for reshaping and for binding the rows together).

We can take care of the second and third points with a function like the following:

col_fixer <- function(x, vec2col = FALSE) {
  if (!is.list(x[[1]])) {
    if (isTRUE(vec2col)) {
      as.data.table(data.table::transpose(x))
    } else {
      vapply(x, toString, character(1L))
    }
  } else {
    temp <- rbindlist(x, use.names = TRUE, fill = TRUE, idcol = TRUE)
    temp[, .time := sequence(.N), by = .id]
    value_vars <- setdiff(names(temp), c(".id", ".time"))
    dcast(temp, .id ~ .time, value.var = value_vars)[, .id := NULL]
  }
}

We’ll integrate that and the flatten function in another function that would do most of the processing.

Flattener <- function(indf, vec2col = FALSE) {
  require(data.table)
  require(jsonlite)
  indf <- flatten(indf)
  listcolumns <- sapply(indf, is.list)
  newcols <- do.call(cbind, lapply(indf[listcolumns], col_fixer, vec2col))
  indf[listcolumns] <- list(NULL)
  cbind(indf, newcols)
}

Running the function gives us:

Flattener(mydf)
##   id type person.name person.age person.preference facility address.v1_1
## 1  1    A          AA         20              TRUE     x, y            1
## 2  2    A          BB         32             FALSE                     1
## 3  3    B          CC         23              TRUE        x            1
##   address.v1_2 address.v2_1 address.v2_2 address.v4_1 address.v4_2 address.v3_1
## 1           NA            2           NA            3           NA           NA
## 2            2            3            4           NA           NA            5
## 3           NA           NA           NA           NA           NA            3
##   address.v3_2
## 1           NA
## 2            5
## 3           NA

Or, with the vectors going into separate columns:

Flattener(mydf, TRUE)
##   id type person.name person.age person.preference facility.V1 facility.V2
## 1  1    A          AA         20              TRUE           x           y
## 2  2    A          BB         32             FALSE        <NA>        <NA>
## 3  3    B          CC         23              TRUE           x        <NA>
##   address.v1_1 address.v1_2 address.v2_1 address.v2_2 address.v4_1 address.v4_2
## 1            1           NA            2           NA            3           NA
## 2            1            2            3            4           NA           NA
## 3            1           NA           NA           NA           NA           NA
##   address.v3_1 address.v3_2
## 1           NA           NA
## 2            5            5
## 3            3           NA

Here’s the str:

str(Flattener(mydf))
## 'data.frame':    3 obs. of  14 variables:
##  $ id               : int  1 2 3
##  $ type             : Factor w/ 2 levels "A","B": 1 1 2
##  $ person.name      : Factor w/ 3 levels "AA","BB","CC": 1 2 3
##  $ person.age       : num  20 32 23
##  $ person.preference: logi  TRUE FALSE TRUE
##  $ facility         : chr  "x, y" "" "x"
##  $ address.v1_1     : num  1 1 1
##  $ address.v1_2     : num  NA 2 NA
##  $ address.v2_1     : num  2 3 NA
##  $ address.v2_2     : num  NA 4 NA
##  $ address.v4_1     : num  3 NA NA
##  $ address.v4_2     : num  NA NA NA
##  $ address.v3_1     : num  NA 5 3
##  $ address.v3_2     : num  NA 5 NA
## NULL

On your “providers” object, this runs very quickly and consistently:

library(microbenchmark)
out <- microbenchmark(Flattener(providers), Flattener(providers, TRUE), flattenList(jsonRList))
out
# Unit: milliseconds
#                        expr        min         lq      mean    median        uq       max neval
#        Flattener(providers)  104.18939  126.59295  157.3744  138.4185  174.5222  308.5218   100
#  Flattener(providers, TRUE)   67.56471   86.37789  109.8921   96.3534  121.4443  301.4856   100
#      flattenList(jsonRList) 1780.44981 2065.50533 2485.1924 2269.4496 2694.1487 4397.4793   100

library(ggplot2)
qplot(y = time, data = out, colour = expr) ## Via @TylerRinker

enter image description here

Leave a Comment