Reading in multiple files compressed in tar.gz archive into Spark [duplicate]

A solution is given in Read whole text files from a compression in Spark.
Using the code sample provided there, I was able to create a DataFrame from the compressed archive like so:

val jsonRDD = sc.binaryFiles("gzarchive/*").
               flatMapValues(x => extractFiles(x).toOption).
               mapValues(_.map(decode()))

val df = sqlContext.read.json(jsonRDD.map(_._2).flatMap(x => x))
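
For reference, the extractFiles and decode helpers come from the linked answer. A minimal sketch of them is below, assuming Apache Commons Compress is on the classpath; the trailing .toList is a small addition that forces the lazy Stream, so read errors actually surface inside the Try:

    import java.io.ByteArrayOutputStream
    import java.nio.charset.{Charset, StandardCharsets}

    import scala.util.Try

    import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
    import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
    import org.apache.spark.input.PortableDataStream

    // Unpack every regular file in a .tar.gz stream into its own byte array.
    def extractFiles(ps: PortableDataStream, n: Int = 1024): Try[Seq[Array[Byte]]] = Try {
      val tar = new TarArchiveInputStream(new GzipCompressorInputStream(ps.open))
      Stream.continually(Option(tar.getNextTarEntry))
        .takeWhile(_.isDefined)
        .flatMap(_.toSeq)
        .filter(_.isFile)
        .map { _ =>
          val buffer = Array.fill(n)(0: Byte)
          val bos = new ByteArrayOutputStream()
          Stream.continually(tar.read(buffer))
            .takeWhile(_ != -1)
            .foreach(bos.write(buffer, 0, _))
          bos.toByteArray
        }
        .toList // force evaluation while the tar stream is still open
    }

    // Decode raw bytes to a String, UTF-8 by default.
    def decode(charset: Charset = StandardCharsets.UTF_8)(bytes: Array[Byte]): String =
      new String(bytes, charset)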

This method works fine for relatively small tar archives, but it is not suitable for large ones: binaryFiles treats each archive as a single, unsplittable record, so every archive must be unpacked whole by one task and fit in that task's memory.

A better solution for large inputs is to convert the tar archives to Hadoop SequenceFiles, which are splittable and can therefore be read and processed in parallel by Spark (as opposed to tar archives).

See: A Million Little Files – Digital Digressions by Stuart Sierra.
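
As a rough sketch of that route (the output path and the key scheme below are illustrative, not from the article): unpack each archive once, persist the member files as a SequenceFile keyed by their origin, and let later jobs read the splittable result in parallel.

    import org.apache.hadoop.io.{BytesWritable, Text}

    // One-off conversion: unpack each archive into (name, bytes) pairs and
    // persist them as a splittable SequenceFile.
    sc.binaryFiles("gzarchive/*")
      .flatMapValues(x => extractFiles(x).toOption)
      .flatMap { case (archive, files) =>
        files.zipWithIndex.map { case (bytes, i) =>
          (new Text(s"$archive#$i"), new BytesWritable(bytes))
        }
      }
      .saveAsSequenceFile("gzarchive-seq")

    // Later jobs read the SequenceFile in parallel, one task per split.
    val jsonStrings = sc
      .sequenceFile("gzarchive-seq", classOf[Text], classOf[BytesWritable])
      .map { case (_, bytes) => new String(bytes.copyBytes, "UTF-8") }

    val df = sqlContext.read.json(jsonStrings)

The conversion still pays the single-task cost of unpacking each archive once, but every subsequent read of the SequenceFile is split across tasks.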
