Read whole text files from a compressed archive in Spark

One possible solution is to read the data with binaryFiles and extract the content manually. In Scala:

```scala
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.spark.input.PortableDataStream
import scala.util.Try
import java.nio.charset._

def extractFiles(ps: PortableDataStream, n: Int = 1024) = Try {
  val tar = new TarArchiveInputStream(new GzipCompressorInputStream(ps.open))
  Stream.continually(Option(tar.getNextTarEntry))
    // Read until the next entry is null
    .takeWhile(_.isDefined)
    // flatten
    .flatMap(x => x)
    // …
}
```
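The snippet above is cut off at the entry-handling step. As a hedged completion sketch (not necessarily the original author's exact code, and reusing the imports above): one way to finish it is to skip directory entries, read each remaining entry fully into a byte array, and decode the bytes afterwards. The buffer size n, the glob path, and UTF-8 decoding are assumptions for illustration.

```scala
def extractFiles(ps: PortableDataStream, n: Int = 1024): Try[Array[Array[Byte]]] = Try {
  val tar = new TarArchiveInputStream(new GzipCompressorInputStream(ps.open))
  Stream.continually(Option(tar.getNextTarEntry))
    .takeWhile(_.isDefined)
    .flatMap(x => x)
    .filter(!_.isDirectory)  // skip directory entries
    .map { _ =>
      // Read the current entry fully, n bytes at a time.
      val out = new java.io.ByteArrayOutputStream()
      val buffer = new Array[Byte](n)
      var read = tar.read(buffer, 0, n)
      while (read != -1) {
        out.write(buffer, 0, read)
        read = tar.read(buffer, 0, n)
      }
      out.toByteArray
    }
    .toArray  // force evaluation before the stream is exhausted
}

// Wiring it into binaryFiles (the path is illustrative); archives that fail to
// extract are silently dropped via Try#toOption.
val texts = sc
  .binaryFiles("hdfs:///data/*.tar.gz")
  .flatMap { case (_, ps) => extractFiles(ps).toOption.toSeq.flatten }
  .map(bytes => new String(bytes, StandardCharsets.UTF_8))
```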

While writing to an HDFS path, getting java.io.IOException: Failed to rename

You can do all the selects in a single job: run each select, union the results into one Dataset, and save that once. In Java:

```java
Dataset<Row> resultDs = givenItemList.parallelStream()
    .map(item -> {
        // srcTable is a placeholder for whatever table/view the items come from.
        String query = "select '" + item + "' as itemCol, avg(" + item + ") as mean "
                     + "from srcTable group by year";
        return sparkSession.sql(query);
    })
    .reduce((a, b) -> a.union(b))
    .get();
saveDsToHdfs(hdfsPath, resultDs);
```
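saveDsToHdfs is the asker's own helper, so the write step isn't shown above. A hedged Scala sketch of the same pattern with the write spelled out (itemList, srcTable, and hdfsPath are assumptions for illustration):

```scala
// One Dataset per item, one union, one write: a single output commit instead of
// many small jobs, each of which would rename its own temporary files on HDFS.
val resultDf = itemList
  .map(item => spark.sql(s"select '$item' as itemCol, avg($item) as mean from srcTable group by year"))
  .reduce(_ union _)

resultDf.write.mode("overwrite").parquet(hdfsPath)
```

Committing one unioned result also sidesteps much of the per-job temporary-file renaming that tends to trigger "Failed to rename" errors on HDFS.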

Spark – load CSV file as DataFrame?

CSV support became part of core Spark in 2.0, so it no longer requires the separate spark-csv library. You can just do, for example:

```python
df = spark.read.format("csv").option("header", "true").load("csvfile.csv")
```

In Scala, using the external com.databricks.spark.csv package (needed on Spark 1.x; the delimiter option works for any delimited format: "," for CSV, "\t" for TSV, etc.):

```scala
val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .load("csvfile.csv")
```
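With the built-in reader, the header, schema-inference, and delimiter options are commonly combined. A minimal Scala sketch (the path is an assumption for illustration):

```scala
// Built-in CSV reader, Spark 2.x+.
val df = spark.read
  .option("header", "true")       // treat the first line as column names
  .option("inferSchema", "true")  // sample the data to infer column types
  .option("delimiter", ",")       // "," for CSV, "\t" for TSV
  .csv("hdfs:///data/csvfile.csv")

df.printSchema()  // verify the inferred schema before relying on it
```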