Read whole text files from a compressed archive in Spark

One possible solution is to read the data with binaryFiles and extract the content manually. In Scala:

```scala
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.spark.input.PortableDataStream
import scala.util.Try
import java.nio.charset._

def extractFiles(ps: PortableDataStream, n: Int = 1024) = Try {
  val tar = new TarArchiveInputStream(new GzipCompressorInputStream(ps.open))
  Stream.continually(Option(tar.getNextTarEntry))
    // Read until the next entry is null
    .takeWhile(_.isDefined)
    // flatten
    .flatMap(x => x)
    // …
}
```
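The snippet above is cut off at the entry-handling step. As a hedged completion sketch (not necessarily the original author's exact code, and reusing the imports above): one way to finish it is to skip directory entries, read each remaining entry fully into a byte array, and decode the bytes afterwards. The buffer size n, the glob path, and UTF-8 decoding are assumptions for illustration.

```scala
def extractFiles(ps: PortableDataStream, n: Int = 1024): Try[Array[Array[Byte]]] = Try {
  val tar = new TarArchiveInputStream(new GzipCompressorInputStream(ps.open))
  Stream.continually(Option(tar.getNextTarEntry))
    .takeWhile(_.isDefined)
    .flatMap(x => x)
    .filter(!_.isDirectory)  // skip directory entries
    .map { _ =>
      // Read the current entry fully, n bytes at a time.
      val out = new java.io.ByteArrayOutputStream()
      val buffer = new Array[Byte](n)
      var read = tar.read(buffer, 0, n)
      while (read != -1) {
        out.write(buffer, 0, read)
        read = tar.read(buffer, 0, n)
      }
      out.toByteArray
    }
    .toArray  // force evaluation before the stream is exhausted
}

// Wiring it into binaryFiles (the path is illustrative); archives that fail to
// extract are silently dropped via Try#toOption.
val texts = sc
  .binaryFiles("hdfs:///data/*.tar.gz")
  .flatMap { case (_, ps) => extractFiles(ps).toOption.toSeq.flatten }
  .map(bytes => new String(bytes, StandardCharsets.UTF_8))
```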

While writing to an HDFS path, getting java.io.IOException: Failed to rename

You can do all the selects in a single job: run each select, union the results into one Dataset, and save that once. In Java:

```java
Dataset<Row> resultDs = givenItemList.parallelStream()
    .map(item -> {
        // srcTable is a placeholder for whatever table/view the items come from.
        String query = "select '" + item + "' as itemCol, avg(" + item + ") as mean "
                     + "from srcTable group by year";
        return sparkSession.sql(query);
    })
    .reduce((a, b) -> a.union(b))
    .get();
saveDsToHdfs(hdfsPath, resultDs);
```
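saveDsToHdfs is the asker's own helper, so the write step isn't shown above. A hedged Scala sketch of the same pattern with the write spelled out (itemList, srcTable, and hdfsPath are assumptions for illustration):

```scala
// One Dataset per item, one union, one write: a single output commit instead of
// many small jobs, each of which would rename its own temporary files on HDFS.
val resultDf = itemList
  .map(item => spark.sql(s"select '$item' as itemCol, avg($item) as mean from srcTable group by year"))
  .reduce(_ union _)

resultDf.write.mode("overwrite").parquet(hdfsPath)
```

Committing one unioned result also sidesteps much of the per-job temporary-file renaming that tends to trigger "Failed to rename" errors on HDFS.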

Spark – load CSV file as DataFrame?

CSV support became part of core Spark in 2.0, so it no longer requires the separate spark-csv library. You can just do, for example:

```python
df = spark.read.format("csv").option("header", "true").load("csvfile.csv")
```

In Scala, using the external com.databricks.spark.csv package (needed on Spark 1.x; the delimiter option works for any delimited format: "," for CSV, "\t" for TSV, etc.):

```scala
val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .load("csvfile.csv")
```
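With the built-in reader, the header, schema-inference, and delimiter options are commonly combined. A minimal Scala sketch (the path is an assumption for illustration):

```scala
// Built-in CSV reader, Spark 2.x+.
val df = spark.read
  .option("header", "true")       // treat the first line as column names
  .option("inferSchema", "true")  // sample the data to infer column types
  .option("delimiter", ",")       // "," for CSV, "\t" for TSV
  .csv("hdfs:///data/csvfile.csv")

df.printSchema()  // verify the inferred schema before relying on it
```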