What is the use of the grouping comparator in Hadoop MapReduce?

In support of the chosen answer, and following on from this explanation:

**Input**:

| symbol | time | price |
|--------|------|-------|
| a      | 1    | 10    |
| a      | 2    | 20    |
| b      | 3    | 30    |

**Map output**: create composite key/values like so:

> symbol-time   time-price
>
> **a-1**   1-10
>
> **a-2**   2-20
>
> **b-3**   3-30

The Partitioner will route the a-1 and a-2 keys to the same reducer … Read more
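To make the mechanism concrete, here is a minimal sketch of such a grouping comparator, assuming the composite symbol-time key is carried as a `Text` value (the class name and key layout are illustrative, not from the answer above):

```scala
import org.apache.hadoop.io.{Text, WritableComparable, WritableComparator}

// Groups composite "symbol-time" keys by the natural key (symbol),
// so a-1 and a-2 reach the same reduce() call even though the full
// composite keys differ.
class SymbolGroupingComparator extends WritableComparator(classOf[Text], true) {
  override def compare(a: WritableComparable[_], b: WritableComparable[_]): Int = {
    val symbolA = a.asInstanceOf[Text].toString.split("-")(0)
    val symbolB = b.asInstanceOf[Text].toString.split("-")(0)
    symbolA.compareTo(symbolB)
  }
}

// Wiring it into a job (job setup elided):
// job.setGroupingComparatorClass(classOf[SymbolGroupingComparator])
```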

Change output filename prefix for DataFrame.write()

You cannot change the “part” prefix while using any of the standard output formats (like Parquet). See this snippet from the ParquetRelation source code:

```scala
private val recordWriter: RecordWriter[Void, InternalRow] = {
  val outputFormat = {
    new ParquetOutputFormat[InternalRow]() {
      // …
      override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
        // ..
        // prefix is hard-coded here:
```

… Read more
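Since the prefix is hard-coded in the output format, a common workaround is to write with the default names and then rename the part files with the Hadoop FileSystem API. This is a sketch under assumptions: the output path and the "report" prefix are hypothetical, not from the quoted source.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rename-parts").getOrCreate()
val outDir = new Path("/tmp/output")

// Write normally; files come out as part-*.parquet.
spark.range(10).write.parquet(outDir.toString)

// Rename each part file after the fact.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(outDir)
  .map(_.getPath)
  .filter(_.getName.startsWith("part-"))
  .foreach { p =>
    fs.rename(p, new Path(p.getParent, p.getName.replaceFirst("^part", "report")))
  }
```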

MultipleOutputFormat in Hadoop

Each reducer uses an OutputFormat to write its records, which is why you are getting a set of odd and even files per reducer. This is by design, so that each reducer can perform its writes in parallel. If you want just a single odd file and a single even file, you’ll need to set mapred.reduce.tasks to 1. … Read more
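For reference, a minimal sketch of forcing a single reducer with the old mapred API the answer refers to (job setup elided):

```scala
import org.apache.hadoop.mapred.JobConf

// With one reducer, MultipleOutputFormat produces a single odd file
// and a single even file, at the cost of all reduce work running in
// one task.
val conf = new JobConf()
conf.setNumReduceTasks(1)               // programmatic equivalent of
// conf.set("mapred.reduce.tasks", "1") // the raw property form
```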

Is gzip format supported in Spark?

From the Spark Scala Programming guide’s section on “Hadoop Datasets”:

> Spark can create distributed datasets from any file stored in the Hadoop distributed file system (HDFS) or other storage systems supported by Hadoop (including your local file system, Amazon S3, Hypertable, HBase, etc). Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

Support for … Read more
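As an illustration, reading a gzipped text file needs no special handling, because the gzip codec is applied transparently by the underlying Hadoop input format. A sketch with a hypothetical path:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Note: a single .gz file is not splittable, so it arrives as a
// single partition; repartition afterwards if you need parallelism.
val sc = new SparkContext(new SparkConf().setAppName("gzip-read").setMaster("local[*]"))
val lines = sc.textFile("/tmp/logs/access.log.gz")
println(lines.count())
sc.stop()
```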