Apache Spark – Scala – ReduceByKey – with keys repeating up to twice only

What (probably) takes the time here is shuffling the data: when you want to group two or more records together, they must reside within the same partition, so Spark first has to shuffle the records so that all records with the same key end up in a single partition.
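
For reference, the shuffle-based path the question asks about is just reduceByKey. The following is a minimal sketch; it assumes vectors is an RDD of (key, value) pairs and that the values for each key are combined by multiplication, matching the snippet further down:

// Minimal sketch of the shuffle-based approach: reduceByKey moves all records
// with the same key into one partition, then combines their values.
// Assumes `vectors` is an RDD[(K, V)] with a multiplicative combine (_ * _).
val reduced = vectors.reduceByKey(_ * _)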

Now, even if each key has two records at most, this shuffle will still have to take place, unless you can somehow guarantee that all records for a given key already sit in a single partition – for example, if you loaded this RDD from HDFS and you somehow know that each key's records reside in a single file part to begin with. In that (unlikely) case, you can use mapPartitions to perform the grouping yourself on each partition separately, thus saving the shuffle:

vectors.mapPartitions(
  // Group within each partition only: no shuffle happens, but this assumes
  // every key's records already live in the same partition.
  iter => iter.toList.groupBy(_._1).map { case (k, list) => (k, list.map(_._2).reduce(_ * _)) }.iterator,
  preservesPartitioning = true)
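
As a self-contained illustration (the SparkContext setup, sample data, and slice count below are assumptions made purely for this sketch, not part of the original question), the partition-local grouping works when each key's records happen to land in the same partition – here because parallelize splits the sequence into contiguous slices:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("partition-local-grouping").setMaster("local[*]"))

// With four contiguous pairs split into two slices, both records of each key
// end up in the same partition, so no shuffle is needed.
val vectors = sc.parallelize(Seq(("a", 2.0), ("a", 3.0), ("b", 5.0), ("b", 7.0)), 2)

val grouped = vectors.mapPartitions(
  iter => iter.toList.groupBy(_._1).map { case (k, list) => (k, list.map(_._2).reduce(_ * _)) }.iterator,
  preservesPartitioning = true)

grouped.collect().foreach(println)  // prints (a,6.0) and (b,35.0)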

None of this is special to the case where the maximum repetition of each key is 2, by the way.
