Coalesce reduces parallelism of entire stage (spark)

This is not caused by a Spark SQL optimization. Spark SQL does not move the Coalesce operator, as the executed plan shows:

Coalesce 1
+- *Project [value#2, UDF(value#2) AS udfResult#11]
   +- *SerializeFromObject [input[0, double, false] AS value#2]
      +- Scan ExternalRDDScan[obj#1]

The description of the coalesce API explains the behavior. I quote the relevant paragraph:

Note: This paragraph was added by the JIRA issue SPARK-19399, so it does not appear in the 2.0 API documentation.

However, if you’re doing a drastic coalesce, e.g. to numPartitions =
1, this may result in your computation taking place on fewer nodes
than you like (e.g. one node in the case of numPartitions = 1). To
avoid this, you can call repartition. This will add a shuffle step,
but means the current upstream partitions will be executed in parallel
(per whatever the current partitioning is).

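The coalesce API does not perform a shuffle. Instead, it creates a narrow dependency between the parent RDD and the coalesced RDD, so both end up in the same stage. Because RDDs are evaluated lazily, the upstream operators (here, the Project containing the UDF) are only computed when the coalesced partitions are computed, which means they run with the coalesced number of tasks: a single task for Coalesce 1.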
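The effect of the narrow dependency can be observed directly at the RDD level. The following is a minimal sketch, not taken from the original question: the object name, the helper upstreamTaskIds, the input of 1000 integers in 8 partitions, and the local[4] master are all assumptions made for illustration. It records the ID of the task in which the upstream map actually runs, once for coalesce without a shuffle and once with a shuffle (which is what repartition does).

```scala
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

// Hypothetical demo object; names and sizes are illustrative only.
object CoalesceNarrowDependency {

  // Returns the partition IDs of the tasks in which the upstream map ran.
  // shuffle = false behaves like coalesce(1); shuffle = true like repartition(1).
  def upstreamTaskIds(sc: SparkContext, shuffle: Boolean): Set[Int] = {
    val tagged = sc
      .parallelize(1 to 1000, 8)                 // 8 upstream partitions
      .map(_ => TaskContext.getPartitionId())    // tag each record with its task's partition ID
    tagged.coalesce(1, shuffle).distinct().collect().toSet
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("coalesce-narrow-dependency")
    val sc = new SparkContext(conf)
    try {
      println(s"coalesce(1), no shuffle: ${upstreamTaskIds(sc, shuffle = false)}")
      println(s"coalesce(1), shuffle:    ${upstreamTaskIds(sc, shuffle = true)}")
    } finally {
      sc.stop()
    }
  }
}
```

Without the shuffle, the map should be seen running in a single task, because it is pulled into the one coalesced partition. With the shuffle, it should be seen running in all eight upstream tasks, since the shuffle boundary splits the work into a separate stage.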

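To prevent this, use the repartition API instead. It adds a shuffle step, so the upstream partitions are still computed in parallel before being merged.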
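As a rough sketch of that fix at the Dataset level, the snippet below builds a query similar in shape to the plan shown earlier and collapses it to one partition in both ways. The input data, the square-root UDF, the object name, and the build helper are assumptions for illustration; the original UDF is not shown in the source.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.udf

// Hypothetical demo object; the UDF and input data are placeholders.
object RepartitionInsteadOfCoalesce {

  // Returns (coalesced, repartitioned) versions of the same projection.
  def build(spark: SparkSession): (DataFrame, DataFrame) = {
    import spark.implicits._
    val ds = spark.sparkContext
      .parallelize(1 to 1000, 8)   // 8 upstream partitions
      .map(_.toDouble)
      .toDS()                      // single column named "value"
    val expensive = udf((x: Double) => math.sqrt(x)) // stand-in for a costly UDF
    val projected = ds.withColumn("udfResult", expensive($"value"))
    (projected.coalesce(1), projected.repartition(1))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("repartition-instead-of-coalesce")
      .getOrCreate()
    try {
      val (coalesced, repartitioned) = build(spark)
      coalesced.explain()      // plan with Coalesce on top of the Project
      repartitioned.explain()  // plan with a shuffle (Exchange) on top of the Project
    } finally {
      spark.stop()
    }
  }
}
```

Both results end up with one partition, but only the repartition variant puts a shuffle boundary above the Project, so the UDF is evaluated with the original parallelism. The cost is the extra shuffle, which is worth paying when the upstream computation is expensive relative to the data being moved.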
