Coalesce reduces parallelism of entire stage (spark)

Actually it is not because of SparkSQL’s optimization, SparkSQL doesn’t change the position of Coalesce operator, as the executed plan shows:

Coalesce 1
+- *Project [value#2, UDF(value#2) AS udfResult#11]
   +- *SerializeFromObject [input[0, double, false] AS value#2]
      +- Scan ExternalRDDScan[obj#1]

I quote a paragraph from coalesce API’s description:

Note: This paragraph is added by the jira SPARK-19399. So it should not be found in 2.0’s API.

However, if you’re doing a drastic coalesce, e.g. to numPartitions =
1, this may result in your computation taking place on fewer nodes
than you like (e.g. one node in the case of numPartitions = 1). To
avoid this, you can call repartition. This will add a shuffle step,
but means the current upstream partitions will be executed in parallel
(per whatever the current partitioning is).

The coalesce API doesn’t perform a shuffle, but results in a narrow dependency between previous RDD and current RDD. As RDD is lazy evaluation, the computation is actually done with coalesced partitions.

To prevent it, you should use repartition API.

More Related Contents:

Leave a Comment Cancel reply