Actually it is not caused by Spark SQL's optimizer: Spark SQL does not move the `Coalesce` operator, as the executed plan shows:
```
Coalesce 1
+- *Project [value#2, UDF(value#2) AS udfResult#11]
   +- *SerializeFromObject [input[0, double, false] AS value#2]
      +- Scan ExternalRDDScan[obj#1]
```
Here is a paragraph from the `coalesce` API's documentation (note: this paragraph was added by SPARK-19399, so it does not appear in the 2.0 API docs):
> However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
The `coalesce` API doesn't perform a shuffle; instead it creates a narrow dependency between the previous RDD and the current RDD. Since RDDs are evaluated lazily, the upstream computation (here, the UDF) is actually executed on the coalesced partitions, i.e. with only one task when coalescing to a single partition.
To prevent this, use the `repartition` API instead.
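The difference can be sketched with a pure-Python toy model (illustrative only; this is not Spark's API, and the `Partition`-like lists, `expensive_udf`, and function names are invented for the example). With a coalesce-style narrow dependency, the single merged task applies the upstream function itself, so the whole computation runs with parallelism 1; with a repartition-style shuffle, the function runs once per upstream partition before the results are merged.

```python
# Toy model of coalesce(1) vs. repartition(1) (not Spark itself).

def expensive_udf(x):
    return x * 2  # stand-in for a costly per-row computation

upstream = [[1, 2], [3, 4], [5, 6]]  # three upstream partitions

def coalesce_to_one(partitions, f):
    # Narrow dependency: one task reads all parent partitions and
    # applies f itself, so f runs with parallelism 1.
    tasks_running_f = 1
    merged = [f(x) for part in partitions for x in part]
    return [merged], tasks_running_f

def repartition_to_one(partitions, f):
    # Shuffle boundary: f runs in one task per upstream partition
    # (parallelism 3 here); only already-computed rows are merged.
    tasks_running_f = len(partitions)
    mapped = [[f(x) for x in part] for part in partitions]
    merged = [x for part in mapped for x in part]
    return [merged], tasks_running_f

c_result, c_parallelism = coalesce_to_one(upstream, expensive_udf)
r_result, r_parallelism = repartition_to_one(upstream, expensive_udf)
print(c_result, c_parallelism)  # same data, but computed by 1 task
print(r_result, r_parallelism)  # same data, computed by 3 tasks
```

Both paths produce identical output data; the difference is only where the UDF work runs, which is exactly the trade-off the quoted documentation describes.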