Spark - repartition() vs coalesce()

coalesce() avoids a full shuffle. If it is known that the number of partitions is decreasing, the executor can safely keep data on the minimum number of partitions, moving data only off the extra nodes and onto the nodes that are kept.

So, it would go something like this:

Node 1 = 1,2,3
Node 2 = 4,5,6
Node 3 = 7,8,9
Node 4 = 10,11,12

Then coalesce down to 2 partitions:

Node 1 = 1,2,3 + (10,11,12)
Node 3 = 7,8,9 + (4,5,6)

Notice that Node 1 and Node 3 did not require their original data to move.
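
The walkthrough above can be mimicked with a small plain-Python sketch. This is only a toy model of the idea, not Spark's actual placement logic: the function name coalesce_sketch, the kept_nodes argument, and the rule for pairing dropped nodes with kept nodes are all assumptions made for illustration, chosen so the result mirrors the example. In Spark itself the corresponding calls would be rdd.coalesce(2) versus rdd.repartition(2), where repartition() performs a full shuffle.

```python
# Toy model of coalesce: kept nodes hold on to their data,
# and only the data on dropped nodes is moved.
# (Hypothetical helper; Spark decides placement itself.)

def coalesce_sketch(partitions, kept_nodes):
    """Merge all partitions onto kept_nodes.

    Returns the new layout and the number of records that had to move.
    """
    # Kept nodes start with their original data, untouched.
    layout = {node: list(partitions[node]) for node in kept_nodes}
    dropped = [node for node in partitions if node not in kept_nodes]

    moved = 0
    # Pair dropped nodes with kept nodes round-robin. The reversed order
    # is an arbitrary choice that reproduces the example in the text.
    for i, node in enumerate(reversed(dropped)):
        target = kept_nodes[i % len(kept_nodes)]
        layout[target].extend(partitions[node])
        moved += len(partitions[node])
    return layout, moved


partitions = {
    1: [1, 2, 3],
    2: [4, 5, 6],
    3: [7, 8, 9],
    4: [10, 11, 12],
}

layout, moved = coalesce_sketch(partitions, kept_nodes=[1, 3])
total = sum(len(records) for records in partitions.values())

print(layout)
print(f"records moved: {moved} of {total}")
```

In this model only the records from Node 2 and Node 4 are counted as moved, whereas a full shuffle is free to redistribute every record.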
