Why does sortBy transformation trigger a Spark job?

sortBy is implemented using sortByKey, which depends on a RangePartitioner (JVM) or a partitioning function (Python). When you call sortBy / sortByKey, the partitioner (or partitioning function) is initialized eagerly and samples the input RDD to compute partition boundaries. The job you see corresponds to this sampling step.

The actual sorting is performed only when you execute an action on the newly created RDD or one of its descendants.
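
The split between an eager sampling pass and a lazy sort can be mimicked with a toy model in plain Python. This is only a sketch and not Spark's implementation: ToyRDD, sort_by, collect and jobs_run are hypothetical names made up for illustration, and the boundary selection is a simplified stand-in for what a range partitioner does.

```python
import bisect
import random


class ToyRDD:
    """Hypothetical stand-in for an RDD: a deferred function that yields partitions."""

    jobs_run = 0  # counts every pass that actually reads data (a "job")

    def __init__(self, compute):
        self._compute = compute  # zero-argument function returning a list of partitions

    @classmethod
    def parallelize(cls, data, num_partitions):
        chunks = [data[i::num_partitions] for i in range(num_partitions)]
        return cls(lambda: chunks)

    def _run_job(self):
        ToyRDD.jobs_run += 1
        return self._compute()

    def sort_by(self, key, num_partitions, sample_size=20, seed=0):
        # Eager step: read the input now and sample it to pick range boundaries.
        rows = [row for part in self._run_job() for row in part]
        rng = random.Random(seed)
        sample = sorted(key(r) for r in rng.sample(rows, min(sample_size, len(rows))))
        bounds = (
            [sample[len(sample) * i // num_partitions] for i in range(1, num_partitions)]
            if sample
            else []
        )

        # Lazy step: range-partition and sort, but only when an action asks for it.
        def compute():
            parts = [[] for _ in range(num_partitions)]
            for part in self._compute():
                for row in part:
                    parts[bisect.bisect_left(bounds, key(row))].append(row)
            return [sorted(p, key=key) for p in parts]

        return ToyRDD(compute)

    def collect(self):
        # Action: triggers the deferred computation.
        return [row for part in self._run_job() for row in part]


ToyRDD.jobs_run = 0
rdd = ToyRDD.parallelize([5, 3, 9, 1, 7, 2, 8], num_partitions=3)

sorted_rdd = rdd.sort_by(lambda x: x, num_partitions=2)
jobs_after_sort_by = ToyRDD.jobs_run   # 1: the sampling pass has already run

result = sorted_rdd.collect()          # the sort itself happens here
jobs_after_collect = ToyRDD.jobs_run   # 2: sampling job plus the sorting job
```

In this model, calling sort_by alone already bumps the job counter once, just as the sampling job shows up in the Spark UI before any action is executed; the second job appears only when collect is called.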
