Spark < 2.0:
You can use Hadoop configuration options:
mapred.min.split.size
mapred.max.split.size
as well as the HDFS block size, to control partition size for filesystem-based formats*.
val minSplit: Int = ???  // desired minimum split size in bytes
val maxSplit: Int = ???  // desired maximum split size in bytes

sc.hadoopConfiguration.setInt("mapred.min.split.size", minSplit)
sc.hadoopConfiguration.setInt("mapred.max.split.size", maxSplit)
Spark 2.0+:
You can use the spark.sql.files.maxPartitionBytes configuration:
spark.conf.set("spark.sql.files.maxPartitionBytes", maxSplit)
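As a rough mental model, the number of partitions for a file scan grows with the total input size divided by maxPartitionBytes. This ignores other factors Spark also considers (such as spark.sql.files.openCostInBytes and available parallelism), and estimatePartitions below is a hypothetical helper, not a Spark API:

```scala
// Rough estimate only: real split planning also accounts for
// spark.sql.files.openCostInBytes and the default parallelism.
def estimatePartitions(totalBytes: Long, maxPartitionBytes: Long): Long =
  math.ceil(totalBytes.toDouble / maxPartitionBytes).toLong

// A 1 GiB input with the default 128 MiB maxPartitionBytes
// comes out to roughly 8 partitions.
val approx = estimatePartitions(1024L * 1024 * 1024, 128L * 1024 * 1024)
```

Lowering maxPartitionBytes therefore increases the partition count (more, smaller tasks); raising it does the opposite.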
In both cases, these values may not be honored by a specific data source API, so you should always check the documentation and implementation details of the format you use.
* Other input formats can use different settings. See for example:
- Partitioning in spark while reading from RDBMS via JDBC
- Difference between mapreduce split and spark partition
Furthermore, Datasets created from RDDs inherit the partition layout of their parents.
Similarly, bucketed tables use the bucket layout defined in the metastore, with a 1:1 relationship between buckets and Dataset partitions.
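To illustrate the inheritance point, a sketch assuming an existing SparkSession named spark:

```scala
// Sketch only -- assumes a running SparkSession `spark` and its implicits.
import spark.implicits._

val rdd = spark.sparkContext.parallelize(1 to 100, 8) // explicitly 8 partitions
val ds  = rdd.toDS()
// The Dataset keeps the parent RDD's layout:
// ds.rdd.getNumPartitions is expected to be 8
```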