Why does the partition parameter of SparkContext.textFile not take effect?

If you take a look at the signature

textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] 

you’ll see that the parameter is called minPartitions, and this pretty much describes its function: it is a lower bound, not an exact partition count. In some cases even that lower bound is ignored, but that is a different matter. The Hadoop input format used behind the scenes still decides how to compute the input splits.
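A minimal sketch of this behavior (assuming a local SparkContext and a hypothetical file `data.txt`): minPartitions can only raise the partition count above what the input format computes on its own; it never forces an exact number.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("splits-demo"))

// Hypothetical input file; the input format computes splits
// from the file size and the configured split/block size.
val rdd = sc.textFile("data.txt", minPartitions = 8)

// At least 8 partitions, but possibly more if the input format
// produces more splits by itself.
println(rdd.getNumPartitions)
```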

In this particular case you could probably use mapred.min.split.size to increase the split size (this takes effect during load), or simply repartition after loading (this takes effect after the data is loaded), but in general there should be no need for that.
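Both workarounds could be sketched as follows (assumptions: a local SparkContext `sc`, a hypothetical file `data.txt`, and the Hadoop 1.x property name mapred.min.split.size; on newer Hadoop the equivalent is mapreduce.input.fileinputformat.split.minsize):

```scala
// Option 1: raise the minimum split size so the input format computes
// fewer, larger splits at load time (value in bytes, here 128 MB).
sc.hadoopConfiguration.setLong("mapred.min.split.size", 128L * 1024 * 1024)
val loaded = sc.textFile("data.txt")

// Option 2: load first, then reshape. repartition performs a full
// shuffle; coalesce avoids the shuffle but can only reduce the count.
val reshaped = sc.textFile("data.txt").repartition(4)
val shrunk   = sc.textFile("data.txt").coalesce(2)
```

Option 1 changes how splits are computed before any data is read; option 2 pays an extra pass (and, for repartition, a shuffle) after the load, which is why it is usually only worth it when the downstream work is significant.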
