The main difference, as you mentioned, is that textFile returns an RDD with each line as an element, while wholeTextFiles returns a PairRDD with the file path as the key and the file's full content as the value. If there is no need to separate the data by file, simply use textFile.
When reading uncompressed files, textFile splits the data into chunks of 32MB (the default for local files; on HDFS the chunk size follows the block size). This is advantageous from a memory perspective, since no task has to hold a whole file at once. It also means that the ordering of the lines is not guaranteed to be preserved; if the order must be preserved, use wholeTextFiles instead.
wholeTextFiles reads the complete content of a file at once; it won’t be partially spilled to disk or partially garbage collected. Each file is handled by one core, and the data for each file lives on a single machine, which makes it harder to distribute the load.