The main difference, as you mentioned, is that textFile returns an RDD with each line as an element, while wholeTextFiles returns a PairRDD with the key being the file path and the value being the file content. If there is no need to separate the data depending on the file, simply use textFile.
When reading uncompressed files with textFile, the data is split into chunks of 32MB (the exact split size depends on the file system's block size). This is advantageous from a memory perspective. It also means that the ordering of the lines is lost; if the order should be preserved, wholeTextFiles should be used.
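To see the difference in practice, here is a minimal PySpark sketch that loads the same path with both methods and reports what comes back. The helper names describe_reads and run_local_demo, and the path passed to it, are made up for illustration; only textFile, wholeTextFiles, count, getNumPartitions, keys and collect are actual Spark calls.

```python
def describe_reads(sc, path):
    """Load `path` with both methods and summarise what each one returns.

    `sc` is expected to be a pyspark.SparkContext.
    """
    lines = sc.textFile(path)        # RDD of str: one element per line
    files = sc.wholeTextFiles(path)  # RDD of (file path, whole file content)
    return {
        "line_count": lines.count(),
        "line_partitions": lines.getNumPartitions(),
        "file_count": files.count(),
        "file_paths": sorted(files.keys().collect()),
    }


def run_local_demo(path):
    """Hypothetical driver: needs a local PySpark installation."""
    from pyspark import SparkContext  # imported lazily on purpose

    sc = SparkContext("local[*]", "textfile-vs-wholetextfiles")
    try:
        return describe_reads(sc, path)
    finally:
        sc.stop()
```

Calling run_local_demo on a directory with a few large files should show more partitions than files for textFile, while wholeTextFiles yields exactly one element per file.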
wholeTextFiles reads the complete content of a file at once; it won’t be partially spilled to disk or partially garbage collected. Each file is handled by one core, and the data for each file resides on a single machine, making it harder to distribute the load.
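If the reason for choosing wholeTextFiles is line order, one possible approach is to split each file's content yourself and attach a line number, so the position survives later shuffles. This is only a sketch: numbered_lines and lines_with_positions are hypothetical helper names, and sc is assumed to be an existing SparkContext.

```python
def numbered_lines(path_and_content):
    """Turn one (path, content) pair into (path, line_number, line) records."""
    path, content = path_and_content
    # splitlines() keeps the lines in file order and drops the line endings
    return [(path, i, line) for i, line in enumerate(content.splitlines())]


def lines_with_positions(sc, path):
    """Read whole files, then emit one record per line with its position."""
    return sc.wholeTextFiles(path).flatMap(numbered_lines)
```

Each file is still read in one piece by a single task, so the memory caveat above applies, but after the flatMap the records can be repartitioned freely without losing track of where each line came from.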