Spark: disk I/O at stage boundaries, explained

It’s a good question: we hear so much about in-memory Spark vs. Hadoop that this is a little confusing. The docs are terrible on this point, but I ran a few things and verified my observations against a most excellent source: http://hydronitrogen.com/apache-spark-shuffles-explained-in-depth.html

Assuming an Action has been called (stating this up front so as to avoid the obvious comment), and assuming we are not talking about a ResultStage and a broadcast join, then we are talking about a ShuffleMapStage. Let’s look at an RDD initially.
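
To make that concrete, here is a minimal sketch, assuming a SparkContext named sc is available (e.g. in spark-shell); the data and names are placeholders I chose for illustration:

    // Minimal sketch, assuming a SparkContext `sc` (e.g. spark-shell).
    // reduceByKey introduces a shuffle dependency, so when the Action (collect)
    // is called the job is split into a ShuffleMapStage and a ResultStage.
    val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
    val pairs  = words.map(w => (w, 1))     // narrow dependency: stays in the same Stage
    val counts = pairs.reduceByKey(_ + _)   // wide dependency: forces a Stage boundary
    counts.collect()                        // the Action that actually triggers the job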

Then, borrowing from that URL:

  • A DAG dependency involving a shuffle means the creation of a separate Stage (see the sketch after this list).
  • Map operations are followed by Reduce operations, then another set of Map operations, and so forth.
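
You can see where the scheduler makes that cut with toDebugString; a quick sketch, reusing the counts RDD from the example above:

    // Sketch: print the RDD lineage. The indentation step at ShuffledRDD marks the
    // shuffle dependency, i.e. where the DAG scheduler starts a new Stage.
    println(counts.toDebugString)
    // Typical shape (exact ids and numbers vary by version):
    //   (N) ShuffledRDD[...] at reduceByKey ...
    //    +-(N) MapPartitionsRDD[...] at map ...
    //       |  ParallelCollectionRDD[...] at parallelize ...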

CURRENT STAGE

  • All the (fused) Map operations are performed intra-Stage.
  • The next Stage’s requirement of a Reduce operation (e.g. a reduceByKey) means the output is hashed or sorted by key (K) at the end of the Map operations of the current Stage.
  • This grouped data is written to disk on the Worker where the Executor runs, or to the storage tied to that cloud variant (see the sketch after this list). (I would have thought in-memory was possible if the data is small, but this is an architectural approach in Spark, as stated in the docs.)
  • The ShuffleManager is notified that the hashed, mapped data is available for consumption by the next Stage. The ShuffleManager keeps track of all keys/locations once all of the map-side work is done.
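
Below is a rough sketch of that map-side write, pinning spark.local.dir so the shuffle files are easy to find on the Worker; the path, app name and data are placeholders of mine, not anything Spark requires:

    // Sketch only: pin the scratch directory so the map-side shuffle output is easy
    // to inspect. "/tmp/spark-scratch" and the app name are placeholder values.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("shuffle-write-demo")
      .master("local[2]")
      .config("spark.local.dir", "/tmp/spark-scratch")
      .getOrCreate()

    val grouped = spark.sparkContext
      .parallelize(Seq("a", "b", "a"))
      .map(w => (w, 1))      // fused Map work, all intra-Stage
      .reduceByKey(_ + _)    // end of the Stage: output is partitioned by key and written out
    grouped.collect()
    // After the Action runs, /tmp/spark-scratch/blockmgr-*/ should (roughly) hold the
    // shuffle_*.data and shuffle_*.index files that the next Stage will fetch.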

NEXT STAGE

  • The next Stage, being a Reduce, then gets the data from those locations by consulting the ShuffleManager and using the BlockManager.
  • The Executor may be re-used, or may be a new one on another Worker, or another Executor on the same Worker (see the config sketch after this list).
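
That last point is helped by the external shuffle service: the map output files outlive the Executor that wrote them, so a different (or new) Executor can still fetch the blocks. A sketch of the standard settings, assuming a cluster where the shuffle service is actually running on each Worker (e.g. YARN with the aux-service configured); the values are illustrative only:

    // Sketch: standard config keys, illustrative values. Requires the external
    // shuffle service to be running on the Workers; do not expect this to work
    // in plain local mode.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("shuffle-fetch-demo")
      .config("spark.shuffle.service.enabled", "true")
      .config("spark.dynamicAllocation.enabled", "true")
      .getOrCreate()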

So, my understanding is that, architecturally, Stages mean writing to disk, even if there is enough memory. Given the finite resources of a Worker, it makes sense that writing to disk occurs for this type of operation. The more important point is, of course, the ‘Map Reduce’-style implementation. I have summarized the excellent posting above; that is your canonical source.

Of course, fault tolerance is aided by this persistence: if an Executor is lost, less re-computation work is needed, since the shuffle output is already on disk.

Similar aspects apply to DataFrames (DFs); see the sketch below.
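
For a DataFrame the same Stage boundary shows up as an Exchange node in the physical plan. A quick sketch, assuming an existing SparkSession named spark (e.g. spark-shell):

    // Sketch: the groupBy forces the same kind of shuffle; explain() shows it
    // as an Exchange (hashpartitioning) node, i.e. a Stage boundary.
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("k", "v")
    df.groupBy("k").sum("v").explain()
    // The physical plan contains something like (exact form varies by version):
    //   HashAggregate(...)
    //   +- Exchange hashpartitioning(k#..., 200)
    //      +- HashAggregate(...)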
