Spark: disk I/O at stage boundaries, explained

It’s a good question: we hear so much about in-memory Spark vs. Hadoop that this is a little confusing. The docs are terrible on this point, but I ran a few things and verified my observations against a most excellent source: http://hydronitrogen.com/apache-spark-shuffles-explained-in-depth.html

Assuming an Action has been called (stating this up front so as to avoid the obvious comment), and assuming we are not talking about a ResultStage and a broadcast join, then we are talking about a ShuffleMapStage. Let’s look at an RDD initially.
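
To make that concrete, here is a minimal sketch, assuming a SparkContext named sc is available (e.g. in spark-shell); the data and names are placeholders I chose for illustration:

    // Minimal sketch, assuming a SparkContext `sc` (e.g. spark-shell).
    // reduceByKey introduces a shuffle dependency, so when the Action (collect)
    // is called the job is split into a ShuffleMapStage and a ResultStage.
    val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
    val pairs  = words.map(w => (w, 1))     // narrow dependency: stays in the same Stage
    val counts = pairs.reduceByKey(_ + _)   // wide dependency: forces a Stage boundary
    counts.collect()                        // the Action that actually triggers the job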

Then, borrowing from that URL:

  • A DAG dependency involving a shuffle means the creation of a separate Stage (see the sketch after this list).
  • Map operations are followed by Reduce operations, then another set of Map operations, and so forth.
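
You can see where the scheduler makes that cut with toDebugString; a quick sketch, reusing the counts RDD from the example above:

    // Sketch: print the RDD lineage. The indentation step at ShuffledRDD marks the
    // shuffle dependency, i.e. where the DAG scheduler starts a new Stage.
    println(counts.toDebugString)
    // Typical shape (exact ids and numbers vary by version):
    //   (N) ShuffledRDD[...] at reduceByKey ...
    //    +-(N) MapPartitionsRDD[...] at map ...
    //       |  ParallelCollectionRDD[...] at parallelize ...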

CURRENT STAGE

  • All the (fused) Map operations are performed intra-Stage.
  • The next Stage’s requirement of a Reduce operation (e.g. a reduceByKey) means the output is hashed or sorted by key (K) at the end of the Map operations of the current Stage.
  • This grouped data is written to disk on the Worker where the Executor runs, or to the storage tied to that cloud variant (see the sketch after this list). (I would have thought in-memory was possible if the data is small, but this is an architectural approach in Spark, as stated in the docs.)
  • The ShuffleManager is notified that the hashed, mapped data is available for consumption by the next Stage. The ShuffleManager keeps track of all keys/locations once all of the map-side work is done.
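
Below is a rough sketch of that map-side write, pinning spark.local.dir so the shuffle files are easy to find on the Worker; the path, app name and data are placeholders of mine, not anything Spark requires:

    // Sketch only: pin the scratch directory so the map-side shuffle output is easy
    // to inspect. "/tmp/spark-scratch" and the app name are placeholder values.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("shuffle-write-demo")
      .master("local[2]")
      .config("spark.local.dir", "/tmp/spark-scratch")
      .getOrCreate()

    val grouped = spark.sparkContext
      .parallelize(Seq("a", "b", "a"))
      .map(w => (w, 1))      // fused Map work, all intra-Stage
      .reduceByKey(_ + _)    // end of the Stage: output is partitioned by key and written out
    grouped.collect()
    // After the Action runs, /tmp/spark-scratch/blockmgr-*/ should (roughly) hold the
    // shuffle_*.data and shuffle_*.index files that the next Stage will fetch.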

NEXT STAGE

  • The next Stage, being a Reduce, then gets the data from those locations by consulting the ShuffleManager and using the BlockManager.
  • The Executor may be re-used, or may be a new one on another Worker, or another Executor on the same Worker (see the config sketch after this list).
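
That last point is helped by the external shuffle service: the map output files outlive the Executor that wrote them, so a different (or new) Executor can still fetch the blocks. A sketch of the standard settings, assuming a cluster where the shuffle service is actually running on each Worker (e.g. YARN with the aux-service configured); the values are illustrative only:

    // Sketch: standard config keys, illustrative values. Requires the external
    // shuffle service to be running on the Workers; do not expect this to work
    // in plain local mode.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("shuffle-fetch-demo")
      .config("spark.shuffle.service.enabled", "true")
      .config("spark.dynamicAllocation.enabled", "true")
      .getOrCreate()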

So, my understanding is that, architecturally, Stages mean writing to disk, even if there is enough memory. Given the finite resources of a Worker, it makes sense that writing to disk occurs for this type of operation. The more important point is, of course, the ‘Map Reduce’-style implementation. I have summarized the excellent posting above; that is your canonical source.

Of course, fault tolerance is aided by this persistence: if an Executor is lost, less re-computation work is needed, since the shuffle output is already on disk.

Similar aspects apply to DataFrames (DFs); see the sketch below.
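
For a DataFrame the same Stage boundary shows up as an Exchange node in the physical plan. A quick sketch, assuming an existing SparkSession named spark (e.g. spark-shell):

    // Sketch: the groupBy forces the same kind of shuffle; explain() shows it
    // as an Exchange (hashpartitioning) node, i.e. a Stage boundary.
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("k", "v")
    df.groupBy("k").sum("v").explain()
    // The physical plan contains something like (exact form varies by version):
    //   HashAggregate(...)
    //   +- Exchange hashpartitioning(k#..., 200)
    //      +- HashAggregate(...)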
