Why is Spark faster than Hadoop Map Reduce

bafna’s answer provides the memory-side of the story, but I want to add other two important facts:DAG and ecosystem

  1. Spark uses “lazy evaluation” to form a directed acyclic graph (DAG) of consecutive computation stages. In this way, the execution plan can be optimized, e.g. to minimize shuffling data around. In contrast, this should be done manually in MapReduce by tuning each MR step. (It would be easier to understand this point if you are familiar with the execution plan optimization in RDBMS or the DAG-style execution of Apache Tez)
  2. Spark ecosystem has established a versatile stack of components to handle SQL, ML, Streaming, Graph Mining tasks. But in the hadoop ecosystem you have to install other packages to do these individual things.

And I want to add that, even if your data is too big for main memory, you can still use spark by choosing to persist you data on disk. Although by doing this you give up the advantages of in-memory processing, you can still benefit from the DAG execution optimization.

Some informative answers on Quora:
here and here.

Leave a Comment