What is a task in Spark? How does the Spark worker execute the jar file?

When you create the SparkContext, each worker starts an executor. An executor is a separate process (a JVM) that also loads your jar. The executors connect back to your driver program, so the driver can send them commands such as the flatMap, map, and reduceByKey in your example. When the driver quits, the executors shut down.
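For concreteness, here is a minimal sketch of the kind of driver program this describes (a word count using those transformations); the object name and the input/output paths are made-up placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Creating the SparkContext is what triggers each worker to launch an executor JVM.
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    // These transformations are turned into tasks and shipped to the executors.
    val counts = sc.textFile("hdfs:///input.txt")   // hypothetical input path
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///output")          // hypothetical output path

    // When the driver stops, the executors shut down.
    sc.stop()
  }
}
```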

RDDs are sort of like big arrays that are split into partitions, and each executor can hold some of these partitions.
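As a small illustration (assuming a SparkContext named `sc` is already available), you can control and inspect the partitioning directly:

```scala
// Create an RDD explicitly split into 4 partitions.
val rdd = sc.parallelize(1 to 1000, numSlices = 4)

// Each partition is processed as a unit on whichever executor holds it.
println(rdd.getNumPartitions)  // 4

// Peek at how the elements are distributed across the partitions.
rdd.mapPartitionsWithIndex { (idx, it) =>
  Iterator((idx, it.size))
}.collect().foreach(println)
```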

A task is a command sent from the driver to an executor by serializing your Function object. The executor deserializes the command (which is possible because it has loaded your jar) and executes it on a partition.
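A sketch of that idea (reusing the `rdd` from the snippet above): the function you pass to a transformation is serialized on the driver and deserialized on the executors, so anything it captures has to be serializable.

```scala
// The lambda below is serialized by the driver, shipped to the executors,
// deserialized there (using classes from your jar), and applied to one
// partition per task.
val doubled = rdd.map(x => x * 2)

// If the function captures other objects, they travel with it and must be
// serializable, otherwise Spark fails with a "Task not serializable" error.
class Multiplier(val factor: Int) extends Serializable {
  def apply(x: Int): Int = x * factor
}
val m = new Multiplier(3)
val tripled = rdd.map(m.apply)   // `m` is shipped to the executors as part of the task
```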

(This is a conceptual overview. I am glossing over some details, but I hope it is helpful.)


To answer your specific question: No, a new process is not started for each step. A new process is started on each worker when the SparkContext is constructed.
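For example (with a hypothetical path), running several jobs on the same SparkContext reuses the same executor processes:

```scala
// Both actions below are broken into tasks that run on the executor JVMs
// launched when `sc` was created; no new process is started per step.
val lines = sc.textFile("hdfs:///input.txt")   // hypothetical path
println(lines.count())                          // first job
println(lines.filter(_.nonEmpty).count())       // second job, same executors
```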
