How to submit Spark jobs to EMR cluster from Airflow?

While it may not directly address your particular query, broadly, here are some ways you can trigger spark-submit on (remote) EMR via Airflow

  1. Use Apache Livy

    • This solution is actually independent of remote server, i.e., EMR
    • Here’s an example
    • The downside is that Livy is in early stages and its API appears incomplete and wonky to me
  2. Use EmrSteps API

    • Dependent on remote system: EMR
    • Robust, but since it is inherently async, you will also need an EmrStepSensor (alongside EmrAddStepsOperator)
    • On a single EMR cluster, you cannot have more than one steps running simultaneously (although some hacky workarounds exist)
  3. Use SSHHook / SSHOperator

    • Again independent of remote system
    • Comparatively easier to get started with
    • If your spark-submit command involves a lot of arguments, building that command (programmatically) can become cumbersome

EDIT-1

There seems to be another straightforward way

  1. Specifying remote master-IP


Useful links

Leave a Comment