ImportError: No module named numpy on Spark workers

To use Spark in YARN client mode, you’ll need to install any dependencies on the machines where YARN starts the executors. That’s the only surefire way to make this work.
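Before installing anything, it can help to confirm which machines are actually missing the module. The following is a minimal sketch, assuming Python 3 on the workers; `probe_module` and `probe_partition` are hypothetical helper names, not part of Spark.

```python
import importlib.util
import socket


def probe_module(module_name):
    """Return (hostname, found) for a top-level module on this machine."""
    # find_spec returns None when a top-level module cannot be located.
    found = importlib.util.find_spec(module_name) is not None
    return socket.gethostname(), found


def probe_partition(_rows, module_name="numpy"):
    """mapPartitions-style wrapper: ignore the rows, report on this worker."""
    yield probe_module(module_name)


# Local check on the driver machine:
#   print(probe_module("numpy"))
#
# On a cluster, assuming an existing SparkContext named `sc`:
#   sc.parallelize(range(100), 20).mapPartitions(probe_partition).distinct().collect()
```

Any host reported with `False` is one where the executors would raise the ImportError from the title.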

Using Spark in YARN cluster mode is a different story. You can distribute Python dependencies with spark-submit’s --py-files option.

spark-submit --master yarn-cluster --py-files my_dependency.zip my_script.py
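If my_dependency is a pure-Python package, one way to build the archive is sketched below. The helper name `zip_package` is hypothetical. The sketch assumes the package folder should sit at the root of the zip so that `import my_dependency` resolves on the workers, and it includes only `.py` files.

```python
import os
import zipfile


def zip_package(package_dir, zip_path):
    """Zip the .py files of a package, keeping the package folder at the archive root."""
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".py"):
                    full_path = os.path.join(root, name)
                    # Store paths relative to the parent, e.g. my_dependency/__init__.py
                    archive.write(full_path, os.path.relpath(full_path, parent))
    return zip_path


# Example (paths are placeholders):
#   zip_package("my_dependency", "my_dependency.zip")
```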

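However, the situation with numpy is complicated by the same thing that makes it so fast: the fact that it does the heavy lifting in C. Its compiled extension modules are built for a specific platform and Python version, and Python cannot import compiled extensions from a zip archive, so you won’t be able to distribute numpy in this fashion.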
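To judge whether some other dependency has the same limitation, you can look for compiled files in its installed folder. This is a rough sketch: `find_compiled_files` is a hypothetical helper, and the suffix list is an assumption covering the common shared-library extensions rather than an exhaustive rule.

```python
import os

# Common suffixes for compiled extension modules and shared libraries (assumed list).
COMPILED_SUFFIXES = (".so", ".pyd", ".dylib", ".dll")


def find_compiled_files(package_dir):
    """Return a sorted list of compiled files found under package_dir."""
    compiled = []
    for root, _dirs, files in os.walk(package_dir):
        for name in files:
            if name.endswith(COMPILED_SUFFIXES):
                compiled.append(os.path.join(root, name))
    return sorted(compiled)


# Example, assuming numpy is installed on the machine running this:
#   import numpy
#   print(find_compiled_files(os.path.dirname(numpy.__file__)))
```

An empty result suggests the package is pure Python and a reasonable candidate for --py-files. A non-empty result means it most likely needs to be installed on the worker machines instead.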
