I can’t seem to get --py-files on Spark to work

First off, I’ll assume that your dependencies are listed in requirements.txt. To package and zip the dependencies, run the following at the command line:

pip install -t dependencies -r requirements.txt
cd dependencies
zip -r ../dependencies.zip .

Above, the cd dependencies command is crucial: it ensures that the modules end up at the top level of the zip file. Thanks to Dan Corin’s post for the heads-up.
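
A quick way to sanity-check the layout is to list the archive’s top-level entries. This is just a minimal sketch using Python’s standard zipfile module, and it assumes the file is named dependencies.zip as above:

import zipfile

# Package directories (e.g. requests/) should appear at the top level,
# not nested under a dependencies/ prefix.
with zipfile.ZipFile("dependencies.zip") as zf:
    print(sorted({name.split("/")[0] for name in zf.namelist()}))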

Next, submit the job via:

spark-submit --py-files dependencies.zip spark_job.py

The --py-files directive sends the zip file to the Spark workers but does not add it to the PYTHONPATH (a source of confusion for me). To put the dependencies on the PYTHONPATH and fix the ImportError, add the following line to the Spark job, spark_job.py:

sc.addPyFile("dependencies.zip")
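
Putting it together, a minimal spark_job.py might look like the sketch below. The requests import is only a placeholder for whatever module you actually packaged into dependencies.zip; swap in your own:

from pyspark import SparkContext

sc = SparkContext(appName="spark_job")

# Ship the zipped dependencies and make them importable on the workers
# (and on the driver).
sc.addPyFile("dependencies.zip")

# Import packaged modules only after addPyFile, so the zip is already
# on the path. "requests" stands in for a module from dependencies.zip.
import requests

def status(url):
    return requests.get(url).status_code

print(sc.parallelize(["https://example.com"]).map(status).collect())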

A caveat from this Cloudera post:

An assumption that anyone doing distributed computing with commodity
hardware must assume is that the underlying hardware is potentially
heterogeneous. A Python egg built on a client machine will be specific
to the client’s CPU architecture because of the required C
compilation. Distributing an egg for a complex, compiled package like
NumPy, SciPy, or pandas is a brittle solution that is likely to fail
on most clusters, at least eventually.

Although the solution above does not build an egg, the same guideline applies.
