First off, I’ll assume that your dependencies are listed in requirements.txt. To package and zip the dependencies, run the following at the command line:
pip install -t dependencies -r requirements.txt
cd dependencies
zip -r ../dependencies.zip .
Above, the cd dependencies command is crucial: it ensures that the modules end up at the top level of the zip file. Thanks to Dan Corin’s post for the heads up.
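You can sanity-check the layout with Python’s zipfile module: module files should appear at the root of the archive, not under a dependencies/ folder. A minimal sketch (the directory and module names here are illustrative, not from the original post):

```python
import os
import zipfile

# Build a toy "dependencies" directory with one module in it.
os.makedirs("dependencies", exist_ok=True)
with open("dependencies/mymodule.py", "w") as f:
    f.write("VALUE = 42\n")

# Zip the *contents* of the directory, mirroring
# `cd dependencies && zip -r ../dependencies.zip .`
with zipfile.ZipFile("dependencies.zip", "w") as zf:
    for root, _, files in os.walk("dependencies"):
        for name in files:
            path = os.path.join(root, name)
            # arcname drops the leading "dependencies/" prefix,
            # so the module lands at the top level of the zip.
            arcname = os.path.relpath(path, "dependencies")
            zf.write(path, arcname)

# Verify: the module sits at the zip root, with no "dependencies/" prefix.
with zipfile.ZipFile("dependencies.zip") as zf:
    print(zf.namelist())  # ['mymodule.py']
```

If you skip the cd step, every entry would instead be prefixed with dependencies/, and the modules would not be importable on the workers.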
Next, submit the job via:
spark-submit --py-files dependencies.zip spark_job.py
The --py-files directive sends the zip file to the Spark workers but does not add it to the PYTHONPATH (a source of confusion for me). To add the dependencies to the PYTHONPATH and fix the ImportError, add the following line to the Spark job, spark_job.py:
sc.addPyFile("dependencies.zip")
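This works because Python can import modules directly from a zip archive once the archive is on sys.path, which is effectively what addPyFile arranges on each worker. A quick local sketch of that mechanism, without Spark (file and function names are illustrative):

```python
import sys
import zipfile

# Create a zip with a module at its top level, matching the layout
# the packaging steps above produce.
with zipfile.ZipFile("deps_demo.zip", "w") as zf:
    zf.writestr("helper.py", "def greet():\n    return 'hello from the zip'\n")

# Putting the zip on sys.path makes its top-level modules importable --
# this mirrors the effect sc.addPyFile() has on the Spark workers.
sys.path.insert(0, "deps_demo.zip")
import helper

print(helper.greet())  # hello from the zip
```

This also shows why the modules must sit at the top level of the zip: Python’s zip import machinery resolves import helper to helper.py at the archive root.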
A caveat from this Cloudera post:
An assumption that anyone doing distributed computing with commodity
hardware must assume is that the underlying hardware is potentially
heterogeneous. A Python egg built on a client machine will be specific
to the client’s CPU architecture because of the required C
compilation. Distributing an egg for a complex, compiled package like
NumPy, SciPy, or pandas is a brittle solution that is likely to fail
on most clusters, at least eventually.
Although the solution above does not build an egg, the same guideline applies.