How to add a SparkListener from pySpark in Python?

It is possible, although it is a bit involved: we can use the Py4j callback mechanism to pass messages from a SparkListener back to Python. First, let's create a Scala package with all the required classes. Directory structure:

```
.
├── build.sbt
└── src
    └── main
        └── scala
            └── net
                └── zero323
                    └── spark
                        └── examples
                            └── listener
                                ├── Listener.scala
                                ├── …
```
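The Python side of that callback pattern can be sketched as follows. Py4j lets a Python object implement a Java interface by declaring it in an inner `Java` class; the JVM can then invoke the Python methods through the Py4j callback server. Note that the interface name and the `onJobEnd` method below are illustrative assumptions, since the Scala code in the linked answer is elided here; they must match whatever interface your Scala listener actually forwards events to.

```python
class PythonJobListener:
    """Receives callbacks from the JVM via the Py4j callback server.

    Assumption: the Scala side defines an interface with an onJobEnd
    method and forwards SparkListener events to it; the fully qualified
    name below is a placeholder, not the exact class from the answer.
    """

    def onJobEnd(self, job_end):
        # job_end is a proxy for a Java object; attribute access on it
        # is translated into JVM calls by Py4j.
        print("Job finished:", job_end)

    class Java:
        # Py4j reads this list to register the Python object as an
        # implementation of the given Java interface(s).
        implements = ["net.zero323.spark.examples.listener.PythonListener"]
```

An instance of this class would be passed to the JVM (for example via `sc._gateway`) once the callback server is started; the Scala listener then calls `onJobEnd` as if it were a Java object.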

Running custom Java class in PySpark

In PySpark, try the following:

```python
from py4j.java_gateway import java_import

java_import(sc._gateway.jvm, "org.foo.module.Foo")
func = sc._gateway.jvm.Foo()
func.fooMethod()
```

Make sure that you have compiled your Java code into a runnable jar and submit the Spark job like so:

```shell
spark-submit --driver-class-path "name_of_your_jar_file.jar" \
    --jars "name_of_your_jar_file.jar" name_of_your_python_file.py
```

Why can’t PySpark find py4j.java_gateway?

In my environment (using Docker and the image sequenceiq/spark:1.1.0-ubuntu), I ran into this. If you look at the pyspark shell script, you'll see that you need a few things added to your PYTHONPATH:

```shell
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
```

That worked in IPython for me. Update: as noted in the comments, the name of the py4j …
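Since the py4j zip is versioned, hardcoding its name breaks whenever Spark is upgraded. A small sketch that globs for it at runtime instead, assuming the standard Spark layout under `$SPARK_HOME` (the fallback path is only a placeholder):

```python
import glob
import os

# Assumption: SPARK_HOME points at a normal Spark install; the
# fallback value here is just a placeholder for illustration.
spark_home = os.environ.get("SPARK_HOME", "/usr/local/spark")

# The py4j zip name changes between Spark releases
# (py4j-0.8.2.1-src.zip, py4j-0.10.9-src.zip, ...), so glob
# for it rather than hardcoding the version.
py4j_zips = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))

# Prepend the Spark python dir and any matching py4j zip to PYTHONPATH.
paths = [os.path.join(spark_home, "python")] + py4j_zips
existing = os.environ.get("PYTHONPATH", "")
os.environ["PYTHONPATH"] = os.pathsep.join(paths + ([existing] if existing else []))
```

The findspark package (`import findspark; findspark.init()`) performs essentially this same lookup automatically, which avoids maintaining these exports by hand.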