How to connect HBase and Spark using Python?

I found this comment by one of the makers of hbase-spark, which seems to suggest there is a way to use PySpark to query HBase using Spark SQL.

And indeed, the pattern described here can be applied to query HBase with Spark SQL using PySpark, as the following example shows:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

data_source_format = "org.apache.hadoop.hbase.spark"

df = sc.parallelize([('a', '1.0'), ('b', '2.0')]).toDF(schema=['col0', 'col1'])

# ''.join(string.split()) strips all whitespace, so we can write the JSON
# catalog as a readable multi-line string (safe here, since none of the
# values contain spaces).
catalog = ''.join("""{
    "table":{"namespace":"default", "name":"testtable"},
    "rowkey":"key",
    "columns":{
        "col0":{"cf":"rowkey", "col":"key", "type":"string"},
        "col1":{"cf":"cf", "col":"col1", "type":"string"}
    }
}""".split())


# Writing (alternatively: .option('catalog', catalog))
df.write\
.options(catalog=catalog)\
.format(data_source_format)\
.save()

# Reading
df = sqlc.read\
.options(catalog=catalog)\
.format(data_source_format)\
.load()
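
To check that this really gives you Spark SQL over HBase, you can register the loaded DataFrame as a temporary table and query it. A minimal sketch (the temp table name testtable_df is arbitrary, my choice for illustration):

# Query the DataFrame read from HBase with plain Spark SQL.
df.registerTempTable('testtable_df')
sqlc.sql("SELECT col0, col1 FROM testtable_df WHERE col1 = '2.0'").show()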

I tried hbase-spark-1.2.0-cdh5.7.0.jar (as distributed by Cloudera) for this, but ran into trouble: org.apache.hadoop.hbase.spark.DefaultSource does not allow create table as select when writing, and java.util.NoSuchElementException: None.get when reading. As it turns out, this version of CDH does not include the changes to hbase-spark that enable the Spark SQL-HBase integration.

What does work for me is the shc Spark package, found here. The only change I had to make to the above script was:

data_source_format = "org.apache.spark.sql.execution.datasources.hbase"

Here’s how I submit the above script on my CDH cluster, following the example from the shc README:

spark-submit --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /opt/cloudera/parcels/CDH/lib/hbase/conf/hbase-site.xml example.py

Most of the work on shc seems to have been merged into the hbase-spark module of HBase already, slated for release in version 2.0. With that, Spark SQL querying of HBase becomes possible using the pattern above (see https://hbase.apache.org/book.html#_sparksql_dataframes for details). My example shows what it looks like for PySpark users.

Finally, a caveat: my example data above contains only strings. shc does not handle Python data type conversion, so when I tried integers and floats they either did not show up in HBase at all or showed up as weird values.
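
A workaround I can suggest (plain Spark, nothing shc-specific; df_strings is just an illustrative name): cast every column to a string before writing, and declare all types as "string" in the catalog, so shc never has to convert Python numbers:

from pyspark.sql.functions import col

# Cast all columns to string so shc only ever sees string values;
# the catalog must declare matching "string" types.
df_strings = df.select([col(c).cast('string').alias(c) for c in df.columns])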
