Spark DataFrames support predicate push-down with JDBC sources but term predicate is used in a strict SQL meaning. It means it covers only WHERE
clause. Moreover it looks like it is limited to the logical conjunction (no IN
and OR
I am afraid) and simple predicates.
Everything else, like limits, counts, ordering, groups and conditions is processed on the Spark side. One caveat, already covered on SO, is that df.count()
or sqlContext.sql("SELECT COUNT(*) FROM df")
is translated to SELECT 1 FROM df
and requires both substantial data transfer and processing using Spark.
Does it mean it is a lost cause? Not exactly. It is possible to use an arbitrary subquery as a table
argument. It is less convenient than a predicate pushdown but otherwise works pretty well:
n = ... # Number of rows to take
sql = "(SELECT * FROM dummy LIMIT {0}) AS tmp".format(int(n))
df = sqlContext.read.jdbc(url=url, table=sql)
Note:
This behavior may be improved in the future, once Data Source API v2 is ready: