Any performance issues with forcing eager evaluation using count in Spark?

TL;DR 1) and 2) can usually be avoided, but shouldn't harm you (ignoring the cost of evaluation); 3) is typically a harmful cargo cult programming practice.

Without cache:

Calling count alone is mostly wasteful. While not always straightforward, logging can be replaced with information retrieved from listeners (here is an example for RDDs), and control flow requirements can usually (though not always) be handled with better pipeline design.
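For the logging case, a minimal sketch of the listener approach (assuming a spark-shell session where `spark` is in scope; the listener class name and the commented-out path are made up for illustration). It collects input-record metrics from tasks Spark runs anyway, instead of paying for a separate count job:

```scala
import java.util.concurrent.atomic.LongAdder
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Accumulates input records read by completed tasks, so row counts
// can be logged as a side effect of an action you already need.
class RecordsReadListener extends SparkListener {
  val recordsRead = new LongAdder

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // taskMetrics can be null if the task failed before reporting.
    Option(taskEnd.taskMetrics).foreach { m =>
      recordsRead.add(m.inputMetrics.recordsRead)
    }
  }
}

val listener = new RecordsReadListener()
spark.sparkContext.addSparkListener(listener)

// Run the action you need anyway, e.g.:
// df.write.parquet("/tmp/out")  // hypothetical path
// println(s"Input records read: ${listener.recordsRead.sum}")
```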

On its own, count won't have any impact on the execution plan (the execution plan for count is normally different from the execution plan of its parent anyway; in general, Spark does as little work as possible, so it will remove the parts of the execution plan that are not required to compute count).
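You can see this pruning directly. A toy sketch (assuming a spark-shell session; the DataFrame and column names are made up). Dataset.count() is internally a global aggregation, so comparing the plans shows that the projection exists in the parent's plan but is dropped for the count:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.upper

val df = spark.range(1000)
  .selectExpr("cast(id as string) as s")
  .select(upper($"s").as("u"))

// Counting rows does not require column values, so the optimizer
// can prune the upper() projection from this plan:
df.groupBy().count().explain(true)

// ...while the parent's own plan still computes it:
df.explain(true)
```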

With cache:

count with cache is a bad practice naively copied from patterns used with the RDD API. It is already disputable with RDDs, but with the DataFrame API it can break a lot of internal optimizations (selection and predicate pushdown) and, technically speaking, it is not even guaranteed to work.
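A sketch of the failure mode (assuming a spark-shell session; the Parquet path and the `id` column are hypothetical). Once the plan is cached, later queries scan the in-memory relation, so filters and column selection are no longer pushed down to the source reader, which they would be for the uncached plan:

```scala
import spark.implicits._

val df = spark.read.parquet("/tmp/data")  // hypothetical path

df.cache()
df.count()  // materializes every column of every row into the cache

// Scans the in-memory relation: the predicate and the column
// selection are not pushed down to the Parquet reader anymore.
df.where($"id" > 100).select("id").explain(true)
```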
