It would seem that Option B is required. The reason lies in how `persist`/`cache` and `unpersist` are executed by Spark. Since RDD transformations merely build DAG descriptions without executing anything, in Option A, by the time you call `unpersist`, you still only have job descriptions and no running execution.
This is relevant because a `cache` or `persist` call just adds the RDD to a Map of RDDs that marked themselves to be persisted during job execution. However, `unpersist` directly tells the `BlockManager` to evict the RDD from storage and removes the reference from the Map of persistent RDDs.
So you would need to call `unpersist` only after Spark has actually executed the job and stored the RDD with the block manager.
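The asymmetry described above can be sketched with a toy model. This is not Spark's real API: the classes and names below (`ToySparkContext`, `persistent_rdds`, `block_manager`) are hypothetical stand-ins that only mimic the lazy-mark/eager-evict behavior, to show why the ordering in Option B matters.

```python
class ToyBlockManager:
    """Stand-in for Spark's BlockManager: tracks which RDDs are stored."""
    def __init__(self):
        self.stored = set()

    def store(self, rdd_id):
        self.stored.add(rdd_id)

    def evict(self, rdd_id):
        self.stored.discard(rdd_id)


class ToySparkContext:
    """Hypothetical sketch of the bookkeeping, not Spark's actual classes."""
    def __init__(self):
        self.persistent_rdds = {}        # RDDs marked for persistence (lazy)
        self.block_manager = ToyBlockManager()

    def persist(self, rdd_id):
        # Lazy: only records the intent; nothing is stored yet.
        self.persistent_rdds[rdd_id] = True

    def unpersist(self, rdd_id):
        # Eager: evicts from storage immediately and drops the mark.
        self.persistent_rdds.pop(rdd_id, None)
        self.block_manager.evict(rdd_id)

    def run_job(self):
        # Only during an actual job are marked RDDs materialized and stored.
        for rdd_id in self.persistent_rdds:
            self.block_manager.store(rdd_id)


sc = ToySparkContext()

# Option A: persist -> unpersist -> action. The mark is gone before the
# job runs, so the RDD is never cached at all.
sc.persist("rdd_a")
sc.unpersist("rdd_a")
sc.run_job()
print("rdd_a" in sc.block_manager.stored)  # False: nothing was ever stored

# Option B: persist -> action -> unpersist. The cache is populated during
# the job, used, and then explicitly freed.
sc.persist("rdd_b")
sc.run_job()
print("rdd_b" in sc.block_manager.stored)  # True: stored during execution
sc.unpersist("rdd_b")
print("rdd_b" in sc.block_manager.stored)  # False: evicted after use
```

Under this model, Option A's `unpersist` is a no-op in practice because it runs before anything has been stored, which is exactly why the eviction must come after the action.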
The comments for the `RDD.persist` method hint towards this: its Scaladoc says the storage level is set to persist the RDD's values across operations "after the first time it is computed".

`rdd.persist`