Why do I have to explicitly tell Spark what to cache?

A subjective list of reasons:

  • in practice caching is rarely needed and is useful mostly for iterative algorithms, breaking long lineages. For example typical ETL pipelines may not require caching at all. Cache most of the RDDs is definitely not the right choice.
  • there is no universal caching strategy. Actual choice depends on available recourses like amount of memory, disks (local, remote, storage service), file system (in-memory, on-disk) and particular application.
  • on-disk persistence is expensive, in memory persistence puts more stress on a JVM and is using the most valuable resource in Spark
  • it is impossible to cache automatically without making assumptions about application semantics. In particular:

    • expected behavior when data source changes. There is no universal answer and in many situations it can be impossible to automatically track changes
    • differentiating between deterministic and non-deterministic transformations and choosing between caching and re-computing
  • comparing Spark caching to OS level caching doesn’t make sense. The main goal of OS caching is to reduce latency. In Spark latency is usually not the most important factor and caching is used for other purposes like consistency, correctness and reducing stress on different parts of the system.
  • if cache doesn’t use off-heap storage than caching introduces additional pressure on the garbage collector. GC cost can be actually higher than a cost of recomputing the data.
  • depending on the data and caching method reading data from cache can be significantly less efficient memory-wise.
  • Caching interferes with more advanced optimizations available in Spark SQL, effectively disabling partition pruning or predicate and projection pushdown.

It is also worth to note that:

  • removing cached data is handled automatically using LRU
  • some data (like intermediate shuffle data) is persisted automatically. I acknowledge that it makes some of the previous arguments at least partially invalid.
  • Spark caching doesn’t affect system level or JVM level mechanisms

Leave a Comment