This question is answered directly in the FAQ section of the Spark website (https://spark.apache.org/faq.html):
- What happens if my dataset does not fit in memory?
Often each partition of data is small and does fit in memory, and these partitions are processed a few at a time. For very large partitions that do not fit in memory, Spark’s built-in operators perform external operations on datasets.
- What happens when a cached dataset does not fit in memory?
Spark can either spill it to disk or recompute the partitions that don’t fit in RAM each time they are requested. By default, it uses recomputation, but you can set a dataset’s storage level to MEMORY_AND_DISK to avoid this.
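In code, that FAQ answer corresponds to choosing a storage level when you persist. A minimal Scala sketch, assuming an existing `SparkContext` named `sc` and a hypothetical input path (this will not run standalone without a Spark cluster or local Spark install):

```scala
import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
// partitions that don't fit in RAM are dropped and recomputed on demand.
val recomputed = sc.textFile("hdfs:///logs/access.log").cache()

// MEMORY_AND_DISK spills partitions that don't fit to local disk instead
// of recomputing them. Note: a storage level must be set before the RDD
// is first materialized and cannot be changed afterwards.
val spilled = sc.textFile("hdfs:///logs/access.log")
  .persist(StorageLevel.MEMORY_AND_DISK)
```

The trade-off is recomputation cost (CPU, upstream reads) versus local disk I/O; for expensive-to-derive datasets, `MEMORY_AND_DISK` is usually the safer default.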