Why does SparkSession execute twice for one action?

It happens because you don't provide a schema for the DataFrameReader. As a result, Spark has to eagerly scan the data set to infer the output schema before it can build the DataFrame.

Since mappedRdd is not cached, it will be evaluated twice:

  • once for schema inference
  • once when you call data.show

If you want to prevent this, you should provide a schema for the reader (Scala syntax):

val schema: org.apache.spark.sql.types.StructType = ???
spark.read.schema(schema).json(mappedRdd)
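As a minimal sketch of what the schema definition could look like, assuming the JSON records contain fields such as `id` and `name` (hypothetical names; adjust to your actual data):

```scala
import org.apache.spark.sql.types.{StructField, StructType, LongType, StringType}

// Hypothetical schema; the field names and types here are assumptions,
// replace them with the fields your JSON records actually contain.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("name", StringType, nullable = true)
))

// With an explicit schema Spark skips the inference scan,
// so mappedRdd is evaluated only once, when the action runs.
val data = spark.read.schema(schema).json(mappedRdd)
```

Alternatively, if the schema cannot be known up front, calling `mappedRdd.cache()` before the read does not remove the second evaluation, but it makes the second pass read from the cache instead of recomputing the lineage.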
