Why does SparkSession execute twice for one action?

It happens because you don't provide a schema for the DataFrameReader. As a result, Spark has to eagerly scan the data set to infer the output schema before it can build the DataFrame.

Since mappedRdd is not cached, it will be evaluated twice:

  • once for schema inference
  • once when you call data.show

If you want to prevent this, you should provide a schema for the reader (Scala syntax):

val schema: org.apache.spark.sql.types.StructType = ???
spark.read.schema(schema).json(mappedRdd)
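As a minimal sketch of what the schema definition could look like, assuming the JSON records contain fields such as `id` and `name` (hypothetical names; adjust to your actual data):

```scala
import org.apache.spark.sql.types.{StructField, StructType, LongType, StringType}

// Hypothetical schema; the field names and types here are assumptions,
// replace them with the fields your JSON records actually contain.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("name", StringType, nullable = true)
))

// With an explicit schema Spark skips the inference scan,
// so mappedRdd is evaluated only once, when the action runs.
val data = spark.read.schema(schema).json(mappedRdd)
```

Alternatively, if the schema cannot be known up front, calling `mappedRdd.cache()` before the read does not remove the second evaluation, but it makes the second pass read from the cache instead of recomputing the lineage.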
