This happens because Spark tries to perform a Broadcast Hash Join and one of the DataFrames is very large, so broadcasting it takes longer than the timeout allows.
You can:
- Set a higher `spark.sql.broadcastTimeout` to increase the timeout, e.g. `spark.conf.set("spark.sql.broadcastTimeout", 36000)` (the value is in seconds; the default is 300).
- `persist()` both DataFrames; Spark will then use a Shuffle Join instead (see the sketch below).
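Here is a minimal sketch of the second option. The DataFrame names `df1` and `df2` and the toy data are assumptions for illustration; per the suggestion above, persisting both sides leads Spark to plan a shuffle join rather than broadcasting either DataFrame.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PersistJoinSketch").getOrCreate()

# Hypothetical DataFrames standing in for the two sides of your join.
df1 = spark.range(1_000_000).withColumnRenamed("id", "key")
df2 = spark.range(1_000).withColumnRenamed("id", "key")

# Persist both sides; per the suggestion above, Spark then falls back
# to a shuffle join instead of trying to broadcast one of them.
df1.persist()
df2.persist()

joined = df1.join(df2, on="key")
joined.count()  # an action to actually trigger the join
```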
PySpark
In PySpark, you can set the config when you build the SparkSession, as follows:
```python
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Your App") \
    .config("spark.sql.broadcastTimeout", "36000") \
    .getOrCreate()
```
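If you launch the application with `spark-submit` instead, the same property can be passed on the command line with `--conf spark.sql.broadcastTimeout=36000`. Either way, the value is in seconds.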