This happens because Spark tries to perform a Broadcast Hash Join and one of the DataFrames is very large, so broadcasting it takes longer than the timeout allows.
You can:
- Set a higher `spark.sql.broadcastTimeout` to increase the timeout, e.g. `spark.conf.set("spark.sql.broadcastTimeout", 36000)` (the value is in seconds; the default is 300).
- `persist()` both DataFrames; Spark will then use a Shuffle Join instead (see the sketch below).
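Here is a minimal sketch of the second option. The DataFrame names `df1` and `df2` and the toy data are assumptions for illustration; per the suggestion above, persisting both sides leads Spark to plan a shuffle join rather than broadcasting either DataFrame.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PersistJoinSketch").getOrCreate()

# Hypothetical DataFrames standing in for the two sides of your join.
df1 = spark.range(1_000_000).withColumnRenamed("id", "key")
df2 = spark.range(1_000).withColumnRenamed("id", "key")

# Persist both sides; per the suggestion above, Spark then falls back
# to a shuffle join instead of trying to broadcast one of them.
df1.persist()
df2.persist()

joined = df1.join(df2, on="key")
joined.count()  # an action to actually trigger the join
```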
PySpark
In PySpark, you can set the config when you build the SparkSession, as follows:
```python
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Your App") \
    .config("spark.sql.broadcastTimeout", "36000") \
    .getOrCreate()
```
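If you launch the application with `spark-submit` instead, the same property can be passed on the command line with `--conf spark.sql.broadcastTimeout=36000`. Either way, the value is in seconds.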