What are broadcast variables? What problems do they solve?

If you have a huge array that is accessed from Spark Closures, for example, some reference data, this array will be shipped to each spark node with closure. For example, if you have 10 nodes cluster with 100 partitions (10 partitions per node), this Array will be distributed at least 100 times (10 times to each node).

If you use broadcast, it will be distributed once per node using an efficient p2p protocol.

val array: Array[Int] = ??? // some huge array
val broadcasted = sc.broadcast(array)

And some RDD

val rdd: RDD[Int] = ???

In this case, array will be shipped with closure each time

rdd.map(i => array.contains(i))

and with broadcast, you’ll get a huge performance benefit

rdd.map(i => broadcasted.value.contains(i))

Broadcast variables are used to send shared data (for example application configuration) across all nodes/executors.

The broadcast value will be cached in all the executors.

Sample scala code creating broadcast variable at driver:

val broadcastedConfig:Broadcast[Option[Config]] = sparkSession.sparkContext.broadcast(objectToBroadcast)

Sample scala code receiving broadcasted variable at executor side:

val config =  broadcastedConfig.value

Leave a Comment