Skewed dataset join in Spark?

Pretty good article on how it can be done: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/

Short version:

  • Add random element to large RDD and create new join key with it
  • Add random element to small RDD using explode/flatMap to increase number of entries and create new join key
  • Join RDDs on new join key which will now be distributed better due to random seeding

Leave a Comment