Skewed dataset join in Spark?
Pretty good article on how it can be done: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/ Short version: Add random element to large RDD and create new join key with it Add random element to small RDD using explode/flatMap to increase number of entries and create new join key Join RDDs on new join key which will now be distributed better … Read more