Pretty good article on how it can be done: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
Short version:
- Add random element to large RDD and create new join key with it
- Add random element to small RDD using explode/flatMap to increase number of entries and create new join key
- Join RDDs on new join key which will now be distributed better due to random seeding