Write to multiple outputs by key in Spark – one Spark job
If you use Spark 1.4+, this has become much, much easier thanks to the DataFrame API. (DataFrames were introduced in Spark 1.3, but partitionBy(), which we need, was introduced in 1.4.) If you're starting out with an RDD, you'll first need to convert it to a DataFrame:

```scala
val people_rdd = sc.parallelize(Seq((1, "alice"), (1, "bob"), (2, …
```
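A minimal sketch of the full flow, assuming hypothetical column names (`number`, `name`), sample data, and an output path — the original snippet is truncated, so everything past the second pair is illustrative:

```scala
// Assumes a SQLContext is in scope (Spark 1.4-era API) so that
// .toDF() is available on the RDD via implicits.
import sqlContext.implicits._
import org.apache.spark.sql.SaveMode

// Hypothetical sample data: (key, value) pairs.
val people_rdd = sc.parallelize(Seq((1, "alice"), (1, "bob"), (2, "charlie")))
val people_df = people_rdd.toDF("number", "name")

// partitionBy("number") makes Spark write one subdirectory per key,
// e.g. people_by_number/number=1/ and people_by_number/number=2/,
// all in a single job.
people_df.write
  .mode(SaveMode.Overwrite)
  .partitionBy("number")
  .format("json")
  .save("people_by_number")
```

Each `number=<key>` directory then contains only the rows for that key, which is exactly the "multiple outputs by key" behaviour, without running one job per key.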