I had the exact same problem and found a way to do this using DataFrame.repartition(). The problem with using coalesce(1) is that your parallelism drops to 1, so it can be slow at best and error out at worst. Increasing that number doesn't help either: if you do coalesce(10) you get more parallelism, but end up with 10 files per output partition.

To get one file per partition without using coalesce(), use repartition() with the same columns you want the output partitioned by. So in your case, do this:
import org.apache.spark.sql.SaveMode
import spark.implicits._

df
  .repartition($"entity", $"year", $"month", $"day", $"status")
  .write
  .partitionBy("entity", "year", "month", "day", "status")
  .mode(SaveMode.Append)
  .parquet(s"$location")
Once I do that, I get one Parquet file per output partition instead of multiple files. The snippet above is Scala, but I tested the same approach in Python and it behaves the same way.