I had the exact same problem and found a way to do this using DataFrame.repartition(). The problem with using coalesce(1) is that your parallelism drops to 1, so it can be slow at best and error out at worst. Increasing that number doesn't help either: if you do coalesce(10) you get more parallelism, but end up with 10 files per output partition.

To get one file per partition without using coalesce(), use repartition() with the same columns you want the output partitioned by. So in your case, do this:
import org.apache.spark.sql.SaveMode
import spark.implicits._

df
  .repartition($"entity", $"year", $"month", $"day", $"status")
  .write
  .partitionBy("entity", "year", "month", "day", "status")
  .mode(SaveMode.Append)
  .parquet(s"$location")
Once I do that, I get one Parquet file per output partition instead of multiple files. The snippet above is Scala, but I tested the same approach in Python and it behaves the same way.