Spark DataFrame write method writing many small files

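You have to repartition your DataFrame to match the partitioning of the DataFrameWriter. Otherwise every write task creates its own file in each date directory it holds rows for, which is where the many small files come from.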

Try this:

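import org.apache.spark.sql.SaveMode
import spark.implicits._ // enables the $"date" column syntax

df
  .repartition($"date") // all rows for a given date end up in the same partition
  .write.mode(SaveMode.Append)
  .partitionBy("date") // one output directory per date value
  .parquet(path)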
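To see why this reduces the file count, here is a rough toy model in plain Scala (no Spark needed). It assumes that each output file corresponds to one distinct (write task, date) pair, and that repartitioning by date conceptually routes rows by a hash of the date. The object name, the sample dates, the row layout and the task count of 50 are all made up for illustration, and the hash used here is not Spark's actual one.

```scala
// Toy model: with partitionBy("date"), every write task produces one file
// per date value it holds, so files = distinct (task, date) pairs.
object SmallFilesSketch {
  // Hypothetical data: 1000 rows cycling through 4 dates.
  val dates: Vector[String] =
    Vector("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04")
  val rows: Vector[String] = Vector.tabulate(1000)(i => dates(i % dates.size))

  // Assumed number of write tasks (illustrative only).
  val numTasks: Int = 50

  // Count distinct (task, date) pairs for a given row-to-task assignment.
  def fileCount(assignTask: (String, Int) => Int): Int =
    rows.zipWithIndex
      .map { case (date, i) => (assignTask(date, i), date) }
      .distinct
      .size

  // No repartition: rows are spread over tasks regardless of their date,
  // so every task ends up holding every date.
  val withoutRepartition: Int = fileCount((_, i) => (i / 4) % numTasks)

  // Repartition by date: the task is chosen from the date alone,
  // so each date lives in exactly one task.
  val withRepartition: Int =
    fileCount((date, _) => Math.floorMod(date.hashCode, numTasks))

  def main(args: Array[String]): Unit = {
    println(s"files without repartition: $withoutRepartition") // 200 (50 tasks x 4 dates)
    println(s"files with repartition:    $withRepartition")    // 4 (one per date)
  }
}
```

The trade-off is that a single task now writes all rows for a date, so a date with far more data than the others will produce one large file and one slow task.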
