Write Spark dataframe as CSV with partitions

Spark 2.0.0+: The built-in csv format supports partitioning out of the box, so you should be able to simply use df.write.partitionBy("partition_date").mode(mode).format("csv").save(path) without including any additional packages. Spark < 2.0.0: At this moment (v1.4.0) spark-csv doesn't support partitionBy (see databricks/spark-csv#123), but you can adjust the built-in sources to achieve what you want. You can try two different approaches. … Read more

How to update partition metadata in Hive when partition data is manually deleted from HDFS

EDIT: Starting with Hive 3.0.0, MSCK can discover new partitions or remove missing partitions (or both) using the following syntax: MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS]. This was implemented in HIVE-17824. As correctly stated by HakkiBuyukcengiz, MSCK REPAIR doesn't remove partitions if the corresponding folder on HDFS was manually deleted; it only … Read more
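The Hive 3.0.0 syntax quoted above, spelled out against a hypothetical table (the table name and the choice of clause are illustrative):

```sql
-- Assumes Hive >= 3.0.0; my_table is a placeholder name.
-- Add partitions whose directories exist on HDFS but are missing from the metastore:
MSCK REPAIR TABLE my_table ADD PARTITIONS;
-- Drop partitions whose directories were manually deleted from HDFS:
MSCK REPAIR TABLE my_table DROP PARTITIONS;
-- Do both in one pass:
MSCK REPAIR TABLE my_table SYNC PARTITIONS;
```

On older Hive versions, the DROP side has to be done manually with ALTER TABLE … DROP PARTITION for each stale partition.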

Determining optimal number of Spark partitions based on workers, cores and DataFrame size

Yes, a Spark application has one and only one Driver. What is the relationship between numWorkerNodes and numExecutors? A worker can host multiple executors: you can think of the worker as the machine/node of your cluster and the executor as a process (executing on a core) that runs on that worker. So … Read more
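The worker/executor relationship and the usual partition-count sizing can be sketched in plain Python; all numbers, the function names, and the 2-4 tasks-per-core heuristic are illustrative assumptions, not hard Spark limits:

```python
def num_executors(num_worker_nodes, executors_per_node):
    # Each worker (a machine/node) can host several executor processes.
    return num_worker_nodes * executors_per_node

def suggested_partitions(total_cores, tasks_per_core=3):
    # Common rule of thumb: aim for 2-4 tasks per available core so that
    # uneven task durations do not leave cores idle.
    return total_cores * tasks_per_core

executors = num_executors(num_worker_nodes=4, executors_per_node=2)  # 8 executors
total_cores = executors * 4                                          # 4 cores each -> 32
print(suggested_partitions(total_cores))                             # 96
```

The real optimum also depends on DataFrame size and skew, so treat the result as a starting point for tuning, not a fixed answer.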

Why does sortBy transformation trigger a Spark job?

sortBy is implemented using sortByKey, which depends on a RangePartitioner (JVM) or a partitioning function (Python). When you call sortBy / sortByKey, the partitioner (partitioning function) is initialized eagerly and samples the input RDD to compute partition boundaries. The job you see corresponds to this process. The actual sorting is performed only if you execute an action on the newly … Read more

Split a list of numbers into n chunks such that the chunks have (close to) equal sums and keep the original order

This approach defines partition boundaries that divide the array into roughly equal numbers of elements, and then repeatedly searches for better partitionings until it can't find any more. It differs from most of the other posted solutions in that it tries to find an optimal solution by exploring multiple different partitionings. The other solutions attempt … Read more
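For contrast, here is a minimal single-pass greedy sketch of the same task (order-preserving chunks with near-equal sums), simpler than the boundary-refinement approach the answer describes; the function name and sample data are illustrative:

```python
def split_into_chunks(nums, n):
    """Split nums into n ordered chunks whose sums are close to equal."""
    target = sum(nums) / n
    chunks, current, running = [], [], 0.0
    for i, x in enumerate(nums):
        current.append(x)
        running += x
        remaining_elems = len(nums) - i - 1
        remaining_chunks = n - len(chunks) - 1
        # Close the chunk once the running sum reaches the per-chunk target,
        # but always leave at least one element per remaining chunk.
        if running >= target and remaining_chunks > 0 and remaining_elems >= remaining_chunks:
            chunks.append(current)
            current, running = [], 0.0
    chunks.append(current)
    return chunks

print(split_into_chunks([1, 6, 2, 3, 4, 1, 7, 6, 4], 3))
# -> [[1, 6, 2, 3], [4, 1, 7], [6, 4]] (sums 12, 12, 10)
```

A greedy pass like this can get stuck with a lopsided final chunk, which is exactly what the answer's repeated boundary-adjustment search is designed to fix.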