Why are Spark Parquet files for an aggregate larger than the original?

In general, columnar storage formats like Parquet are highly sensitive to data distribution (how the data is organized) and to the cardinality of individual columns. The more organized the data and the lower the cardinality, the more efficient the storage.
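You can see the effect by writing the same rows in two different orders. The following is a minimal sketch, not your job: the column names (country, payload), row count, and output paths are made up for illustration, and it uses the Spark 2.x SparkSession API:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, rand}

val spark = SparkSession.builder().appName("parquet-size-demo").getOrCreate()

// 10M rows; "country" has only 10 distinct values (low cardinality),
// "payload" is high-cardinality noise.
val df = spark.range(10000000L)
  .withColumn("country", (col("id") % 10).cast("string"))
  .withColumn("payload", rand())

// Sorted by the low-cardinality column: long runs of identical values,
// so Parquet's dictionary and run-length encodings compress them well.
df.sort("country").write.parquet("/tmp/demo_sorted")

// Shuffled into an arbitrary order: the same values are scattered across
// row groups and pages, so the output files end up noticeably larger.
df.repartition(200).write.parquet("/tmp/demo_shuffled")
```

Comparing the size of the two output directories on disk should show the sorted variant is clearly smaller, even though both contain identical rows.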

An aggregation like the one you apply has to shuffle the data. If you check the execution plan you'll see it uses a hash partitioner, which means that after the aggregation the distribution can be less efficient than it was for the original data. At the same time, sum can reduce the number of rows but increase the cardinality of the rCount column.
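This is roughly how the hash exchange shows up when you inspect the plan; the column names (key, rCount) and the input path below are placeholders for whatever your query actually uses:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("plan-demo").getOrCreate()

// Placeholder input; substitute the real source and column names.
val input = spark.read.parquet("/tmp/input")

val aggregated = input
  .groupBy("key")
  .agg(sum("rCount").as("rCount"))

// The physical plan contains an Exchange with hashpartitioning(key, ...):
// rows are redistributed by a hash of the grouping key, which discards
// whatever clustering or ordering the source files had.
aggregated.explain()
```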

You can try different tools to correct for that, although not all of them are available in Spark 1.5.2 (a rough sketch of each follows the list):

  • Sort the complete dataset by low-cardinality columns (quite expensive due to the full shuffle), or use sortWithinPartitions for a cheaper local sort.
  • Use the partitionBy method of DataFrameWriter to partition the data by low-cardinality columns.
  • Use the bucketBy and sortBy methods of DataFrameWriter (Spark 2.0.0+) to improve data distribution using bucketing and local sorting.
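A sketch of all three options, again with a hypothetical low-cardinality column country and placeholder paths/table names; note that bucketed output has to be written with saveAsTable rather than to a plain path:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("layout-demo").getOrCreate()

// Placeholder for the aggregated result you want to persist.
val aggregated = spark.read.parquet("/tmp/aggregated")

// 1. Global sort (full shuffle), or the cheaper per-partition variant.
aggregated.sort("country").write.parquet("/tmp/out_sorted")
aggregated.sortWithinPartitions("country").write.parquet("/tmp/out_sorted_locally")

// 2. Partition the output directories by the low-cardinality column.
aggregated.write.partitionBy("country").parquet("/tmp/out_partitioned")

// 3. Spark 2.0.0+: bucket and locally sort, saved as a table.
aggregated.write
  .bucketBy(16, "country")
  .sortBy("country")
  .saveAsTable("aggregated_bucketed")
```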
