In general, columnar storage formats like Parquet are highly sensitive to data distribution (data organization) and to the cardinality of individual columns. The more organized the data and the lower the cardinality, the more efficient the storage.
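To see why layout matters, here is a minimal, Spark-free sketch: the same low-cardinality column is compressed once in arbitrary order (as it might land after a shuffle) and once sorted so that identical values sit next to each other. The labels and sizes are invented for illustration; a generic codec stands in for Parquet's run-length and dictionary encodings, which benefit from the same clustering.

```python
import random
import zlib

random.seed(42)

# A low-cardinality column: 100,000 values drawn from just 4 distinct labels.
values = [random.choice([b"north", b"south", b"east", b"west"])
          for _ in range(100_000)]

shuffled = b"".join(values)           # arbitrary order (post-shuffle layout)
clustered = b"".join(sorted(values))  # sorted: identical values are adjacent

shuffled_size = len(zlib.compress(shuffled))
clustered_size = len(zlib.compress(clustered))

# The sorted layout compresses far better: long runs of repeated values
# are exactly what run-length/dictionary encodings (and generic codecs)
# exploit.
assert clustered_size < shuffled_size
print(shuffled_size, clustered_size)
```

The same column, the same values, yet the clustered layout takes a fraction of the space; this is the effect that a post-aggregation shuffle can destroy.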
An aggregation like the one you apply has to shuffle the data. If you check the execution plan you'll see it uses a hash partitioner, which means that after aggregation the distribution can be less efficient than that of the original data. At the same time, `sum` can reduce the number of rows but increase the cardinality of the `rCount` column.
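A small plain-Python sketch of that effect (the key and value distributions are invented): grouping with a sum shrinks the row count, but the summed column ends up with many more distinct values than the original low-cardinality column had.

```python
import random
from collections import defaultdict

random.seed(0)

# 100,000 rows; the rCount-like column holds only 3 distinct values.
rows = [(random.randrange(1_000), random.choice([1, 2, 3]))
        for _ in range(100_000)]

distinct_before = len({v for _, v in rows})   # 3 distinct values

# Equivalent of groupBy(key).sum("rCount"): one output row per key.
sums = defaultdict(int)
for key, v in rows:
    sums[key] += v

distinct_after = len(set(sums.values()))

# Far fewer rows, but far more distinct values in the summed column.
assert len(sums) < len(rows)
assert distinct_after > distinct_before
```

So even though the aggregated dataset is smaller, each of its columns can be harder for Parquet to encode efficiently.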
You can try different tools to correct for that, but not all of them are available in Spark 1.5.2:
- Sort the complete dataset by columns with low cardinality (quite expensive due to the full shuffle), or use `sortWithinPartitions`.
- Use the `partitionBy` method of `DataFrameWriter` to partition data using low-cardinality columns.
- Use the `bucketBy` and `sortBy` methods of `DataFrameWriter` (Spark 2.0.0+) to improve data distribution using bucketing and local sorting.
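A sketch of the three approaches in PySpark. The path, table name, and bucket count are hypothetical, and note that `bucketBy`/`sortBy` only reached the Python API in Spark 2.3 (2.0 in Scala):

```python
def write_low_cardinality_layouts(df, path="/tmp/events"):  # hypothetical path
    """Sketch of the three options above; requires a Spark installation."""
    # Option 1: local sort only -- no extra shuffle, and it clusters equal
    # values within each partition, which helps Parquet's run-length and
    # dictionary encodings.
    df.sortWithinPartitions("rCount").write.parquet(path + "_sorted")

    # Option 2: one output directory per distinct value of a low-cardinality
    # column; only sensible when the number of distinct values is small.
    df.write.partitionBy("rCount").parquet(path + "_partitioned")

    # Option 3 (Scala 2.0.0+, PySpark 2.3.0+): bucketing plus a local sort.
    # Bucketed output must be saved as a table, not to a bare path.
    (df.write
       .bucketBy(50, "rCount")   # 50 buckets is an arbitrary choice
       .sortBy("rCount")
       .saveAsTable("events_bucketed"))
```

Which option pays off depends on how the data is read back: `partitionBy` helps queries that filter on the partition column, while bucketing mainly helps joins and aggregations on the bucket column.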