Python Spark Cumulative Sum by Group Using DataFrame
You can compute this with a window function whose frame starts at Window.unboundedPreceding and ends at the current row (offset 0), partitioned by the group column and ordered by time:

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # Per class, ordered by time, the frame covers everything up to the current row.
    windowval = (Window.partitionBy('class').orderBy('time')
                 .rangeBetween(Window.unboundedPreceding, 0))

    # Sum 'value' over that frame to get the running total within each class.
    df_w_cumsum = df.withColumn('cum_sum', F.sum('value').over(windowval))
    df_w_cumsum.show()

    +----+-----+-----+-------+
    |time|value|class|cum_sum|
    +----+-----+-----+-------+
    |   1|    3|    b|      3|
    |   2|    3|    b|      6|
    | …
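To make the frame semantics concrete, here is a minimal plain-Python sketch (no Spark needed) of what a cumulative sum per group computes. The helper name `cumulative_sum_by_group`, its `frame` parameter, and the sample rows for class `a` are hypothetical and only for illustration. It also contrasts the range-based frame used above with a row-based frame, as `Window.rowsBetween` would give: with a range frame, rows that share the same `time` are peers and all receive the same total.

```python
from collections import defaultdict


def cumulative_sum_by_group(rows, frame="range"):
    """Emulate sum('value') over a window partitioned by 'class', ordered by 'time'.

    frame="range": sum every row in the partition whose time <= the current time
                   (ties on 'time' are included together).
    frame="rows":  sum every row up to and including the current row position.
    """
    # Partition the rows by the grouping column.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["class"]].append(row)

    result = []
    for key in sorted(partitions):
        # Order each partition by time (Python's sort is stable for ties).
        ordered = sorted(partitions[key], key=lambda r: r["time"])
        for i, row in enumerate(ordered):
            if frame == "range":
                total = sum(r["value"] for r in ordered if r["time"] <= row["time"])
            else:
                total = sum(r["value"] for r in ordered[: i + 1])
            result.append({**row, "cum_sum": total})
    return result


# The two class 'b' rows match the output shown above; class 'a' is made up
# and contains a tie on time=2 to show where the two frames differ.
rows = [
    {"time": 1, "value": 3, "class": "b"},
    {"time": 2, "value": 3, "class": "b"},
    {"time": 1, "value": 2, "class": "a"},
    {"time": 2, "value": 2, "class": "a"},
    {"time": 2, "value": 5, "class": "a"},
]

range_sums = [r["cum_sum"] for r in cumulative_sum_by_group(rows, frame="range")]
rows_sums = [r["cum_sum"] for r in cumulative_sum_by_group(rows, frame="rows")]
print(range_sums)  # [2, 9, 9, 3, 6]
print(rows_sums)   # [2, 4, 9, 3, 6]
```

If the ordering column can contain duplicates and you want a strictly row-by-row running total, a row-based frame is usually the closer match. Note that Spark does not guarantee the order of tied rows, so the intermediate totals for ties may vary between runs.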