pyspark: rolling average using timeseries data

I figured out the correct way to calculate a moving/rolling average using this Stack Overflow answer: Spark Window Functions – rangeBetween dates. The basic idea is to convert your timestamp column to seconds, and then you can use the rangeBetween function in the pyspark.sql.Window class to include the correct rows in your window. Here’s the solved example: … Read more
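A minimal sketch of that idea, with a made-up single-column dataset and an assumed 7-day window (the column names and values are illustrative, not from the original post):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: one temperature reading per day
df = spark.createDataFrame(
    [("2017-03-10", 22.4), ("2017-03-11", 21.3), ("2017-03-12", 25.0),
     ("2017-03-13", 23.7), ("2017-03-15", 20.9)],
    ["dt", "temperature"],
)

days = lambda i: i * 86400  # rangeBetween operates on the numeric (seconds) value

# Convert the timestamp to epoch seconds so rangeBetween can express "last 7 days"
df = df.withColumn("ts_seconds", F.col("dt").cast("timestamp").cast("long"))

w = Window.orderBy("ts_seconds").rangeBetween(-days(7), 0)

df.withColumn("rolling_avg_7d", F.avg("temperature").over(w)).show()
```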

What is spark.driver.maxResultSize?

Assuming that a worker wants to send 4G of data to the driver, will setting spark.driver.maxResultSize=1G cause the worker to send 4 messages (instead of 1 with an unlimited spark.driver.maxResultSize)? No. If the estimated size of the data is larger than maxResultSize, the given job will be aborted. The goal here is to protect your application from … Read more
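A short sketch of how this limit is typically adjusted when a large collect() is genuinely needed; the sizes below are illustrative assumptions, not recommendations:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Total serialized size of results returned to the driver per action;
    # default is 1g, and "0" means unlimited.
    .config("spark.driver.maxResultSize", "4g")
    # The driver heap still has to hold whatever is collected.
    .config("spark.driver.memory", "8g")
    .getOrCreate()
)

# If the serialized results of an action such as collect() exceed
# spark.driver.maxResultSize, Spark aborts the job with an error rather
# than splitting the transfer into smaller messages.
rows = spark.range(10).collect()
```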

What do the numbers on the progress bar mean in spark-shell?

What you get is a console progress bar. [Stage 7: shows the stage you are in now, and (14174 + 5) / 62500] is [(numCompletedTasks + numActiveTasks) / totalNumOfTasksInThisStage]. The bar itself shows numCompletedTasks / totalNumOfTasksInThisStage. It will be shown when both spark.ui.showConsoleProgress is true (the default) and the log level in conf/log4j.properties is ERROR or … Read more
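A small sketch of the conditions under which the bar appears; the job at the end is just an assumed example with enough tasks to make the counters visible:

```python
from pyspark.sql import SparkSession

# The console progress bar appears only when spark.ui.showConsoleProgress
# is true (the default) and the log level is quiet enough (ERROR/WARN)
# that INFO lines don't overwrite it.
spark = (
    SparkSession.builder
    .config("spark.ui.showConsoleProgress", "true")
    .getOrCreate()
)
spark.sparkContext.setLogLevel("WARN")

# A job with many tasks, long enough to see something like
# [Stage 0:=====>    (123 + 8) / 1000]  i.e. (completed + active) / total
spark.range(10_000_000, numPartitions=1000).selectExpr("sum(id)").show()
```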