SparkSQL on pyspark: how to generate time series?

EDIT
This creates a dataframe with one row containing an array of consecutive dates:

from pyspark.sql.functions import sequence, to_date, explode, col

spark.sql("SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month) as date")

+------------------------------------------+
|                  date                    |
+------------------------------------------+
| ["2018-01-01","2018-02-01","2018-03-01"] |
+------------------------------------------+

You can use the explode function to “pivot” this array into rows:

spark.sql("SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month) as date").withColumn("date", explode(col("date"))

+------------+
|    date    |
+------------+
| 2018-01-01 |
| 2018-02-01 |
| 2018-03-01 |
+------------+

(End of edit)

Spark v2.4 support sequence function:

sequence(start, stop, step) – Generates an array of elements from start to stop (inclusive), incrementing by step. The type of the returned elements is the same as the type of argument expressions.

Supported types are: byte, short, integer, long, date, timestamp.

Examples:

SELECT sequence(1, 5);

[1,2,3,4,5]

SELECT sequence(5, 1);

[5,4,3,2,1]

SELECT sequence(to_date(‘2018-01-01’), to_date(‘2018-03-01’), interval 1 month);

[2018-01-01,2018-02-01,2018-03-01]

https://docs.databricks.com/spark/latest/spark-sql/language-manual/functions.html#sequence

Leave a Comment