Spark DAG differs with ‘withColumn’ vs ‘select’

Why does the DAG differ when I chain withColumn calls together with window functions?

Let’s say I want to do:

w1 = ...rangeBetween(-300, 0)
w2 = ...rowsBetween(-1,0)
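(Sketched out with placeholder columns — "id" and "ts" stand in for my real partition/order keys, and ts is assumed to be a numeric epoch column so the range frame makes sense:)

from pyspark.sql import Window

# Placeholder partition/order columns; ts assumed to be epoch seconds,
# so rangeBetween(-300, 0) covers the last 5 minutes up to the current row.
w1 = Window.partitionBy("id").orderBy("ts").rangeBetween(-300, 0)
# rowsBetween(-1, 0) covers the previous row and the current row.
w2 = Window.partitionBy("id").orderBy("ts").rowsBetween(-1, 0)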

(df.withColumn("some1", f.max("original1").over(w1))
   .withColumn("some2", f.lag("some1").over(w2))
   .show())

I run into memory problems and a lot of spill even with very small datasets. If I do the same thing using select instead of withColumn, it performs much faster.

df.select(
    f.max(col("original1")).over(w1).alias("some1"),
    f.lag("some1").over(w2).alias("some2")
).show()
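To see where the two plans diverge, this is roughly how I compare them (a sketch using the placeholder schema from above, not my production code; I leave the frame off the lag window since lag works off a fixed one-row offset, and I split the select into two steps so "some1" is resolvable):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

# Tiny stand-in DataFrame with the placeholder schema used above.
df = spark.createDataFrame(
    [(1, 100, 5.0), (1, 160, 7.0), (1, 400, 3.0), (2, 100, 9.0)],
    ["id", "ts", "original1"],
)

w1 = Window.partitionBy("id").orderBy("ts").rangeBetween(-300, 0)
# No explicit frame here: lag only looks at a fixed offset row.
w2 = Window.partitionBy("id").orderBy("ts")

# Variant 1: chained withColumn calls.
df_withcolumn = (
    df.withColumn("some1", f.max("original1").over(w1))
      .withColumn("some2", f.lag("some1").over(w2))
)

# Variant 2: the same columns added via select.
df_select = (
    df.select("*", f.max("original1").over(w1).alias("some1"))
      .select("*", f.lag("some1").over(w2).alias("some2"))
)

# Print the parsed/analyzed/optimized/physical plans for each variant.
df_withcolumn.explain(True)
df_select.explain(True)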
