Spark DAG differs with 'withColumn' vs 'select'

when using nested withColumn calls and window functions? Let's say I want to do:

w1 = …rangeBetween(-300, 0)
w2 = …rowsBetween(-1, 0)

(df.withColumn("some1", f.max("original1").over(w1))
   .withColumn("some2", f.lag("some1").over(w2))
   .show())

I get a lot of memory problems and heavy spill even with very small datasets. If I do the same using select instead of withColumn, it performs far faster:

df.select(
    f.max(col("original1")).over(w1).alias("some1"),
    … Read more

Parallel execution of directed acyclic graph of tasks

The other answer works fine but is more complicated than it needs to be. A simpler way is to execute Kahn's algorithm, but in parallel: at each step, run every task for which all dependencies have already completed.

import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class DependencyManager { … Read more
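The answer's own code is Java; the same idea, submit every task whose dependencies are satisfied and release dependents as tasks finish, can be sketched compactly in Python (run_dag, tasks, and deps are hypothetical names, not from the excerpt):

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_dag(tasks, deps, max_workers=4):
    """Run callables in `tasks` (name -> fn), respecting `deps`
    (name -> set of prerequisite names), parallelizing independent tasks."""
    indegree = {name: len(deps.get(name, ())) for name in tasks}
    dependents = defaultdict(list)
    for name, reqs in deps.items():
        for req in reqs:
            dependents[req].append(name)

    done_order = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Seed with every task that has no unmet dependencies.
        running = {pool.submit(tasks[n]): n for n, d in indegree.items() if d == 0}
        while running:
            finished, _ = wait(running, return_when=FIRST_COMPLETED)
            for fut in finished:
                name = running.pop(fut)
                fut.result()  # propagate any task exception
                done_order.append(name)
                # A finished task may unlock its dependents.
                for child in dependents[name]:
                    indegree[child] -= 1
                    if indegree[child] == 0:
                        running[pool.submit(tasks[child])] = child
    if len(done_order) != len(tasks):
        raise ValueError("cycle detected in task graph")
    return done_order
```

As in Kahn's algorithm, a cycle leaves some tasks with a permanently nonzero in-degree, which the final length check turns into an error instead of a silent hang.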