Spark DAG differs with ‘withColumn’ vs ‘select’

Why does the DAG differ when I chain withColumn calls together with window functions?

Let’s say I want to do:

w1 = ...rangeBetween(-300, 0)
w2 = ...rowsBetween(-1,0)
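(Sketched out with placeholder columns — "id" and "ts" stand in for my real partition/order keys, and ts is assumed to be a numeric epoch column so the range frame makes sense:)

from pyspark.sql import Window

# Placeholder partition/order columns; ts assumed to be epoch seconds,
# so rangeBetween(-300, 0) covers the last 5 minutes up to the current row.
w1 = Window.partitionBy("id").orderBy("ts").rangeBetween(-300, 0)
# rowsBetween(-1, 0) covers the previous row and the current row.
w2 = Window.partitionBy("id").orderBy("ts").rowsBetween(-1, 0)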

(df.withColumn("some1", f.max("original1").over(w1))
   .withColumn("some2", f.lag("some1").over(w2))
   .show())

I run into memory problems and a lot of spill even with very small datasets. If I do the same thing using select instead of withColumn, it performs much faster.

df.select(
    f.max(col("original1")).over(w1).alias("some1"),
    f.lag("some1").over(w2).alias("some2")
).show()
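To see where the two plans diverge, this is roughly how I compare them (a sketch using the placeholder schema from above, not my production code; I leave the frame off the lag window since lag works off a fixed one-row offset, and I split the select into two steps so "some1" is resolvable):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

# Tiny stand-in DataFrame with the placeholder schema used above.
df = spark.createDataFrame(
    [(1, 100, 5.0), (1, 160, 7.0), (1, 400, 3.0), (2, 100, 9.0)],
    ["id", "ts", "original1"],
)

w1 = Window.partitionBy("id").orderBy("ts").rangeBetween(-300, 0)
# No explicit frame here: lag only looks at a fixed offset row.
w2 = Window.partitionBy("id").orderBy("ts")

# Variant 1: chained withColumn calls.
df_withcolumn = (
    df.withColumn("some1", f.max("original1").over(w1))
      .withColumn("some2", f.lag("some1").over(w2))
)

# Variant 2: the same columns added via select.
df_select = (
    df.select("*", f.max("original1").over(w1).alias("some1"))
      .select("*", f.lag("some1").over(w2).alias("some2"))
)

# Print the parsed/analyzed/optimized/physical plans for each variant.
df_withcolumn.explain(True)
df_select.explain(True)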
