Convert null values to empty array in Spark DataFrame

You can use a UDF:

import org.apache.spark.sql.functions.{coalesce, col, udf, when}

// A zero-argument UDF that always returns an empty array<int>
val array_ = udf(() => Array.empty[Int])

combined with when or coalesce:

df.withColumn("myCol", when(col("myCol").isNull, array_()).otherwise(col("myCol")))
df.withColumn("myCol", coalesce(col("myCol"), array_())).show()
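
For a quick end-to-end check, here is a minimal sketch; the DataFrame df and its contents are illustrative assumptions, not part of the original answer:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, udf}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Toy data: one row with an array, one with a null in its place
val df = Seq((1, Some(Seq(1, 2))), (2, None)).toDF("id", "myCol")

val array_ = udf(() => Array.empty[Int])

df.withColumn("myCol", coalesce(col("myCol"), array_())).show()
// +---+------+
// | id| myCol|
// +---+------+
// |  1|[1, 2]|
// |  2|    []|
// +---+------+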

In recent versions you can use the array function instead:

import org.apache.spark.sql.functions.array

df.withColumn("myCol", when(col("myCol").isNull, array().cast("array<integer>")).otherwise(col("myCol")))
df.withColumn("myCol", coalesce(col("myCol"), array().cast("array<integer>"))).show()

Note that this works only when the cast from string is allowed: an empty array() is typed array<string>, so the target element type has to be reachable from string.
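
To make the constraint concrete, a small sketch (reusing the spark session from above; the struct example is my assumption based on Spark's cast rules, not something from the original answer):

// string -> integer is a legal cast, so this resolves to array<integer>
spark.range(1).select(array().cast("array<integer>").alias("ok")).printSchema()

// string cannot be cast to a struct, so an expression like the following
// fails at analysis time with an AnalysisException (message varies by version):
// spark.range(1).select(array().cast("array<struct<a:int>>"))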

The same thing can of course be done in PySpark as well. For the legacy solution you can define a udf:

from pyspark.sql.functions import coalesce, col, udf
from pyspark.sql.types import ArrayType, IntegerType

# Build a zero-argument UDF returning an empty array of the given element type
def empty_array(t):
    return udf(lambda: [], ArrayType(t()))()

df.withColumn("myCol", coalesce(col("myCol"), empty_array(IntegerType()))).show()

and in recent versions just use array:

from pyspark.sql.functions import array, coalesce, col

df.withColumn("myCol", coalesce(col("myCol"), array().cast("array<integer>"))).show()
