Defining a UDF that accepts an Array of objects in a Spark DataFrame?

What you’re looking for is Seq[o.a.s.sql.Row]:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

val my_size = udf { subjects: Seq[Row] => subjects.size }
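
For context, a minimal usage sketch; the SparkSession setup, column names and data are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// subjects is inferred as array<struct<_1:int,_2:string>>
val df = Seq(
  (1, Seq((101, "Math"), (102, "Physics"))),
  (2, Seq((201, "Chemistry")))
).toDF("id", "subjects")

df.select($"id", my_size($"subjects").as("subject_count")).show()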

Explanation:

  • The current representation of ArrayType is, as you already know, WrappedArray, so Array won’t work and it is better to stay on the safe side and declare the argument as Seq.
  • According to the official specification, the local (external) type for StructType is Row. Unfortunately, this means that access to the individual fields is not type safe, as sketched below.
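
    Fields have to be fetched by name or position, and any mismatch surfaces only at runtime. A minimal sketch, assuming each struct carries a name field of string type (adjust to your schema):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.udf

    // getAs[String]("name") compiles regardless of the actual schema
    // and fails at runtime if the field is missing or has another type.
    val first_subject = udf { subjects: Seq[Row] =>
      subjects.headOption.map(_.getAs[String]("name")).orNull
    }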

Notes:

  • To create a struct in Spark < 2.3, the function passed to udf has to return a Product type (a Tuple* or a case class), not Row. That’s because the corresponding udf variants depend on Scala reflection:

    Defines a Scala closure of n arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature.
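
    A minimal sketch of the pre-2.3 pattern; the case class and field names are illustrative:

    import org.apache.spark.sql.functions.udf

    // The struct schema struct<name:string,length:int> is inferred
    // from the case class via Scala reflection.
    case class Subject(name: String, length: Int)

    val to_struct = udf { name: String => Subject(name, name.length) }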

  • In Spark >= 2.3 it is possible to return Row directly, as long as the schema is provided:

    def udf(f: AnyRef, dataType: DataType): UserDefinedFunction
    Defines a deterministic user-defined function (UDF) using a Scala closure. For this variant, the caller must specify the output data type, and there is no automatic input type coercion.

    See for example How to create a Spark UDF in Java / Kotlin which returns a complex type?.
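
    A minimal sketch of this variant; field names are illustrative, and note that in Spark 3.x this untyped variant is disabled by default (spark.sql.legacy.allowUntypedScalaUDF restores it):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.udf
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("length", IntegerType)
    ))

    // The returned Row must match the provided schema;
    // nothing is checked at compile time.
    val to_struct_row = udf((name: String) => Row(name, name.length), schema)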
