The source of the problem is that the object returned from the UDF doesn’t conform to the declared type. np.unique not only returns a numpy.ndarray but also converts numerics to the corresponding NumPy types, which are not compatible with the DataFrame API. You can try something like this:
udf(lambda x: list(set(x)), ArrayType(IntegerType()))
or this (to preserve the original order):
udf(lambda xs: list(OrderedDict((x, None) for x in xs)),
ArrayType(IntegerType()))
instead.
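To see how the two variants differ, the plain Python functions behind the UDFs can be run without Spark. This is only a minimal sketch: the names dedupe_unordered and dedupe_ordered are illustrative and do not appear in the original snippets.

```python
from collections import OrderedDict


def dedupe_unordered(xs):
    # set() removes duplicates but makes no promise about element order.
    return list(set(xs))


def dedupe_ordered(xs):
    # OrderedDict keeps each key at the position of its first occurrence.
    return list(OrderedDict((x, None) for x in xs))


sample = [3, 1, 3, 2, 1]
print(dedupe_ordered(sample))  # [3, 1, 2]
```

Both functions return a plain list of Python ints, which is what a return type of ArrayType(IntegerType()) expects.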
If you really want np.unique, you have to convert the output back to a plain Python list:
udf(lambda x: np.unique(x).tolist(), ArrayType(IntegerType()))
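The type mismatch can be checked outside Spark as well. The sketch below assumes only that NumPy is installed; the variable names are illustrative. Note that np.unique also returns its result sorted, so the original order is not kept.

```python
import numpy as np

values = [3, 1, 3, 2, 1]

# np.unique returns a numpy.ndarray whose elements are NumPy scalars.
raw = np.unique(values)

# tolist() converts both the container and its elements to built-in types.
converted = np.unique(values).tolist()

print(type(raw).__name__)                      # ndarray
print(isinstance(raw[0], int))                 # False
print(converted, type(converted[0]).__name__)  # [1, 2, 3] int
```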