Spark Error: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)

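The source of the problem is that the object returned from the UDF doesn’t conform to the declared type. np.unique returns a numpy.ndarray rather than a list, and it also converts the numeric values to the corresponding NumPy scalar types (for example numpy.int64). Neither is compatible with the DataFrame API, which expects a plain Python list of Python ints for ArrayType(IntegerType()). You can try something like this (udf comes from pyspark.sql.functions; ArrayType and IntegerType come from pyspark.sql.types):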

udf(lambda x: list(set(x)), ArrayType(IntegerType()))

or this (to keep the order of first occurrence; OrderedDict comes from collections)

udf(lambda xs: list(OrderedDict((x, None) for x in xs)),
    ArrayType(IntegerType()))

instead.
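
To see how the two pure-Python alternatives differ, here is a small sketch that runs without Spark. The function names dedupe_unordered and dedupe_ordered are made up for illustration; their bodies are the same expressions passed to udf above.

```python
from collections import OrderedDict

def dedupe_unordered(xs):
    # Same expression as the first udf: a set drops duplicates,
    # but makes no promise about the order of the result.
    return list(set(xs))

def dedupe_ordered(xs):
    # Same expression as the second udf: OrderedDict keys keep
    # the order in which each value was first seen.
    return list(OrderedDict((x, None) for x in xs))

sample = [3, 1, 3, 2, 1]
print(sorted(dedupe_unordered(sample)))  # [1, 2, 3]
print(dedupe_ordered(sample))            # [3, 1, 2]
```

Both return a plain list of built-in int values, which is what ArrayType(IntegerType()) expects.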

If you really want np.unique, you have to convert the output to a plain Python list:

udf(lambda x: np.unique(x).tolist(), ArrayType(IntegerType()))
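
The type difference can be checked without a Spark session. The following is a minimal sketch, assuming only that NumPy is installed; unique_as_list is a hypothetical helper name wrapping the same expression as the udf above.

```python
import numpy as np

def unique_as_list(xs):
    # .tolist() turns the ndarray into a list and each NumPy
    # scalar into the matching built-in Python type.
    return np.unique(xs).tolist()

raw = np.unique([3, 1, 3, 2])
converted = unique_as_list([3, 1, 3, 2])

print(type(raw), type(raw[0]))              # ndarray holding a NumPy integer type
print(type(converted), type(converted[0]))  # list holding built-in int
print(converted)                            # [1, 2, 3]
```

Note that np.unique also sorts its result, so unlike the OrderedDict version it does not keep the original order.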
