How do I call a UDF on a Spark DataFrame using Java?

Spark >= 2.3

A Scala-style udf can be invoked directly:

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;
import scala.collection.Seq;

UserDefinedFunction mode = udf(
    (Seq<String> ss) -> ss.headOption(), DataTypes.StringType
);

df.select(mode.apply(col("vs"))).show();

Spark < 2.3

Even if we assume that your UDF is useful and cannot be replaced by a simple getItem call, it has an incorrect signature. Array columns are exposed using … Read more

Derive multiple columns from a single column in a Spark DataFrame

Generally speaking, what you want is not directly possible: a UDF can return only a single column at a time. There are two different ways you can overcome this limitation:

Return a column of complex type. The most general solution is a StructType, but you can consider ArrayType or MapType as well.

import org.apache.spark.sql.functions.udf

val df … Read more
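The excerpt cuts off before the Scala example; purely as an illustration of the StructType approach, here is a minimal PySpark sketch (the sample data, column names, and splitting logic are invented for the example):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: one column whose values hold two pieces of information.
df = spark.createDataFrame([("a_1",), ("b_2",)], ["raw"])

# Declared schema of the struct the UDF returns.
schema = StructType([
    StructField("prefix", StringType(), False),
    StructField("suffix", StringType(), False),
])

# A single UDF call yields one struct column holding both derived values.
split_udf = udf(lambda s: tuple(s.split("_")), schema)

# Selecting out.* expands the struct into two ordinary columns.
df.select(split_udf(col("raw")).alias("out")).select("out.*").show()

Expanding the struct with out.* is what turns the one allowed return column into multiple derived columns.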

SQL Server 2008 – How do I return a User-Defined Table Type from a Table-Valued Function?

Even though you cannot return the UDTT from a function, you can return a table variable and receive it in a UDTT, as long as the schemas match. The following code is tested in SQL Server 2008 R2.

-- Create the UDTT
CREATE TYPE dbo.MyCustomUDDT AS TABLE
(
    FieldOne varchar(512),
    FieldTwo varchar(1024)
)
… Read more

Writing a UDF for lookups in a Map in Java giving "Unsupported literal type class java.util.HashMap"

You can pass the lookup map or array etc. to the UDF by using partial. Check out this example:

from functools import partial
from pyspark.sql.functions import udf

fruit_dict = {"O": "Orange", "A": "Apple", "G": "Grape"}
df = spark.createDataFrame([("A", 20), ("G", 30), ("O", 10)], ["Label", "Count"])

def decipher_fruit(label, fruit_map):
    label_names = list(fruit_map.keys())
    if label in … Read more
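Since the excerpt cuts off mid-function, here is a runnable sketch of the same partial pattern; the .get lookup body is an assumption standing in for the truncated original:

from functools import partial

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

fruit_dict = {"O": "Orange", "A": "Apple", "G": "Grape"}
df = spark.createDataFrame([("A", 20), ("G", 30), ("O", 10)], ["Label", "Count"])

def decipher_fruit(label, fruit_map):
    # Plain Python lookup; assumed body, the original is truncated.
    return fruit_map.get(label)

# Binding fruit_dict with partial avoids passing a HashMap literal
# (the source of the "Unsupported literal type" error) into the UDF.
decipher_udf = udf(partial(decipher_fruit, fruit_map=fruit_dict), StringType())

df.withColumn("Name", decipher_udf(col("Label"))).show()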

Spark Error: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)

The source of the problem is that the object returned from the UDF doesn't conform to the declared type. np.unique not only returns a numpy.ndarray but also converts numerics to the corresponding NumPy types, which are not compatible with the DataFrame API. You can try something like this:

udf(lambda x: list(set(x)), ArrayType(IntegerType()))

or this (to keep order):

udf(lambda … Read more
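Run end to end, the first suggested replacement looks like this minimal sketch (sample data invented); set() deduplicates using plain Python ints, so the result actually conforms to the declared ArrayType(IntegerType()):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, IntegerType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([([1, 2, 2, 3],)], ["xs"])

# Unlike np.unique, set() never produces NumPy scalar types,
# so the returned values match the declared element type.
dedup = udf(lambda x: list(set(x)), ArrayType(IntegerType()))

df.select(dedup(col("xs")).alias("unique_xs")).show()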

Returning from a function with OUT parameter

It would work like this:

CREATE OR REPLACE FUNCTION name_function(param_1 varchar, OUT param_2 bigint)
  LANGUAGE plpgsql AS
$func$
BEGIN
INSERT INTO table (collumn_seq, param_1)  -- "param_1" also the column name?
VALUES (DEFAULT, param_1)
RETURNING collumn_seq
INTO param_2;
END
$func$;

Normally, you would add a RETURN statement, but with OUT parameters, this is optional. Refer … Read more

How to find mean of grouped Vector columns in Spark SQL?

Spark >= 2.4

You can use Summarizer:

import org.apache.spark.ml.stat.Summarizer

val dfNew = df.as[(Int, org.apache.spark.mllib.linalg.Vector)]
  .map { case (group, v) => (group, v.asML) }
  .toDF("group", "features")

dfNew
  .groupBy($"group")
  .agg(Summarizer.mean($"features").alias("means"))
  .show(false)

+-----+--------------------------------------------------------------------+
|group|means                                                               |
+-----+--------------------------------------------------------------------+
|1    |[8.740630742016827E12,2.6124956666260462E14,3.268714653521495E14]   |
|6    |[2.1153266920139112E15,2.07232483974322592E17,6.2715161747245427E17]|
|3    |[6.3781865566442836E13,8.359124419656149E15,1.865567821598214E14]   |
|5    |[4.270201403521642E13,6.561211706745676E13,8.395448246737938E15]    |
|9    |[3.577032684241448E16,2.5432362841314468E16,2.3744826986293008E17]  |
|4    |[2.339253775419023E14,8.517531902022505E13,3.055115780965264E15]    |
|8    |[8.029924756674456E15,7.284873600992855E17,3.08621303029924E15]     |
|7    |[3.2275104122699105E15,7.5472363442090208E16,7.022556624056291E14]  |
… Read more
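The same Summarizer aggregation is also exposed in PySpark; a minimal sketch with invented toy data, using ml (not mllib) Vectors, which is what Summarizer expects:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Summarizer

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, Vectors.dense(1.0, 2.0)),
     (1, Vectors.dense(3.0, 4.0)),
     (2, Vectors.dense(5.0, 6.0))],
    ["group", "features"],
)

# Summarizer.mean aggregates the vector column element-wise per group.
df.groupBy("group") \
  .agg(Summarizer.mean(df.features).alias("means")) \
  .show(truncate=False)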