How to add a new column with random values to an existing DataFrame in Scala

Spark >= 2.3

It is possible to disable some optimizations using asNondeterministic method:

import org.apache.spark.sql.expressions.UserDefinedFunction

val f: UserDefinedFunction = ???
val fNonDeterministic: UserDefinedFunction = f.asNondeterministic

Please make sure you understand the guarantees before using this option.
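To make this concrete, here is a hedged sketch of how asNondeterministic might be used: it wraps a random-number generator in a udf and marks it non-deterministic so the optimizer does not collapse repeated invocations into a single constant. The DataFrame name df and the column name "random_value" are assumptions for illustration; this requires Spark >= 2.3 and an active SparkSession.

```scala
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Wrap a nullary random generator in a udf; without asNondeterministic
// the optimizer may evaluate it once and reuse the result for every row.
val randomLong: UserDefinedFunction =
  udf(() => scala.util.Random.nextLong).asNondeterministic()

// df is a placeholder for an existing DataFrame.
val withRandom = df.withColumn("random_value", randomLong())
```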

Spark < 2.3

A function passed to udf should be deterministic (with the possible exception of SPARK-20586), and calls to nullary functions can be replaced by constants. If you want to generate random numbers, use one of the built-in functions:

  • rand: Generate a random column with independent and identically distributed (i.i.d.) samples from U[0.0, 1.0].
  • randn: Generate a column with i.i.d. samples from the standard normal distribution.

and transform the output to obtain the required distribution, for example:

(rand * Integer.MAX_VALUE).cast("bigint").cast("string")
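The same scaling can be sketched in plain Scala, with scala.util.Random standing in for Spark's rand as the source of U[0.0, 1.0] samples; the toLong and toString calls mirror the cast("bigint") and cast("string") steps above.

```scala
import scala.util.Random

// An i.i.d. sample from U[0.0, 1.0], as Spark's rand would produce per row.
val u: Double = Random.nextDouble()

// Scale into [0, Integer.MAX_VALUE] and truncate, like cast("bigint").
val asLong: Long = (u * Integer.MAX_VALUE).toLong

// Render as text, like cast("string").
val asString: String = asLong.toString
```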
