Filter Spark DataFrame on string contains

You can use contains, like, or rlike, depending on your requirements:

contains (matches an arbitrary substring):

    df.filter($"foo".contains("bar"))

like (SQL LIKE with SQL simple regular expressions, with _ matching an arbitrary character and % matching an arbitrary sequence):

    df.filter($"foo".like("bar"))

rlike (LIKE with Java regular expressions):

    df.filter($"foo".rlike("bar"))

LIKE and RLIKE should work with SQL expressions as well.
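A minimal runnable sketch of all three filters, assuming a local SparkSession named spark (the sample data is hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("contains-demo").getOrCreate()
    import spark.implicits._

    val df = Seq("foobar", "foo_bar", "baz").toDF("foo")

    df.filter($"foo".contains("bar")).show()     // substring: keeps foobar and foo_bar
    df.filter($"foo".like("%bar")).show()        // SQL LIKE: values ending in "bar"
    df.filter($"foo".rlike("^foo.*bar$")).show() // Java regex: starts with "foo", ends with "bar"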

Is it possible to have tuple assignment to variables in Scala? [duplicate]

This isn't simply "multiple variable assignment", it's fully-featured pattern matching! So the following are all valid:

    val (a, b) = (1, 2)
    val Array(a, b) = Array(1, 2)
    val h :: t = List(1, 2)
    val List(a, Some(b)) = List(1, Option(2))

This is the way that pattern matching works: it'll de-construct something into smaller parts, … Read more
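A short, self-contained sketch of the same idea; the Point case class here is hypothetical, added to show that user-defined extractors participate too:

    case class Point(x: Int, y: Int)

    val (a, b) = (1, 2)             // tuple pattern: a = 1, b = 2
    val h :: t = List(1, 2)         // cons pattern: h = 1, t = List(2)
    val Point(px, py) = Point(3, 4) // case-class extractor: px = 3, py = 4

    println(s"$a $b $h $t $px $py") // 1 2 1 List(2) 3 4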

What are the differences between Int and Integer in Scala?

"What are the differences between Integer and Int?" Integer is just an alias for java.lang.Integer. Int is the Scala integer with the extra capabilities. Looking in Predef.scala you can see the alias:

    /** @deprecated use <code>java.lang.Integer</code> instead */
    @deprecated type Integer = java.lang.Integer

However, there is an implicit conversion from Int to java.lang.Integer if … Read more
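A small sketch of the relationship, relying only on the standard Predef conversions:

    val i: Int = 42                  // Scala's Int: a value type, compiles to a JVM int
    val boxed: java.lang.Integer = i // implicit conversion (Predef.int2Integer) boxes it
    val back: Int = boxed            // Predef.Integer2int unboxes it again

    println(i.to(45))                // inclusive Range 42 to 45, via the RichInt enrichment of Int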

SBT stop run without exiting

From sbt version 0.13.5 you can add to your build.sbt:

    cancelable in Global := true

It is defined in the Keys definition as "Enables (true) or disables (false) the ability to interrupt task execution with CTRL+C." If you are using Scala 2.12.7+ you can also cancel the compilation with CTRL+C. Reference: https://github.com/scala/scala/pull/6479 There are some … Read more
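A minimal build.sbt sketch with the setting in place; the Global / cancelable slash syntax is the sbt 1.x spelling of the same key, and the Scala version is just a placeholder:

    // build.sbt
    ThisBuild / scalaVersion := "2.13.12" // placeholder version

    // let CTRL+C cancel the running task instead of killing the sbt shell
    Global / cancelable := true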

Scala3: Crafting Types Through Metaprogramming?

Just in case, here is what I meant by hiding ViewOf inside a type class (type classes are an alternative to match types). Sadly, in Scala 3 this is wordy. (version 1)

    import scala.annotation.experimental
    import scala.quoted.{Expr, Quotes, Type, quotes}

    // Library part
    trait View extends Selectable {
      def applyDynamic(key: String)(args: Any*): Any = {
        println(s"$key … Read more
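The excerpt is cut off, so here is only a minimal, self-contained illustration of the Selectable mechanism the answer builds on; the Record class and the Person fields are hypothetical, and none of the ViewOf/type-class machinery is reproduced:

    // Scala 3: a structural type backed by Selectable
    class Record(fields: Map[String, Any]) extends Selectable:
      def selectDynamic(name: String): Any = fields(name)

    type Person = Record { val name: String; val age: Int }

    val p = Record(Map("name" -> "Ada", "age" -> 36)).asInstanceOf[Person]
    println(p.name) // "Ada", resolved through selectDynamic at runtime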

Spark DataFrames when udf functions do not accept large enough input variables

User-defined functions are defined for up to 22 parameters, but the udf helper is only defined for at most 10 arguments. To handle functions with a larger number of parameters you can use org.apache.spark.sql.UDFRegistration. For example:

    val dummy = ((
      x0: Int, x1: Int, x2: Int, x3: Int, x4: Int,
      x5: Int, x6: Int, x7: Int, x8: … Read more
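The definition above is cut off; as a complement, here is a hedged sketch of registering a function past the 10-argument udf limit through UDFRegistration, assuming a SparkSession named spark (the 11-argument sum is made up):

    // an 11-argument function: too many for the udf helper, fine for register
    val sum11 = (a: Int, b: Int, c: Int, d: Int, e: Int, f: Int,
                 g: Int, h: Int, i: Int, j: Int, k: Int) =>
      a + b + c + d + e + f + g + h + i + j + k

    spark.udf.register("sum11", sum11)

    // callable from SQL once registered
    spark.sql("SELECT sum11(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)").show() // 11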

How to transpose an RDD in Spark

Say you have an N×M matrix. If both N and M are so small that you can hold N×M items in memory, it doesn't make much sense to use an RDD. But transposing it is easy:

    val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
    val transposed = sc.parallelize(rdd.collect.toSeq.transpose)

If N or … Read more
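The excerpt stops before the large-matrix case; one common approach when the rows do not fit in memory (an assumption here, not necessarily this answer's continuation) is to tag every cell with its row and column index and regroup by column:

    // tag each value with (colIdx, (rowIdx, value)) ...
    val byCell = rdd.zipWithIndex.flatMap { case (row, rowIdx) =>
      row.zipWithIndex.map { case (value, colIdx) =>
        (colIdx, (rowIdx, value))
      }
    }

    // ... then collect each column and rebuild it as a row, ordered by the original row index
    val transposedLarge = byCell.groupByKey.sortByKey().values
      .map(_.toSeq.sortBy(_._1).map(_._2))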

Spark UDF with varargs

UDFs don't support varargs* but you can pass an arbitrary number of columns wrapped using an array function:

    import org.apache.spark.sql.functions.{udf, array, lit}

    val myConcatFunc = (xs: Seq[Any], sep: String) => xs.filter(_ != null).mkString(sep)
    val myConcat = udf(myConcatFunc)

An example usage:

    val df = sc.parallelize(Seq(
      (null, "a", "b", "c"),
      ("d", null, null, "e")
    )).toDF("x1", "x2", "x3", … Read more
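The call site is cut off; here is a sketch of typical usage, assuming the df and myConcat above, the spark implicits in scope, and that the truncated toDF names a hypothetical fourth column x4:

    // wrap the columns with array() and pass the separator as a literal
    df.select(myConcat(array($"x1", $"x2", $"x3", $"x4"), lit("-")).alias("concatenated"))
      .show()
    // expected rows: "a-b-c" and "d-e" (nulls are dropped inside the UDF)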