Spark ML VectorAssembler returns strange output

There is nothing strange about the output. Your vector seems to have lots of zero elements, so Spark used its sparse representation. To explain further: it seems your vector is composed of 18 elements (its dimension). The indices [0,1,6,9,14,17] of the vector contain the non-zero elements, which are, in order, [17.0,15.0,3.0,1.0,4.0,2.0]. Sparse Vector representation … Read more
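For illustration, here is a minimal sketch of that sparse form built directly with Spark's Vectors factory, using the values quoted above:

```scala
import org.apache.spark.ml.linalg.Vectors

// Sparse representation: (size, indices of non-zero entries, non-zero values).
val sparse = Vectors.sparse(18, Array(0, 1, 6, 9, 14, 17),
                                Array(17.0, 15.0, 3.0, 1.0, 4.0, 2.0))

println(sparse)
// (18,[0,1,6,9,14,17],[17.0,15.0,3.0,1.0,4.0,2.0])

// Expanding it back shows the 12 zero entries the sparse form omits.
println(sparse.toArray.mkString("[", ",", "]"))
```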

Spark textFile vs wholeTextFiles

The main difference, as you mentioned, is that textFile will return an RDD with each line as an element while wholeTextFiles returns a PairRDD with the key being the file path. If there is no need to separate the data depending on the file, simply use textFile. When reading uncompressed files with textFile, it will … Read more
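A minimal sketch of the two calls side by side, assuming a placeholder directory data/ containing several text files and an existing SparkContext:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def compare(sc: SparkContext): Unit = {
  // One element per line, with no record of which file a line came from.
  val lines: RDD[String] = sc.textFile("data/")

  // One element per file: (file path, entire file content).
  val files: RDD[(String, String)] = sc.wholeTextFiles("data/")

  // e.g. a per-file line count, which is only possible with wholeTextFiles.
  val linesPerFile: RDD[(String, Int)] = files.mapValues(_.split("\n").length)
  linesPerFile.collect().foreach(println)
}
```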

Schema comparison of two dataframes in scala

Based on @Derek Kaknes's answer, here's the solution I came up with for comparing schemas, concerned only with column name, datatype and nullability, and indifferent to metadata: import org.apache.spark.sql.DataFrame import org.apache.spark.sql.types.{DataType, StructField} def getCleanedSchema(df: DataFrame): Map[String, (DataType, Boolean)] = { df.schema.map { (structField: StructField) => structField.name.toLowerCase -> (structField.dataType, structField.nullable) }.toMap } // Compare relevant … Read more
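The excerpt cuts off before the comparison step; a minimal sketch of how the cleaned schemas might then be compared (the schemasMatch helper is an illustrative name, not part of the original answer):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, StructField}

// From the answer above: column name -> (dataType, nullability), metadata ignored.
def getCleanedSchema(df: DataFrame): Map[String, (DataType, Boolean)] = {
  df.schema.map { (structField: StructField) =>
    structField.name.toLowerCase -> (structField.dataType, structField.nullable)
  }.toMap
}

// Hypothetical comparison step (not from the original excerpt): equal maps mean
// the two DataFrames agree on column names, types and nullability.
def schemasMatch(a: DataFrame, b: DataFrame): Boolean =
  getCleanedSchema(a) == getCleanedSchema(b)
```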

Loaner Pattern in Scala

Make sure that whatever you compute is evaluated eagerly and no longer depends on the resource. Scala makes lazy computation fairly easy. For instance, if you wrap scala.io.Source.fromFile in this way, you might try readFile("test.txt")(_.getLines). Unfortunately, this doesn't work because getLines is lazy (it returns an iterator). And Scala doesn't have any great way to indicate … Read more
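A sketch of such a loan-pattern wrapper, showing the lazy and the eagerly evaluated call side by side (readFile is the name assumed in the answer, not a library function):

```scala
import scala.io.Source

// Loan pattern: open the resource, hand it to the caller's function, always close it.
def readFile[A](path: String)(f: Source => A): A = {
  val source = Source.fromFile(path)
  try f(source)
  finally source.close()
}

// Problematic: getLines is lazy, so the iterator would be consumed after close().
// val lines = readFile("test.txt")(_.getLines())

// Safe: force the result inside the block, before the resource is released.
val lines: List[String] = readFile("test.txt")(_.getLines().toList)
```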

Scala single method interface implementation

Scala has experimental support for SAMs starting with 2.11, under the flag -Xexperimental: Welcome to Scala version 2.11.0-RC3 (OpenJDK 64-Bit Server VM, Java 1.7.0_51). Type in expressions to have them evaluated. Type :help for more information. scala> :set -Xexperimental scala> val r: Runnable = () => println("hello world") r: Runnable = $anonfun$1@7861ff33 scala> new Thread(r).run … Read more
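For reference, from Scala 2.12 onward SAM conversion is part of the language and needs no flag; the same snippet compiles as ordinary code:

```scala
// SAM conversion: a lambda where a single-abstract-method type is expected.
val r: Runnable = () => println("hello world")
new Thread(r).start()
```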

scala parallel collections degree of parallelism

With the newest trunk, using JVM 1.6 or newer, use: collection.parallel.ForkJoinTasks.defaultForkJoinPool.setParallelism(parlevel: Int) This may be subject to change in the future, though. A more unified approach to configuring all Scala task-parallel APIs is planned for the next releases. Note, however, that while this will determine the number of processors the query … Read more
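On later Scala versions (2.12 shown here), parallelism can also be set per collection through its task support; a sketch with a placeholder pool size of 4:

```scala
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

// Assign a task support backed by a pool of the desired size to one collection.
val pc = (1 to 1000).toVector.par
pc.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(4))

// Subsequent parallel operations on pc use at most 4 worker threads.
val doubledSum = pc.map(_ * 2).sum
```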

Filter spark DataFrame on string contains

You can use contains (this works with an arbitrary sequence): df.filter($"foo".contains("bar")), like (SQL LIKE with the simple SQL pattern syntax, where _ matches an arbitrary character and % matches an arbitrary sequence): df.filter($"foo".like("bar")), or rlike (LIKE with Java regular expressions): df.filter($"foo".rlike("bar")), depending on your requirements. LIKE and RLIKE should work with SQL expressions as well.
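A small self-contained sketch of the three variants plus the SQL-expression forms; the column name and sample data are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("contains-demo").getOrCreate()
import spark.implicits._

// Placeholder data: a single string column named "foo".
val df = Seq("foobar", "barfoo", "baz").toDF("foo")

df.filter($"foo".contains("bar")).show()  // substring anywhere
df.filter($"foo".like("bar%")).show()     // SQL LIKE: starts with "bar"
df.filter($"foo".rlike("^bar")).show()    // Java regex: starts with "bar"

// The SQL-expression equivalents mentioned above.
df.filter("foo LIKE 'bar%'").show()
df.filter("foo RLIKE '^bar'").show()
```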

Is it possible to have tuple assignment to variables in Scala? [duplicate]

This isn’t simply “multiple variable assignment”; it’s fully-featured pattern matching! So the following are all valid: val (a, b) = (1, 2) val Array(a, b) = Array(1, 2) val h :: t = List(1, 2) val List(a, Some(b)) = List(1, Option(2)) This is the way that pattern matching works: it’ll de-construct something into smaller parts, … Read more
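A short sketch of the flip side: because this is ordinary pattern matching, a binding whose pattern does not match fails at runtime with a MatchError:

```scala
val (a, b) = (1, 2)           // a = 1, b = 2
val h :: t = List(1, 2, 3)    // h = 1, t = List(2, 3)

// Compiles, but throws scala.MatchError at runtime: the pattern expects
// exactly two elements and the list has three.
// val List(x, y) = List(1, 2, 3)
```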