Spark ML VectorAssembler returns strange output

There is nothing strange about the output. Your vector seems to have many zero elements, so Spark used its sparse representation. To explain further: it seems your vector is composed of 18 elements (its dimension). The indices [0,1,6,9,14,17] of the vector contain non-zero elements, which are, in order, [17.0,15.0,3.0,1.0,4.0,2.0]. Sparse Vector representation … Read more
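As a hedged illustration, the same vector can be built directly with org.apache.spark.ml.linalg.Vectors; the size, indices, and values below are taken from the answer above:

```scala
import org.apache.spark.ml.linalg.Vectors

// An 18-element vector with non-zero values at indices 0, 1, 6, 9, 14 and 17
val v = Vectors.sparse(
  18,
  Array(0, 1, 6, 9, 14, 17),
  Array(17.0, 15.0, 3.0, 1.0, 4.0, 2.0)
)

println(v)          // (18,[0,1,6,9,14,17],[17.0,15.0,3.0,1.0,4.0,2.0])
println(v.toDense)  // the same vector with all 18 values spelled out, zeros included
```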

Spark textFile vs wholeTextFiles

The main difference, as you mentioned, is that textFile returns an RDD with each line as an element, while wholeTextFiles returns a PairRDD with the key being the file path. If there is no need to separate the data by file, simply use textFile. When reading uncompressed files with textFile, it will … Read more
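For illustration, a minimal sketch of the two calls, assuming an existing SparkContext named sc and a hypothetical data/ directory:

```scala
// One element per line, across all matched files
val lines = sc.textFile("data/*.txt")        // RDD[String]

// One element per file: (file path, full file content)
val files = sc.wholeTextFiles("data/*.txt")  // RDD[(String, String)]

// Per-file processing is only possible with wholeTextFiles,
// e.g. counting the lines of each individual file:
val linesPerFile = files.mapValues(_.split("\n").length)
```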

Schema comparison of two DataFrames in Scala

Based on @Derek Kaknes's answer, here's the solution I came up with for comparing schemas, being concerned only about column name, datatype & nullability and indifferent to metadata:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, StructField}

def getCleanedSchema(df: DataFrame): Map[String, (DataType, Boolean)] = {
  df.schema.map { (structField: StructField) =>
    structField.name.toLowerCase -> (structField.dataType, structField.nullable)
  }.toMap
}

// Compare relevant … Read more
```
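A hypothetical usage sketch (df1 and df2 are assumed DataFrames): two schemas match, up to metadata and column order, exactly when the cleaned maps are equal:

```scala
val schemasMatch: Boolean = getCleanedSchema(df1) == getCleanedSchema(df2)
```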

Loaner Pattern in Scala

Make sure that whatever you compute is evaluated eagerly and no longer depends on the resource. Scala makes lazy computation fairly easy. For instance, if you wrap scala.io.Source.fromFile in this way, you might try readFile("test.txt")(_.getLines). Unfortunately, this doesn't work because getLines is lazy (it returns an iterator), and Scala doesn't have any great way to indicate … Read more
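To make the pitfall concrete, here is a minimal loan-pattern sketch; readFile is a hypothetical helper, not part of the standard library:

```scala
import scala.io.Source

def readFile[A](path: String)(f: Source => A): A = {
  val source = Source.fromFile(path)
  try f(source)
  finally source.close() // the resource is gone once we return
}

// Broken: the iterator is lazy and escapes the already-closed Source
// val it = readFile("test.txt")(_.getLines)

// Safe: force evaluation before the Source is closed
val lines: List[String] = readFile("test.txt")(_.getLines().toList)
```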

Scala single method interface implementation

Scala has experimental support for SAMs starting with 2.11, under the flag -Xexperimental:

```
Welcome to Scala version 2.11.0-RC3 (OpenJDK 64-Bit Server VM, Java 1.7.0_51).
Type in expressions to have them evaluated.
Type :help for more information.

scala> :set -Xexperimental

scala> val r: Runnable = () => println("hello world")
r: Runnable = $anonfun$1@7861ff33

scala> new Thread(r).run
```

… Read more
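For what it's worth, SAM conversion became a standard (non-experimental) feature in Scala 2.12, so no flag is needed there; a small sketch:

```scala
// Scala 2.12+: a lambda converts to any Java SAM type out of the box
val r: Runnable = () => println("hello world")
new Thread(r).start()
```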

Difference between === null and isNull in Spark DataFrame

First and foremost, don't use null in your Scala code unless you really have to for compatibility reasons. Regarding your question, it is plain SQL: col("c1") === null is interpreted as c1 = NULL and, because NULL marks undefined values, the result is undefined for any value, including NULL itself.

```
spark.sql("SELECT NULL = NULL").show

+-------------+
|(NULL … Read more
```
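To see the difference in DataFrame code, a sketch (df is an assumed DataFrame with a nullable column c1):

```scala
import org.apache.spark.sql.functions.col

// === null builds the SQL predicate c1 = NULL, which is never TRUE,
// so this filter matches no rows at all:
df.where(col("c1") === null).show()

// isNull builds c1 IS NULL, the correct test for undefined values:
df.where(col("c1").isNull).show()
```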

Scala parallel collections degree of parallelism

With the newest trunk, using JVM 1.6 or newer, use:

```scala
collection.parallel.ForkJoinTasks.defaultForkJoinPool.setParallelism(parlevel: Int)
```

This may be subject to change in the future, though. A more unified approach to configuring all Scala task parallel APIs is planned for the next releases. Note, however, that while this will determine the number of processors the query … Read more
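The setParallelism call above is the old (pre-2.10) API. As a hedged sketch of the newer, per-collection approach, assuming Scala 2.12 where .par is built in, the parallelism level can be set through a TaskSupport:

```scala
import scala.collection.parallel.ForkJoinTaskSupport

val pc = (1 to 10000).par
// Run this collection's operations on a dedicated pool with 4 worker threads
pc.tasksupport = new ForkJoinTaskSupport(new java.util.concurrent.ForkJoinPool(4))
pc.map(_ * 2)
```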