Spark ML VectorAssembler returns strange output

There is nothing strange about the output. Your vector seems to have many zero elements, so Spark used its sparse representation. To explain further: it seems your vector is composed of 18 elements (its dimension). The indices [0,1,6,9,14,17] of the vector contain non-zero elements, which are, in order, [17.0,15.0,3.0,1.0,4.0,2.0]. Sparse Vector representation … Read more
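As a hedged illustration, the same vector can be built directly with org.apache.spark.ml.linalg.Vectors; the size, indices, and values below are taken from the answer above:

```scala
import org.apache.spark.ml.linalg.Vectors

// An 18-element vector with non-zero values at indices 0, 1, 6, 9, 14 and 17
val v = Vectors.sparse(
  18,
  Array(0, 1, 6, 9, 14, 17),
  Array(17.0, 15.0, 3.0, 1.0, 4.0, 2.0)
)

println(v)          // (18,[0,1,6,9,14,17],[17.0,15.0,3.0,1.0,4.0,2.0])
println(v.toDense)  // the same vector with all 18 values spelled out, zeros included
```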

Spark textFile vs wholeTextFiles

The main difference, as you mentioned, is that textFile returns an RDD with each line as an element, while wholeTextFiles returns a PairRDD with the key being the file path. If there is no need to separate the data by file, simply use textFile. When reading uncompressed files with textFile, it will … Read more
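For illustration, a minimal sketch of the two calls, assuming an existing SparkContext named sc and a hypothetical data/ directory:

```scala
// One element per line, across all matched files
val lines = sc.textFile("data/*.txt")        // RDD[String]

// One element per file: (file path, full file content)
val files = sc.wholeTextFiles("data/*.txt")  // RDD[(String, String)]

// Per-file processing is only possible with wholeTextFiles,
// e.g. counting the lines of each individual file:
val linesPerFile = files.mapValues(_.split("\n").length)
```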

Schema comparison of two DataFrames in Scala

Based on @Derek Kaknes's answer, here's the solution I came up with for comparing schemas, being concerned only about column name, datatype & nullability and indifferent to metadata:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, StructField}

def getCleanedSchema(df: DataFrame): Map[String, (DataType, Boolean)] = {
  df.schema.map { (structField: StructField) =>
    structField.name.toLowerCase -> (structField.dataType, structField.nullable)
  }.toMap
}

// Compare relevant … Read more
```
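A hypothetical usage sketch (df1 and df2 are assumed DataFrames): two schemas match, up to metadata and column order, exactly when the cleaned maps are equal:

```scala
val schemasMatch: Boolean = getCleanedSchema(df1) == getCleanedSchema(df2)
```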

Loaner Pattern in Scala

Make sure that whatever you compute is evaluated eagerly and no longer depends on the resource. Scala makes lazy computation fairly easy. For instance, if you wrap scala.io.Source.fromFile in this way, you might try readFile("test.txt")(_.getLines). Unfortunately, this doesn't work because getLines is lazy (it returns an iterator), and Scala doesn't have any great way to indicate … Read more
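To make the pitfall concrete, here is a minimal loan-pattern sketch; readFile is a hypothetical helper, not part of the standard library:

```scala
import scala.io.Source

def readFile[A](path: String)(f: Source => A): A = {
  val source = Source.fromFile(path)
  try f(source)
  finally source.close() // the resource is gone once we return
}

// Broken: the iterator is lazy and escapes the already-closed Source
// val it = readFile("test.txt")(_.getLines)

// Safe: force evaluation before the Source is closed
val lines: List[String] = readFile("test.txt")(_.getLines().toList)
```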

Scala single method interface implementation

Scala has experimental support for SAMs starting with 2.11, under the flag -Xexperimental:

```
Welcome to Scala version 2.11.0-RC3 (OpenJDK 64-Bit Server VM, Java 1.7.0_51).
Type in expressions to have them evaluated.
Type :help for more information.

scala> :set -Xexperimental

scala> val r: Runnable = () => println("hello world")
r: Runnable = $anonfun$1@7861ff33

scala> new Thread(r).run
```

… Read more
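For what it's worth, SAM conversion became a standard (non-experimental) feature in Scala 2.12, so no flag is needed there; a small sketch:

```scala
// Scala 2.12+: a lambda converts to any Java SAM type out of the box
val r: Runnable = () => println("hello world")
new Thread(r).start()
```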

Difference between === null and isNull in Spark DataFrame

First and foremost, don't use null in your Scala code unless you really have to for compatibility reasons. Regarding your question, it is plain SQL: col("c1") === null is interpreted as c1 = NULL and, because NULL marks undefined values, the result is undefined for any value, including NULL itself.

```
spark.sql("SELECT NULL = NULL").show

+-------------+
|(NULL … Read more
```
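To see the difference in DataFrame code, a sketch (df is an assumed DataFrame with a nullable column c1):

```scala
import org.apache.spark.sql.functions.col

// === null builds the SQL predicate c1 = NULL, which is never TRUE,
// so this filter matches no rows at all:
df.where(col("c1") === null).show()

// isNull builds c1 IS NULL, the correct test for undefined values:
df.where(col("c1").isNull).show()
```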

Scala parallel collections degree of parallelism

With the newest trunk, using JVM 1.6 or newer, use:

```scala
collection.parallel.ForkJoinTasks.defaultForkJoinPool.setParallelism(parlevel: Int)
```

This may be subject to change in the future, though. A more unified approach to configuring all Scala task parallel APIs is planned for the next releases. Note, however, that while this will determine the number of processors the query … Read more
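The setParallelism call above is the old (pre-2.10) API. As a hedged sketch of the newer, per-collection approach, assuming Scala 2.12 where .par is built in, the parallelism level can be set through a TaskSupport:

```scala
import scala.collection.parallel.ForkJoinTaskSupport

val pc = (1 to 10000).par
// Run this collection's operations on a dedicated pool with 4 worker threads
pc.tasksupport = new ForkJoinTaskSupport(new java.util.concurrent.ForkJoinPool(4))
pc.map(_ * 2)
```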