What’s the difference between the Spark ML and MLlib packages?

o.a.s.mllib contains the old RDD-based API, while o.a.s.ml contains the new API built around Dataset and ML Pipelines. ml and mllib reached feature parity in 2.0.0, and mllib is slowly being deprecated (this has already happened in the case of linear regression) and will most likely be removed in the next major release. So unless your goal is backward … Read more
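To illustrate the Pipeline-centric style of the new o.a.s.ml API, here is a minimal sketch; the column names (`text`, `label`) and the `training` DataFrame are assumptions for the example, not part of the original answer:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Each stage is an Estimator or Transformer operating on DataFrame columns.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Fitting the pipeline returns a PipelineModel (a Transformer):
// val model = pipeline.fit(training)  // training: DataFrame with "text" and "label"
```

The RDD-based mllib equivalent would instead wire these steps together by hand on `RDD[LabeledPoint]`, which is the main ergonomic difference between the two packages.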

Save ML model for future usage

Spark 2.0.0+ At first glance all Transformers and Estimators implement MLWritable with the following interface:

def write: MLWriter
def save(path: String): Unit

and MLReadable with the following interface:

def read: MLReader[T]
def load(path: String): T

This means that you can use the save method to write a model to disk, for example:

import org.apache.spark.ml.PipelineModel
val model: PipelineModel … Read more
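A short round-trip sketch of the save/load pair described above; the path and the fitted `model` value are assumptions for the example:

```scala
import org.apache.spark.ml.PipelineModel

// model: PipelineModel obtained from pipeline.fit(...) (assumed)
// save writes the pipeline metadata and stage data to a directory
// (a local path here, but an HDFS or S3 URI works the same way).
model.save("/tmp/spark-lr-model")

// Later, in another application: the companion object implements
// MLReadable, so load restores the fitted model from that directory.
val sameModel = PipelineModel.load("/tmp/spark-lr-model")
```

Note that save fails if the directory already exists; `model.write.overwrite().save(path)` replaces it.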

Is Spark’s KMeans unable to handle bigdata?

I think the ‘hanging’ is because your executors keep dying. As I mentioned in a side conversation, this code runs fine for me, locally and on a cluster, in Pyspark and Scala. However, it takes a lot longer than it should. It is almost all time spent in k-means|| initialization. I opened https://issues.apache.org/jira/browse/SPARK-17389 to track … Read more
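Since the time is dominated by k-means|| initialization, one practical mitigation is to cheapen or skip that phase. A hedged sketch using the RDD-based MLlib API; `data` and the parameter values are assumptions for the example:

```scala
import org.apache.spark.mllib.clustering.KMeans

// data: RDD[org.apache.spark.mllib.linalg.Vector] of feature vectors (assumed)
val model = new KMeans()
  .setK(1000)
  .setMaxIterations(20)
  // Random init avoids the expensive k-means|| phase entirely;
  // alternatively, keep k-means|| but lower setInitializationSteps
  // from the default of 5 to e.g. 2.
  .setInitializationMode(KMeans.RANDOM)
  .run(data)
```

Random initialization usually needs a few more Lloyd iterations to converge, but that trade is often worthwhile when initialization is the bottleneck.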

Spark MLlib LDA, how to infer the topics distribution of a new unseen document?

As of Spark 1.5 this functionality has not been implemented for the DistributedLDAModel. What you’re going to need to do is convert your model to a LocalLDAModel using the toLocal method and then call the topicDistributions(documents: RDD[(Long, Vector)]) method, where documents are the new (i.e. out-of-training) documents, something like this: newDocuments: RDD[(Long, Vector)] = … … Read more
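Putting the two steps together, a minimal sketch; `distributedModel` and the way `newDocuments` is built are assumptions, and the unseen documents must be vectorized with the same vocabulary as the training set:

```scala
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LocalLDAModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// distributedModel: DistributedLDAModel from LDA.run(...) (assumed)
val localModel: LocalLDAModel = distributedModel.toLocal

// Term-count vectors for unseen documents, keyed by document id,
// produced with the same vocabulary used at training time (assumed).
val newDocuments: RDD[(Long, Vector)] = ???

// One topic-distribution vector (length = k) per document id.
val topicDists: RDD[(Long, Vector)] = localModel.topicDistributions(newDocuments)
```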

Matrix Multiplication in Apache Spark [closed]

It all depends on the input data and dimensions, but generally speaking what you want is not an RDD but one of the distributed data structures from org.apache.spark.mllib.linalg.distributed. At this moment it provides four different implementations of DistributedMatrix: IndexedRowMatrix – can be created directly from an RDD[IndexedRow], where IndexedRow consists of a row index and an org.apache.spark.mllib.linalg.Vector … Read more
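As a concrete illustration of the IndexedRowMatrix variant, a small sketch; `sc` (the SparkContext) and the toy values are assumptions for the example:

```scala
import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Build a 2x2 distributed matrix from (rowIndex, vector) pairs.
val rows = sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 2.0)),
  IndexedRow(1L, Vectors.dense(3.0, 4.0))
))
val mat = new IndexedRowMatrix(rows)

// Multiply the distributed matrix by a local matrix on the right
// (here the 2x2 identity, stored column-major).
val local = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))
val product: IndexedRowMatrix = mat.multiply(local)
```

For distributed-times-distributed multiplication, BlockMatrix (another of the four implementations) is the structure that supports it directly.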

MatchError while accessing vector column in Spark 2.0

This has nothing to do with sparsity. Since Spark 2.0.0, ML Transformers no longer generate o.a.s.mllib.linalg.VectorUDT but o.a.s.ml.linalg.VectorUDT, and are mapped locally to subclasses of o.a.s.ml.linalg.Vector. These are not compatible with the old MLlib API, which is moving towards deprecation in Spark 2.0.0. You can convert to the “old” type using Vectors.fromML: import org.apache.spark.mllib.linalg.{Vectors => OldVectors} import … Read more
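A minimal sketch of the conversion in both directions; the toy vector values are assumptions for the example:

```scala
import org.apache.spark.ml.linalg.{Vectors => NewVectors}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}

// A "new" vector as produced by ml Transformers since 2.0.0.
val mlVec = NewVectors.dense(1.0, 2.0, 3.0)       // o.a.s.ml.linalg.Vector

// Convert to the "old" mllib type for code that still expects it.
val mllibVec = OldVectors.fromML(mlVec)           // o.a.s.mllib.linalg.Vector

// And back again, via asML on the mllib vector.
val roundTrip = mllibVec.asML
```

In pattern matches on DataFrame rows, the fix is correspondingly to match on o.a.s.ml.linalg.Vector rather than the mllib class, which is what triggers the MatchError.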