What’s the difference between the Spark ML and MLlib packages?

o.a.s.mllib contains the old RDD-based API, while o.a.s.ml contains the new API built around Dataset and ML Pipelines. ml and mllib reached feature parity in 2.0.0, and mllib is slowly being deprecated (this has already happened in the case of linear regression) and will most likely be removed in the next major release. So unless your goal is backward … Read more
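To illustrate the Pipeline-centric style of the new o.a.s.ml API, here is a minimal sketch; the column names (`text`, `label`) and the `training` DataFrame are assumptions for the example, not part of the original answer:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Each stage is an Estimator or Transformer operating on DataFrame columns.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Fitting the pipeline returns a PipelineModel (a Transformer):
// val model = pipeline.fit(training)  // training: DataFrame with "text" and "label"
```

The RDD-based mllib equivalent would instead wire these steps together by hand on `RDD[LabeledPoint]`, which is the main ergonomic difference between the two packages.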

Save ML model for future usage

Spark 2.0.0+ At first glance all Transformers and Estimators implement MLWritable with the following interface:

def write: MLWriter
def save(path: String): Unit

and MLReadable with the following interface:

def read: MLReader[T]
def load(path: String): T

This means that you can use the save method to write a model to disk, for example:

import org.apache.spark.ml.PipelineModel
val model: PipelineModel … Read more
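A short round-trip sketch of the save/load pair described above; the path and the fitted `model` value are assumptions for the example:

```scala
import org.apache.spark.ml.PipelineModel

// model: PipelineModel obtained from pipeline.fit(...) (assumed)
// save writes the pipeline metadata and stage data to a directory
// (a local path here, but an HDFS or S3 URI works the same way).
model.save("/tmp/spark-lr-model")

// Later, in another application: the companion object implements
// MLReadable, so load restores the fitted model from that directory.
val sameModel = PipelineModel.load("/tmp/spark-lr-model")
```

Note that save fails if the directory already exists; `model.write.overwrite().save(path)` replaces it.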

Is Spark’s KMeans unable to handle bigdata?

I think the ‘hanging’ is because your executors keep dying. As I mentioned in a side conversation, this code runs fine for me, locally and on a cluster, in Pyspark and Scala. However, it takes a lot longer than it should. It is almost all time spent in k-means|| initialization. I opened https://issues.apache.org/jira/browse/SPARK-17389 to track … Read more
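Since the time is dominated by k-means|| initialization, one practical mitigation is to cheapen or skip that phase. A hedged sketch using the RDD-based MLlib API; `data` and the parameter values are assumptions for the example:

```scala
import org.apache.spark.mllib.clustering.KMeans

// data: RDD[org.apache.spark.mllib.linalg.Vector] of feature vectors (assumed)
val model = new KMeans()
  .setK(1000)
  .setMaxIterations(20)
  // Random init avoids the expensive k-means|| phase entirely;
  // alternatively, keep k-means|| but lower setInitializationSteps
  // from the default of 5 to e.g. 2.
  .setInitializationMode(KMeans.RANDOM)
  .run(data)
```

Random initialization usually needs a few more Lloyd iterations to converge, but that trade is often worthwhile when initialization is the bottleneck.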

Spark MLlib LDA, how to infer the topics distribution of a new unseen document?

As of Spark 1.5 this functionality has not been implemented for the DistributedLDAModel. What you’re going to need to do is convert your model to a LocalLDAModel using the toLocal method and then call the topicDistributions(documents: RDD[(Long, Vector)]) method, where documents are the new (i.e. out-of-training) documents, something like this: newDocuments: RDD[(Long, Vector)] = … … Read more
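Putting the two steps together, a minimal sketch; `distributedModel` and the way `newDocuments` is built are assumptions, and the unseen documents must be vectorized with the same vocabulary as the training set:

```scala
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LocalLDAModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// distributedModel: DistributedLDAModel from LDA.run(...) (assumed)
val localModel: LocalLDAModel = distributedModel.toLocal

// Term-count vectors for unseen documents, keyed by document id,
// produced with the same vocabulary used at training time (assumed).
val newDocuments: RDD[(Long, Vector)] = ???

// One topic-distribution vector (length = k) per document id.
val topicDists: RDD[(Long, Vector)] = localModel.topicDistributions(newDocuments)
```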

Matrix Multiplication in Apache Spark [closed]

It all depends on the input data and dimensions, but generally speaking what you want is not an RDD but one of the distributed data structures from org.apache.spark.mllib.linalg.distributed. At this moment it provides four different implementations of DistributedMatrix: IndexedRowMatrix – can be created directly from an RDD[IndexedRow], where IndexedRow consists of a row index and an org.apache.spark.mllib.linalg.Vector … Read more
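As a concrete illustration of the IndexedRowMatrix variant, a small sketch; `sc` (the SparkContext) and the toy values are assumptions for the example:

```scala
import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Build a 2x2 distributed matrix from (rowIndex, vector) pairs.
val rows = sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 2.0)),
  IndexedRow(1L, Vectors.dense(3.0, 4.0))
))
val mat = new IndexedRowMatrix(rows)

// Multiply the distributed matrix by a local matrix on the right
// (here the 2x2 identity, stored column-major).
val local = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))
val product: IndexedRowMatrix = mat.multiply(local)
```

For distributed-times-distributed multiplication, BlockMatrix (another of the four implementations) is the structure that supports it directly.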

MatchError while accessing vector column in Spark 2.0

This has nothing to do with sparsity. Since Spark 2.0.0, ML Transformers no longer generate o.a.s.mllib.linalg.VectorUDT but o.a.s.ml.linalg.VectorUDT, and are mapped locally to subclasses of o.a.s.ml.linalg.Vector. These are not compatible with the old MLlib API, which is moving towards deprecation in Spark 2.0.0. You can convert to the “old” type using Vectors.fromML: import org.apache.spark.mllib.linalg.{Vectors => OldVectors} import … Read more
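A minimal sketch of the conversion in both directions; the toy vector values are assumptions for the example:

```scala
import org.apache.spark.ml.linalg.{Vectors => NewVectors}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}

// A "new" vector as produced by ml Transformers since 2.0.0.
val mlVec = NewVectors.dense(1.0, 2.0, 3.0)       // o.a.s.ml.linalg.Vector

// Convert to the "old" mllib type for code that still expects it.
val mllibVec = OldVectors.fromML(mlVec)           // o.a.s.mllib.linalg.Vector

// And back again, via asML on the mllib vector.
val roundTrip = mllibVec.asML
```

In pattern matches on DataFrame rows, the fix is correspondingly to match on o.a.s.ml.linalg.Vector rather than the mllib class, which is what triggers the MatchError.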