Spark ML VectorAssembler returns strange output

There is nothing strange about the output. Your vector seems to have lots of zero elements, so Spark used its sparse representation. To explain further: it seems your vector is composed of 18 elements (its dimension). The indices [0,1,6,9,14,17] of the vector contain non-zero elements, which are, in order, [17.0,15.0,3.0,1.0,4.0,2.0]. Sparse Vector representation … Read more
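To see the equivalence concretely, here is a minimal sketch (not part of the original answer) that rebuilds that exact sparse vector with pyspark.ml.linalg and expands it to its dense form:

```python
from pyspark.ml.linalg import Vectors

# Size-18 vector with non-zero values only at the listed indices.
sparse_vec = Vectors.sparse(18, [0, 1, 6, 9, 14, 17],
                            [17.0, 15.0, 3.0, 1.0, 4.0, 2.0])

# toArray() shows the equivalent dense representation: 18 entries,
# zeros everywhere except positions 0, 1, 6, 9, 14 and 17.
print(sparse_vec.toArray())
```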

spark.ml StringIndexer throws ‘Unseen label’ on fit()

Unseen label is a generic message which doesn't correspond to a specific column. The most likely problem is with the following stage: StringIndexer(inputCol="lang", outputCol="lang_idx"), with pl-PL present in train("lang") but not present in test("lang"). You can correct it using setHandleInvalid with skip: from pyspark.ml.feature import StringIndexer train = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["k", "v"]) test = sc.parallelize([(3, … Read more
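As a runnable sketch of the fix (SparkSession-based rather than the sc.parallelize snippet above, and with made-up data), setting handleInvalid to "skip" makes the fitted indexer drop rows whose labels it has never seen:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.getOrCreate()

# "baz" appears only in the test set, so it is an unseen label.
train = spark.createDataFrame([(1, "foo"), (2, "bar")], ["k", "v"])
test = spark.createDataFrame([(3, "foo"), (4, "baz")], ["k", "v"])

indexer = StringIndexer(inputCol="v", outputCol="v_idx",
                        handleInvalid="skip")
model = indexer.fit(train)

# Rows with unseen labels are silently dropped instead of raising
# the 'Unseen label' exception.
model.transform(test).show()
```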

How to vectorize DataFrame columns for ML algorithms?

You can simply foldLeft over the Array of columns: val transformed: DataFrame = df.columns.foldLeft(df)((df, arg) => str(arg, df)) Still, I would argue that it is not a good approach. Since str discards the StringIndexerModel, it cannot be used when you get new data. Because of that I would recommend using a Pipeline: import org.apache.spark.ml.Pipeline val transformers: Array[org.apache.spark.ml.PipelineStage] … Read more
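The answer above is in Scala; as a hedged PySpark sketch of the same Pipeline idea (the column names are made up), one StringIndexer stage per string column keeps every fitted StringIndexerModel inside the PipelineModel, so the same mapping can be reapplied to new data:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("a", "x", 1.0), ("b", "y", 2.0)],
                           ["c1", "c2", "label"])

# One indexer per string column, analogous to folding over df.columns.
stages = [StringIndexer(inputCol=c, outputCol=c + "_idx")
          for c in ["c1", "c2"]]

model = Pipeline(stages=stages).fit(df)

# The fitted PipelineModel keeps each StringIndexerModel, so it can
# later be applied to unseen data as well.
model.transform(df).show()
```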

How do I convert an array (i.e. list) column to Vector

Personally, I would go with a Python UDF and wouldn't bother with anything else: Vectors are not native SQL types, so there will be performance overhead one way or another. In particular, this process requires two steps where data is first converted from the external type to a row, and then from the row to the internal representation using generic … Read more
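A minimal sketch of the UDF approach (column names invented for illustration) wraps the array column in an ml Vector via VectorUDT:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, [0.0, 1.0, 2.0]), (2, [3.0, 4.0, 5.0])],
                           ["id", "xs"])

# Plain Python UDF that turns the array column into an ml DenseVector.
list_to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())

df.withColumn("features", list_to_vector("xs")).show(truncate=False)
```

On recent Spark versions (3.1+) there is also a built-in pyspark.ml.functions.array_to_vector that performs the same conversion without a hand-written UDF.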

Serialize a custom transformer using Python to be used within a PySpark ML pipeline

As of Spark 2.3.0 there’s a much, much better way to do this. Simply extend DefaultParamsWritable and DefaultParamsReadable and your class will automatically have write and read methods that will save your params and will be used by the PipelineModel serialization system. The docs were not really clear, and I had to do a bit … Read more
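A minimal sketch of such a transformer (the class and its column logic are invented for illustration; only the mixins and Param plumbing follow the pattern described above):

```python
from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import functions as F


class ColumnDoubler(Transformer, HasInputCol, HasOutputCol,
                    DefaultParamsReadable, DefaultParamsWritable):
    """Hypothetical transformer that doubles a numeric input column."""

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(ColumnDoubler, self).__init__()
        # Params live on the instance, so DefaultParamsWritable can
        # serialize them without any extra code.
        kwargs = self._input_kwargs
        self._set(**kwargs)

    def _transform(self, dataset):
        return dataset.withColumn(self.getOutputCol(),
                                  F.col(self.getInputCol()) * 2)


# With the mixins in place, save/load come for free:
# ColumnDoubler(inputCol="x", outputCol="x2").save("/tmp/doubler")
# loaded = ColumnDoubler.load("/tmp/doubler")
```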

What’s the difference between Spark ML and MLLIB packages

o.a.s.mllib contains the old RDD-based API, while o.a.s.ml contains the new API built around Dataset and ML Pipelines. ml and mllib reached feature parity in 2.0.0, and mllib is slowly being deprecated (this has already happened in the case of linear regression) and most likely will be removed in the next major release. So unless your goal is backward … Read more
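As a side-by-side sketch of the two APIs (toy data, shown only for contrast):

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression              # new, DataFrame-based
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS  # old, RDD-based

spark = SparkSession.builder.getOrCreate()

# pyspark.ml operates on DataFrames with a Vector features column...
df = spark.createDataFrame([(0.0, Vectors.dense([0.0, 1.0])),
                            (1.0, Vectors.dense([1.0, 0.0]))],
                           ["label", "features"])
ml_model = LogisticRegression().fit(df)

# ...while pyspark.mllib operates on RDDs of LabeledPoint.
rdd = spark.sparkContext.parallelize([LabeledPoint(0.0, [0.0, 1.0]),
                                      LabeledPoint(1.0, [1.0, 0.0])])
mllib_model = LogisticRegressionWithLBFGS.train(rdd)
```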

Save ML model for future usage

Spark 2.0.0+ At first glance all Transformers and Estimators implement MLWritable with the following interface: def write: MLWriter def save(path: String): Unit and MLReadable with the following interface: def read: MLReader[T] def load(path: String): T This means that you can use the save method to write a model to disk, for example: import org.apache.spark.ml.PipelineModel val model: PipelineModel … Read more
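The snippet above is Scala; as a hedged PySpark equivalent (the path is just a placeholder), saving and reloading a fitted PipelineModel looks like this:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "foo"), (2, "bar")], ["k", "v"])
model = Pipeline(stages=[StringIndexer(inputCol="v",
                                       outputCol="v_idx")]).fit(df)

# save() persists all fitted stages and their params to disk.
model.save("/tmp/indexer-pipeline-model")

# load() restores the model later, e.g. in a separate job.
restored = PipelineModel.load("/tmp/indexer-pipeline-model")
restored.transform(df).show()
```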