Spark ML VectorAssembler returns strange output

There is nothing strange about the output. Your vector seems to have lots of zero elements, so Spark used its sparse representation. To explain further: your vector appears to be composed of 18 elements (dimensions). The indices [0,1,6,9,14,17] of the vector contain the nonzero elements, which are, in order, [17.0,15.0,3.0,1.0,4.0,2.0]. Sparse Vector representation … Read more
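
To make the sparse form concrete, here is a minimal sketch in plain Python (no Spark required) that expands the size, indices, and values quoted above into the equivalent dense list. The helper name `sparse_to_dense` is hypothetical and only for illustration; it is not part of the Spark API.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size              # every position defaults to zero
    for i, v in zip(indices, values):
        dense[i] = v                  # place each stored value at its index
    return dense


# Values taken from the answer above.
size = 18
indices = [0, 1, 6, 9, 14, 17]
values = [17.0, 15.0, 3.0, 1.0, 4.0, 2.0]

dense = sparse_to_dense(size, indices, values)
print(dense)
```

Only 6 of the 18 positions are nonzero, which is why storing just the indices and values is more compact than storing the full dense vector.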

Spark DataFrames when udf functions do not accept large enough input variables

User defined functions are defined for up to 22 parameters, but the udf helper is defined for at most 10 arguments. To handle functions with a larger number of parameters you can use org.apache.spark.sql.UDFRegistration. For example: val dummy = (( x0: Int, x1: Int, x2: Int, x3: Int, x4: Int, x5: Int, x6: Int, x7: Int, x8: … Read more

How to vectorize DataFrame columns for ML algorithms?

You can simply foldLeft over the Array of columns: val transformed: DataFrame = df.columns.foldLeft(df)((df, arg) => str(arg, df)) Still, I would argue that it is not a good approach. Since str discards the StringIndexerModel, it cannot be used when you get new data. Because of that I would recommend using Pipeline: import org.apache.spark.ml.Pipeline val transformers: Array[org.apache.spark.ml.PipelineStage] … Read more

Spark mllib predicting weird number or NaN

The problem is that LinearRegressionWithSGD uses stochastic gradient descent (SGD) to optimize the weight vector of your linear model. SGD is very sensitive to the provided stepSize, which is used to update the intermediate solution. What SGD does is calculate the gradient g of the cost function given a sample of the input points … Read more
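
The sensitivity to the step size can be reproduced without Spark. The following is a rough sketch in plain Python of mini-batch SGD on a one-weight least-squares problem; it is not MLlib's implementation, and the function `sgd_fit`, the toy data, the constant step size, and the two step values are all assumptions chosen for illustration.

```python
import math
import random


def sgd_fit(xs, ys, step_size, iterations=500, batch_size=2, seed=0):
    """Fit y = w * x with mini-batch SGD on squared error (toy sketch)."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(iterations):
        batch = rng.sample(data, batch_size)
        # Gradient of (1 / 2m) * sum((w * x - y)^2) with respect to w.
        g = sum((w * x - y) * x for x, y in batch) / batch_size
        w -= step_size * g            # update the intermediate solution
    return w


# Unscaled features: x^2 is in the hundreds, so the gradient is large.
xs = [10.0, 20.0, 30.0, 40.0]
ys = [2.0 * x for x in xs]            # true weight is 2.0

w_small = sgd_fit(xs, ys, step_size=1e-4)
w_large = sgd_fit(xs, ys, step_size=1.0)

print("small step:", w_small)
print("large step:", w_large, "finite:", math.isfinite(w_large))
```

With the small step each update shrinks the error, while with the large step every update overshoots by a growing amount until the weight overflows and stops being a finite number. Scaling the features or lowering the step size are the usual ways out.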

Serialize a custom transformer using python to be used within a Pyspark ML pipeline

As of Spark 2.3.0 there’s a much, much better way to do this. Simply extend DefaultParamsWritable and DefaultParamsReadable and your class will automatically have write and read methods that will save your params and will be used by the PipelineModel serialization system. The docs were not really clear, and I had to do a bit … Read more