Encode and assemble multiple features in PySpark

Spark >= 2.3, >= 3.0

Since Spark 2.3, OneHotEncoder is deprecated in favor of OneHotEncoderEstimator. If you use a recent release, please modify the encoder code:

    from pyspark.ml.feature import OneHotEncoderEstimator

    encoder = OneHotEncoderEstimator(
        inputCols=["gender_numeric"],
        outputCols=["gender_vector"]
    )

In Spark 3.0 this variant has been renamed to OneHotEncoder:

    from pyspark.ml.feature import OneHotEncoder

    encoder = OneHotEncoder(
        inputCols=["gender_numeric"],
        outputCols=["gender_vector"]
    )

…
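
The version rule described above can be sketched as a small lookup. This is only an illustration in plain Python: the helper name `recommended_encoder_class` is hypothetical and not part of PySpark, and it assumes a dotted version string such as "2.4.8".

```python
def recommended_encoder_class(spark_version: str) -> str:
    """Return the one-hot encoder class name suggested for a Spark version.

    Hypothetical helper (not a PySpark API). It only encodes the rule from
    the text: OneHotEncoderEstimator for 2.3 and 2.4, OneHotEncoder from 3.0.
    """
    # Compare only the major and minor components, e.g. "2.4.8" -> (2, 4).
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    if (major, minor) >= (3, 0):
        # Spark 3.0 renamed the estimator variant back to OneHotEncoder.
        return "OneHotEncoder"
    if (major, minor) >= (2, 3):
        # Spark 2.3 and 2.4: the original OneHotEncoder is deprecated.
        return "OneHotEncoderEstimator"
    # Before 2.3 only the original OneHotEncoder exists.
    return "OneHotEncoder"


print(recommended_encoder_class("2.4.8"))
print(recommended_encoder_class("3.1.2"))
```

With PySpark installed, the returned name could presumably be resolved with `getattr(pyspark.ml.feature, name)` after passing in `pyspark.__version__`. Note that the class available before Spark 2.3 shares its name with the 3.0 one but does not take `inputCols` and `outputCols`.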

Spark Scala: How to convert DataFrame[vector] to DataFrame[f1: Double, …, fn: Double]

Spark >= 3.0.0

Since Spark 3.0 you can use vector_to_array:

    import org.apache.spark.ml.functions.vector_to_array

    testDF.select(vector_to_array($"scaledFeatures").alias("_tmp")).select(exprs:_*)

Spark < 3.0.0

One possible approach is something similar to this:

    import org.apache.spark.sql.functions.udf

    // In Spark 1.x you'll have to replace the ML Vector with the MLlib one
    // import org.apache.spark.mllib.linalg.Vector
    // In 2.x the below is usually the right choice
    import org.apache.spark.ml.linalg.Vector

…

How to split Vector into columns – using PySpark

Spark >= 3.0.0

Since Spark 3.0.0 this can be done without using a UDF.

    from pyspark.ml.functions import vector_to_array
    from pyspark.sql.functions import col

    (df
        .withColumn("xs", vector_to_array("vector"))
        .select(["word"] + [col("xs")[i] for i in range(3)]))

    ## +-------+-----+-----+-----+
    ## |   word|xs[0]|xs[1]|xs[2]|
    ## +-------+-----+-----+-----+
    ## | assert|  1.0|  2.0|  3.0|
    ## |require|  0.0|  2.0|  0.0|
    ## +-------+-----+-----+-----+

Spark < 3.0.0

One possible approach is to …
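
To make the Spark >= 3.0.0 result concrete, here is a plain-Python stand-in for what the select does to each row. It does not use Spark at all: `split_vector_columns` is a hypothetical helper, and rows are modeled as dictionaries with the same `word` and `vector` fields and the same values as the output table.

```python
def split_vector_columns(rows, vector_key="vector", prefix="xs"):
    """Expand a list-valued field into one field per element.

    Hypothetical, Spark-free illustration of the vector_to_array + select
    pattern; output keys follow the xs[0], xs[1], ... column naming.
    """
    flattened = []
    for row in rows:
        # Keep every field except the vector itself.
        out = {key: value for key, value in row.items() if key != vector_key}
        # Add one field per vector element, named like the Spark columns.
        for i, value in enumerate(row[vector_key]):
            out[f"{prefix}[{i}]"] = float(value)
        flattened.append(out)
    return flattened


rows = [
    {"word": "assert", "vector": [1.0, 2.0, 3.0]},
    {"word": "require", "vector": [0.0, 2.0, 0.0]},
]
result = split_vector_columns(rows)
for row in result:
    print(row)
```

Unlike the Spark version, which indexes a fixed `range(3)`, this sketch takes the number of output fields from each row's vector length.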