How to handle categorical features with spark-ml?

I just wanted to complete Holden's answer. Since Spark 2.3.0, OneHotEncoder has been deprecated and will be removed in 3.0.0. Please use OneHotEncoderEstimator instead. In Scala:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}

val df = Seq((0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3))
  .toDF("id", "category1", "category2")
val …
```

… Read more
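The encoding itself is easy to illustrate outside of Spark: one-hot encoding maps each category index (as produced by something like StringIndexer) to a vector that is all zeros except for a single 1.0, and by default Spark's encoder drops the last category so the resulting columns stay linearly independent. A minimal pure-Python sketch of that behavior (no Spark required; the function name and the `drop_last` flag are illustrative, mirroring Spark's `dropLast` parameter):

```python
def one_hot(index, num_categories, drop_last=True):
    """One-hot encode a category index.

    With drop_last=True (mimicking Spark's default dropLast=True),
    the vector has num_categories - 1 slots and the last category
    maps to the all-zeros vector.
    """
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

# Categories a=0, b=1, c=2, as a StringIndexer might assign them
print(one_hot(0, 3))  # [1.0, 0.0]
print(one_hot(2, 3))  # [0.0, 0.0] -- last category dropped
```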

Encode and assemble multiple features in PySpark

Spark >= 2.3, >= 3.0: Since Spark 2.3, OneHotEncoder is deprecated in favor of OneHotEncoderEstimator. If you use a recent release, please modify your encoder code:

```python
from pyspark.ml.feature import OneHotEncoderEstimator

encoder = OneHotEncoderEstimator(
    inputCols=["gender_numeric"],
    outputCols=["gender_vector"]
)
```

In Spark 3.0 this variant has been renamed back to OneHotEncoder:

```python
from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(
    inputCols=["gender_numeric"],
    outputCols=["gender_vector"]
)
```

… Read more
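After encoding, the individual features (scalars and one-hot vectors alike) still have to be combined into a single feature vector, which in Spark is the job of VectorAssembler. A pure-Python sketch of that concatenation step (the function name is illustrative, not a Spark API):

```python
def assemble(*features):
    """Flatten scalar and vector-valued features into one feature
    vector, roughly what Spark's VectorAssembler does: scalars are
    appended as-is, list-valued features are concatenated in order."""
    out = []
    for f in features:
        if isinstance(f, (list, tuple)):
            out.extend(float(x) for x in f)
        else:
            out.append(float(f))
    return out

# e.g. age, a one-hot gender vector, and height assembled together
print(assemble(34, [1.0, 0.0], 180.0))  # [34.0, 1.0, 0.0, 180.0]
```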

Finding duplicates from large data set using Apache Spark

Load the data into Spark and group by the email column; then, within each group of records sharing an email, apply a distance algorithm to the first-name and last-name columns. This should be pretty straightforward in Spark:

```scala
val df = sc.textFile("hdfs path of data")
df.mapToPair("email", <whole_record>)
  .groupBy(// will be done based on key)
  .map(// will …
```

… Read more
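The approach above, group by email, then fuzzily compare names within each group, can be sketched in plain Python to make the logic concrete before distributing it. This is a conceptual analogue, not Spark code; `find_duplicates`, the record field names, and the `max_dist` threshold are all illustrative, and the edit distance here is a standard Levenshtein implementation:

```python
from collections import defaultdict

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def find_duplicates(records, max_dist=2):
    """Group records by email, then flag pairs within a group whose
    combined first+last name edit distance is at most max_dist."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["email"]].append(rec)
    dups = []
    for recs in groups.values():
        for i in range(len(recs)):
            for j in range(i + 1, len(recs)):
                d = levenshtein(recs[i]["first"] + recs[i]["last"],
                                recs[j]["first"] + recs[j]["last"])
                if d <= max_dist:
                    dups.append((recs[i], recs[j]))
    return dups

records = [
    {"email": "a@x.com", "first": "Jon",  "last": "Smith"},
    {"email": "a@x.com", "first": "John", "last": "Smith"},
    {"email": "b@x.com", "first": "Ann",  "last": "Lee"},
]
print(find_duplicates(records))  # flags the Jon/John Smith pair
```

Because the pairwise comparison happens only inside each email group, the quadratic cost stays bounded by the group size, which is exactly why grouping first makes this tractable on a large data set.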