How to handle categorical features with spark-ml?

I just wanted to complete Holden's answer. Since Spark 2.3.0, OneHotEncoder has been deprecated and will be removed in 3.0.0. Please use OneHotEncoderEstimator instead. In Scala:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}

val df = Seq((0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3))
  .toDF("id", "category1", "category2")
val …
```

… Read more
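The encoding itself is easy to illustrate outside of Spark: one-hot encoding maps each category index (as produced by something like StringIndexer) to a vector that is all zeros except for a single 1.0, and by default Spark's encoder drops the last category so the resulting columns stay linearly independent. A minimal pure-Python sketch of that behavior (no Spark required; the function name and the `drop_last` flag are illustrative, mirroring Spark's `dropLast` parameter):

```python
def one_hot(index, num_categories, drop_last=True):
    """One-hot encode a category index.

    With drop_last=True (mimicking Spark's default dropLast=True),
    the vector has num_categories - 1 slots and the last category
    maps to the all-zeros vector.
    """
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

# Categories a=0, b=1, c=2, as a StringIndexer might assign them
print(one_hot(0, 3))  # [1.0, 0.0]
print(one_hot(2, 3))  # [0.0, 0.0] -- last category dropped
```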

Encode and assemble multiple features in PySpark

Spark >= 2.3, >= 3.0: Since Spark 2.3, OneHotEncoder is deprecated in favor of OneHotEncoderEstimator. If you use a recent release, please modify your encoder code:

```python
from pyspark.ml.feature import OneHotEncoderEstimator

encoder = OneHotEncoderEstimator(
    inputCols=["gender_numeric"],
    outputCols=["gender_vector"]
)
```

In Spark 3.0 this variant has been renamed back to OneHotEncoder:

```python
from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(
    inputCols=["gender_numeric"],
    outputCols=["gender_vector"]
)
```

… Read more
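After encoding, the individual features (scalars and one-hot vectors alike) still have to be combined into a single feature vector, which in Spark is the job of VectorAssembler. A pure-Python sketch of that concatenation step (the function name is illustrative, not a Spark API):

```python
def assemble(*features):
    """Flatten scalar and vector-valued features into one feature
    vector, roughly what Spark's VectorAssembler does: scalars are
    appended as-is, list-valued features are concatenated in order."""
    out = []
    for f in features:
        if isinstance(f, (list, tuple)):
            out.extend(float(x) for x in f)
        else:
            out.append(float(f))
    return out

# e.g. age, a one-hot gender vector, and height assembled together
print(assemble(34, [1.0, 0.0], 180.0))  # [34.0, 1.0, 0.0, 180.0]
```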

Finding duplicates from large data set using Apache Spark

Load the data into Spark and group by the email column; then, within each group of records sharing an email, apply a distance algorithm to the first-name and last-name columns. This should be pretty straightforward in Spark:

```scala
val df = sc.textFile("hdfs path of data")
df.mapToPair("email", <whole_record>)
  .groupBy(// will be done based on key)
  .map(// will …
```

… Read more
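The approach above, group by email, then fuzzily compare names within each group, can be sketched in plain Python to make the logic concrete before distributing it. This is a conceptual analogue, not Spark code; `find_duplicates`, the record field names, and the `max_dist` threshold are all illustrative, and the edit distance here is a standard Levenshtein implementation:

```python
from collections import defaultdict

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def find_duplicates(records, max_dist=2):
    """Group records by email, then flag pairs within a group whose
    combined first+last name edit distance is at most max_dist."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["email"]].append(rec)
    dups = []
    for recs in groups.values():
        for i in range(len(recs)):
            for j in range(i + 1, len(recs)):
                d = levenshtein(recs[i]["first"] + recs[i]["last"],
                                recs[j]["first"] + recs[j]["last"])
                if d <= max_dist:
                    dups.append((recs[i], recs[j]))
    return dups

records = [
    {"email": "a@x.com", "first": "Jon",  "last": "Smith"},
    {"email": "a@x.com", "first": "John", "last": "Smith"},
    {"email": "b@x.com", "first": "Ann",  "last": "Lee"},
]
print(find_duplicates(records))  # flags the Jon/John Smith pair
```

Because the pairwise comparison happens only inside each email group, the quadratic cost stays bounded by the group size, which is exactly why grouping first makes this tractable on a large data set.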