Finding duplicates in a large data set using Apache Spark

Load the data into Spark and group it by the email column. Then, within each resulting group (bag) of records that share an email, apply a string-distance algorithm (for example Levenshtein) to the first name and last name columns to decide which records are true duplicates. This should be fairly straightforward in Spark:

val records = sc.textFile("hdfs path of data")

records
  .map(line => (line.split(",")(0), line)) // key each record by its email column
  .groupByKey()                            // collect one bag of records per email
  .map { case (email, bag) => bag }        // run the name-distance check on each bag
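
Below is a more complete sketch of the same idea. It assumes a simple CSV layout of email,firstName,lastName, a hand-rolled Levenshtein function as the distance measure, and an arbitrary threshold of 2 edits per name; the path, column order, and threshold are placeholders you would adapt to your own data.

import org.apache.spark.sql.SparkSession

object DuplicateFinder {

  // Plain Levenshtein edit distance; cheap enough for short name strings.
  def levenshtein(a: String, b: String): Int = {
    val dist = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
      if (i == 0) j else if (j == 0) i else 0
    }
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val cost = if (a(i - 1) == b(j - 1)) 0 else 1
      dist(i)(j) = math.min(
        math.min(dist(i - 1)(j) + 1, dist(i)(j - 1) + 1),
        dist(i - 1)(j - 1) + cost)
    }
    dist(a.length)(b.length)
  }

  case class Person(email: String, firstName: String, lastName: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("find-duplicates").getOrCreate()
    val sc = spark.sparkContext

    // Assumed CSV layout: email,firstName,lastName -- adjust the parsing to your data.
    val people = sc.textFile("hdfs path of data").map { line =>
      val cols = line.split(",").map(_.trim)
      Person(cols(0).toLowerCase, cols(1), cols(2))
    }

    // Group records that share an email, then flag pairs whose names are near-identical.
    val duplicatePairs = people
      .keyBy(_.email)
      .groupByKey()
      .flatMap { case (_, bag) =>
        val recs = bag.toVector
        for {
          i <- recs.indices
          j <- (i + 1) until recs.size
          a = recs(i); b = recs(j)
          if levenshtein(a.firstName, b.firstName) <= 2 &&
             levenshtein(a.lastName, b.lastName) <= 2
        } yield (a, b)
      }

    duplicatePairs.take(20).foreach(println)
    spark.stop()
  }
}

Note that the comparison inside each bag is all-pairs, so it is quadratic in the size of the group; that is fine when few records share an email, but for very large groups you would want an additional blocking key or an approximate technique such as MinHash/LSH.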
