Finding duplicates in a large data set using Apache Spark

Load the data into Spark and group it by the email column. Then, within each resulting group (bag) of records that share an email, apply a string-distance algorithm (for example Levenshtein) to the first name and last name columns to decide which records are true duplicates. This should be fairly straightforward in Spark:

val records = sc.textFile("hdfs path of data")

records
  .map(line => (line.split(",")(0), line)) // key each record by its email column
  .groupByKey()                            // collect one bag of records per email
  .map { case (email, bag) => bag }        // run the name-distance check on each bag
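
Below is a more complete sketch of the same idea. It assumes a simple CSV layout of email,firstName,lastName, a hand-rolled Levenshtein function as the distance measure, and an arbitrary threshold of 2 edits per name; the path, column order, and threshold are placeholders you would adapt to your own data.

import org.apache.spark.sql.SparkSession

object DuplicateFinder {

  // Plain Levenshtein edit distance; cheap enough for short name strings.
  def levenshtein(a: String, b: String): Int = {
    val dist = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
      if (i == 0) j else if (j == 0) i else 0
    }
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val cost = if (a(i - 1) == b(j - 1)) 0 else 1
      dist(i)(j) = math.min(
        math.min(dist(i - 1)(j) + 1, dist(i)(j - 1) + 1),
        dist(i - 1)(j - 1) + cost)
    }
    dist(a.length)(b.length)
  }

  case class Person(email: String, firstName: String, lastName: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("find-duplicates").getOrCreate()
    val sc = spark.sparkContext

    // Assumed CSV layout: email,firstName,lastName -- adjust the parsing to your data.
    val people = sc.textFile("hdfs path of data").map { line =>
      val cols = line.split(",").map(_.trim)
      Person(cols(0).toLowerCase, cols(1), cols(2))
    }

    // Group records that share an email, then flag pairs whose names are near-identical.
    val duplicatePairs = people
      .keyBy(_.email)
      .groupByKey()
      .flatMap { case (_, bag) =>
        val recs = bag.toVector
        for {
          i <- recs.indices
          j <- (i + 1) until recs.size
          a = recs(i); b = recs(j)
          if levenshtein(a.firstName, b.firstName) <= 2 &&
             levenshtein(a.lastName, b.lastName) <= 2
        } yield (a, b)
      }

    duplicatePairs.take(20).foreach(println)
    spark.stop()
  }
}

Note that the comparison inside each bag is all-pairs, so it is quadratic in the size of the group; that is fine when few records share an email, but for very large groups you would want an additional blocking key or an approximate technique such as MinHash/LSH.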
