How to apply aggregate functions to many columns and extract them back

If you are going to run the same groupBy over the same columns, and you want to build all of the aggregation column references as avg(col_interested).as(col_interested_avg) for each element of columns_interestedList, you can create the references with a stream and pass them to the agg method: List<Column> avgCols = columns_interestedList.stream().map(col_interested -> avg(col_interested).as(col_interested …
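A minimal Scala sketch of the same idea, assuming a DataFrame df with a grouping column named group_col; the column names here are hypothetical:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.avg

// Hypothetical list of columns to average; substitute your own.
val interestedColumns = Seq("price", "quantity", "discount")

// Build one avg(...) reference per column, aliased as <name>_avg.
val avgCols: Seq[Column] = interestedColumns.map(c => avg(c).as(s"${c}_avg"))

// agg takes a first Column plus a varargs tail, hence the head/tail split.
val aggregated: DataFrame = df.groupBy("group_col").agg(avgCols.head, avgCols.tail: _*)

The head/tail split is just how the varargs signature of agg is satisfied from a collection; the same pattern works for any of the functions aggregates, not only avg.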

How to bucket the range of values from a column and count how many values fall into each interval in Scala?

You can use the Spark ML Bucketizer. There's a good example in the docs: https://spark.apache.org/docs/2.2.0/ml-features.html#bucketizer After you use the Bucketizer you have a DataFrame with a bucket index (e.g. indexes 1, 2, and 3 might correspond to values 1-5, 6-10, and 11-15, respectively). You can then do a .groupBy and .agg (or use SQL) to get a count of records …
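A runnable sketch of that approach, assuming a numeric column named value; the data and the split points are made up:

import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.functions.count

// Hypothetical data: 100 doubles in [0, 100).
val df = spark.range(0, 100).selectExpr("cast(id as double) as value")

// n + 1 split points define n intervals: [0,25), [25,50), [50,75), [75,100]
val bucketizer = new Bucketizer()
  .setInputCol("value")
  .setOutputCol("bucket")
  .setSplits(Array(0.0, 25.0, 50.0, 75.0, 100.0))

// Count how many values fall into each bucket index.
bucketizer.transform(df)
  .groupBy("bucket")
  .agg(count("*").as("cnt"))
  .orderBy("bucket")
  .show()

Note that the last interval is inclusive of the upper bound, and values outside the splits will raise an error unless you widen the splits (e.g. with Double.NegativeInfinity / Double.PositiveInfinity at the ends).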

Finding duplicates in a large data set using Apache Spark

Load the data into Spark and apply a group-by on the email column; after that, examine each bag (group of records) and apply a distance algorithm to the first-name and last-name columns. This should be pretty straightforward in Spark: val df = sc.textFile("hdfs path of data"); df.mapToPair("email", <whole_record>).groupBy(//will be done based on key).map(//will …
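The snippet above is pseudocode. Here is a minimal, runnable Scala sketch of the same idea in the DataFrame API, with the grouping expressed as a self-join on email and Levenshtein edit distance as the name-distance measure; the sample rows, column names, and the threshold of 2 are all assumptions:

import org.apache.spark.sql.functions.{levenshtein, monotonically_increasing_id}
import spark.implicits._

// Hypothetical records: email, first name, last name.
val people = Seq(
  ("a@x.com", "Jon",  "Smith"),
  ("a@x.com", "John", "Smith"),
  ("b@x.com", "Ann",  "Lee")
).toDF("email", "firstName", "lastName")
  .withColumn("id", monotonically_increasing_id())  // stable row id for pairing

// Pair up records that share an email, once per pair (l.id < r.id),
// then score name similarity with Levenshtein edit distance.
val pairs = people.as("l")
  .join(people.as("r"), $"l.email" === $"r.email" && $"l.id" < $"r.id")
  .withColumn("dist",
    levenshtein($"l.firstName", $"r.firstName") +
    levenshtein($"l.lastName",  $"r.lastName"))

// Small distances suggest duplicates; 2 is an arbitrary cutoff.
pairs.filter($"dist" <= 2)
  .select($"l.email", $"l.firstName", $"r.firstName", $"dist")
  .show()

The self-join keeps everything in the DataFrame API; if the email distribution is heavily skewed, comparing pairs inside each group after a groupByKey may be worth benchmarking instead.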