How does Distinct() function work in Spark?

.distinct() is definitely doing a shuffle across partitions. To see more of what’s happening, run a .toDebugString on your RDD. val hashPart = new HashPartitioner(<number of partitions>) val myRDDPreStep = <load some RDD> val myRDD = myRDDPreStep.distinct.partitionBy(hashPart).setName(“myRDD”).persist(StorageLevel.MEMORY_AND_DISK_SER) myRDD.checkpoint println(myRDD.toDebugString) which for an RDD example I have (myRDDPreStep is already hash-partitioned by key, persisted by StorageLevel.MEMORY_AND_DISK_SER, … Read more

Meteor: how to search for only distinct field values aka a collection.distinct(“fieldname”) similar to Mongo’s

You can just use underscore.js, which comes with Meteor – should be fine for the suggested use case. You might run into performance problems if you try this on a vast data set or run it repeatedly, but it should be adequate for common-or-garden usage: var distinctEntries = _.uniq(Collection.find({}, { sort: {myField: 1}, fields: {myField: … Read more