rdd - w3toppers.com

How to get element by Index in Spark RDD (Java)

This should be possible by first indexing the RDD. The transformation zipWithIndex provides a stable indexing, numbering each element in its original order. Given: rdd = (a,b,c) val withIndex = rdd.zipWithIndex // ((a,0),(b,1),(c,2)) To lookup an element by index, this form is not useful. First we need to use the index as key: val indexKey … Read more

How spark read a large file (petabyte) when file can not be fit in spark’s main memory

First of all, Spark only starts reading in the data when an action (like count, collect or write) is called. Once an action is called, Spark loads in data in partitions – the number of concurrently loaded partitions depend on the number of cores you have available. So in Spark you can think of 1 … Read more

Why does partition parameter of SparkContext.textFile not take effect?

If you take a look at the signature textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] you’ll see that the argument you use is called minPartitions and this pretty much describes its function. In some cases even that is ignored but it is a different matter. Input format which is used behind the scenes still decides … Read more

What is the difference between cache and persist?

With cache(), you use only the default storage level : MEMORY_ONLY for RDD MEMORY_AND_DISK for Dataset With persist(), you can specify which storage level you want for both RDD and Dataset. From the official docs: You can mark an RDD to be persisted using the persist() or cache() methods on it. each persisted RDD can … Read more

Explanation of fold method of spark RDD

Well, it is actually pretty well explained by the official documentation: Aggregate the elements of each partition, and then the results for all the partitions, using a given associative and commutative function and a neutral “zero value”. The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid … Read more

Apache spark dealing with case statements

These are few ways to write If-Else / When-Then-Else / When-Otherwise expression in pyspark. Sample dataframe df = spark.createDataFrame([(1,1),(2,2),(3,3)],[‘id’,’value’]) df.show() #+—+—–+ #| id|value| #+—+—–+ #| 1| 1| #| 2| 2| #| 3| 3| #+—+—–+ #Desired Output: #+—+—–+———-+ #| id|value|value_desc| #+—+—–+———-+ #| 1| 1| one| #| 2| 2| two| #| 3| 3| other| #+—+—–+———-+ Option#1: withColumn() … Read more

Reduce a key-value pair into a key-list pair with Apache Spark

Map and ReduceByKey Input type and output type of reduce must be the same, therefore if you want to aggregate a list, you have to map the input to lists. Afterwards you combine the lists into one list. Combining lists You’ll need a method to combine lists into one list. Python provides some methods to … Read more

PySpark DataFrames – way to enumerate without converting to Pandas?

It doesn’t work because: the second argument for withColumn should be a Column not a collection. np.array won’t work here when you pass “index in indexes” as a SQL expression to where indexes is out of scope and it is not resolved as a valid identifier PySpark >= 1.4.0 You can add row numbers using … Read more

Why does sortBy transformation trigger a Spark job?

sortBy is implemented using sortByKey which depends on a RangePartitioner (JVM) or partitioning function (Python). When you call sortBy / sortByKey partitioner (partitioning function) is initialized eagerly and samples input RDD to compute partition boundaries. Job you see corresponds to this process. Actual sorting is performed only if you execute an action on the newly … Read more

Apache Spark: What is the equivalent implementation of RDD.groupByKey() using RDD.aggregateByKey()?

As far as I can tell there is nothing to gain* in this particular case by using aggregateByKey or a similar function. Since you’re building a list there is no “real” reduction and amount of data which has to be shuffled is more or less the same. To really observe some performance gain you need … Read more