Random number generation in PySpark

So the actual problem here is relatively simple. Each subprocess in Python inherits its state from its parent: len(set(sc.parallelize(range(4), 4).map(lambda _: random.getstate()).collect())) # 1 Since the parent state has no reason to change in this particular scenario and workers have a limited lifespan, the state of every child will be exactly the same on each run.
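One common workaround, sketched below under the assumption that sc is an existing SparkContext, is to reseed the generator inside each task, for example per partition with mapPartitionsWithIndex (the per-partition seed here is just an illustration; any per-partition seeding scheme would do):

import random

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Every worker inherits the same RNG state from the parent process,
# so all partitions report an identical state.
states = sc.parallelize(range(4), 4).map(lambda _: random.getstate()).collect()
print(len(set(states)))  # 1

def reseeded(index, iterator):
    # Reseed per partition so each task draws from its own stream.
    random.seed(index)
    return (random.random() for _ in iterator)

print(sc.parallelize(range(4), 4).mapPartitionsWithIndex(reseeded).collect())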

Spark ML VectorAssembler returns strange output

There is nothing strange about the output. Your vector seems to have lots of zero elements, so Spark used its sparse representation. To explain further: it seems like your vector is composed of 18 elements (its dimension). The indices [0,1,6,9,14,17] of the vector contain non-zero elements, which are, in order, [17.0,15.0,3.0,1.0,4.0,2.0]. Sparse Vector representation … Read more
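For reference, the sparse and dense forms of that vector can be written out directly with pyspark.ml.linalg (the size, indices, and values below are the ones quoted in the answer):

from pyspark.ml.linalg import Vectors

# Sparse form stores (size, indices, values); the dense form spells out
# every element, zeros included.
sparse = Vectors.sparse(18, [0, 1, 6, 9, 14, 17], [17.0, 15.0, 3.0, 1.0, 4.0, 2.0])
print(sparse)            # (18,[0,1,6,9,14,17],[17.0,15.0,3.0,1.0,4.0,2.0])
print(sparse.toArray())  # dense array with the zeros filled in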

Spark textFile vs wholeTextFiles

The main difference, as you mentioned, is that textFile returns an RDD with each line as an element, while wholeTextFiles returns a PairRDD with the key being the file path and the value the entire file content. If there is no need to separate the data depending on the file, simply use textFile. When reading uncompressed files with textFile, it will … Read more
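A minimal PySpark sketch of the two APIs, assuming an existing SparkContext sc and a placeholder input path:

# One element per line; file boundaries are lost.
lines = sc.textFile("data/logs/")         # RDD[str]

# One element per file: (file path, entire file content).
files = sc.wholeTextFiles("data/logs/")   # RDD[(str, str)]

# Keeping the file association, e.g. counting lines per file.
line_counts = files.mapValues(lambda content: len(content.splitlines()))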

Difference between === null and isNull in Spark DataFrame

First and foremost, don't use null in your Scala code unless you really have to for compatibility reasons. Regarding your question, it is plain SQL. col("c1") === null is interpreted as c1 = NULL and, because NULL marks undefined values, the result is undefined for any value, including NULL itself. spark.sql("SELECT NULL = NULL").show +-------------+ |(NULL … Read more
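The same behaviour is easy to reproduce from PySpark; this is just an illustrative sketch with a made-up two-row DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (None,)], ["c1"])

# c1 = NULL evaluates to NULL for every row, so the filter keeps nothing.
df.filter(col("c1") == None).count()   # 0

# IS NULL is the proper test for undefined values.
df.filter(col("c1").isNull()).count()  # 1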

Filter Spark DataFrame on string contains

You can use contains (this works with an arbitrary sequence): df.filter($"foo".contains("bar")), like (SQL LIKE with the simple SQL pattern syntax, with _ matching an arbitrary character and % matching an arbitrary sequence): df.filter($"foo".like("bar")), or rlike (like with Java regular expressions): df.filter($"foo".rlike("bar")), depending on your requirements. LIKE and RLIKE should work with SQL expressions as well.
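The same three options in PySpark, as a rough sketch (the DataFrame and the patterns are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("foobar",), ("baz",)], ["foo"])

df.filter(col("foo").contains("bar")).show()   # substring match
df.filter(col("foo").like("%bar%")).show()     # SQL LIKE pattern (_ and %)
df.filter(col("foo").rlike("ba[rz]")).show()   # Java regular expression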

Understanding Spark’s caching

It would seem that Option B is required. The reason is related to how persist/cache and unpersist are executed by Spark. Since RDD transformations merely build DAG descriptions without execution, in Option A by the time you call unpersist, you still only have job descriptions and not a running execution. This is relevant because a … Read more
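The lazy-evaluation point can be sketched in PySpark; this is not the asker's exact Option A/B, only the general pattern of unpersisting before versus after an action, assuming an existing SparkContext sc:

rdd = sc.parallelize(range(10)).map(lambda x: x * x)

# Unpersisting before any action: nothing was ever materialized, so the
# cache flag is simply dropped and count() recomputes everything.
rdd.cache()
rdd.unpersist()
rdd.count()

# Unpersisting after an action: the data was actually cached and reused
# until it is explicitly released.
rdd.cache()
rdd.count()      # materializes and caches the RDD
rdd.count()      # served from the cache
rdd.unpersist()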