Random number generation in PySpark

So the actual problem here is relatively simple. Each subprocess in Python inherits its state from its parent: len(set(sc.parallelize(range(4), 4).map(lambda _: random.getstate()).collect())) # 1 Since the parent state has no reason to change in this particular scenario and workers have a limited lifespan, the state of every child will be exactly the same on each run.
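One common workaround, sketched below under the assumption that sc is an existing SparkContext, is to reseed the generator inside each task, for example per partition with mapPartitionsWithIndex (the per-partition seed here is just an illustration; any per-partition seeding scheme would do):

import random

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Every worker inherits the same RNG state from the parent process,
# so all partitions report an identical state.
states = sc.parallelize(range(4), 4).map(lambda _: random.getstate()).collect()
print(len(set(states)))  # 1

def reseeded(index, iterator):
    # Reseed per partition so each task draws from its own stream.
    random.seed(index)
    return (random.random() for _ in iterator)

print(sc.parallelize(range(4), 4).mapPartitionsWithIndex(reseeded).collect())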

Spark ML VectorAssembler returns strange output

There is nothing strange about the output. Your vector seems to have lots of zero elements, so Spark used its sparse representation. To explain further: it seems like your vector is composed of 18 elements (its dimension). The indices [0,1,6,9,14,17] of the vector contain non-zero elements, which are, in order, [17.0,15.0,3.0,1.0,4.0,2.0]. Sparse Vector representation … Read more
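For reference, the sparse and dense forms of that vector can be written out directly with pyspark.ml.linalg (the size, indices, and values below are the ones quoted in the answer):

from pyspark.ml.linalg import Vectors

# Sparse form stores (size, indices, values); the dense form spells out
# every element, zeros included.
sparse = Vectors.sparse(18, [0, 1, 6, 9, 14, 17], [17.0, 15.0, 3.0, 1.0, 4.0, 2.0])
print(sparse)            # (18,[0,1,6,9,14,17],[17.0,15.0,3.0,1.0,4.0,2.0])
print(sparse.toArray())  # dense array with the zeros filled in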

Spark textFile vs wholeTextFiles

The main difference, as you mentioned, is that textFile returns an RDD with each line as an element, while wholeTextFiles returns a PairRDD with the key being the file path and the value the entire file content. If there is no need to separate the data depending on the file, simply use textFile. When reading uncompressed files with textFile, it will … Read more
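A minimal PySpark sketch of the two APIs, assuming an existing SparkContext sc and a placeholder input path:

# One element per line; file boundaries are lost.
lines = sc.textFile("data/logs/")         # RDD[str]

# One element per file: (file path, entire file content).
files = sc.wholeTextFiles("data/logs/")   # RDD[(str, str)]

# Keeping the file association, e.g. counting lines per file.
line_counts = files.mapValues(lambda content: len(content.splitlines()))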

Difference between === null and isNull in Spark DataFrame

First and foremost, don't use null in your Scala code unless you really have to for compatibility reasons. Regarding your question, it is plain SQL. col("c1") === null is interpreted as c1 = NULL and, because NULL marks undefined values, the result is undefined for any value, including NULL itself. spark.sql("SELECT NULL = NULL").show +-------------+ |(NULL … Read more
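The same behaviour is easy to reproduce from PySpark; this is just an illustrative sketch with a made-up two-row DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (None,)], ["c1"])

# c1 = NULL evaluates to NULL for every row, so the filter keeps nothing.
df.filter(col("c1") == None).count()   # 0

# IS NULL is the proper test for undefined values.
df.filter(col("c1").isNull()).count()  # 1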

Filter Spark DataFrame on string contains

You can use contains (this works with an arbitrary sequence): df.filter($"foo".contains("bar")), like (SQL LIKE with the simple SQL pattern syntax, with _ matching an arbitrary character and % matching an arbitrary sequence): df.filter($"foo".like("bar")), or rlike (like with Java regular expressions): df.filter($"foo".rlike("bar")), depending on your requirements. LIKE and RLIKE should work with SQL expressions as well.
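The same three options in PySpark, as a rough sketch (the DataFrame and the patterns are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("foobar",), ("baz",)], ["foo"])

df.filter(col("foo").contains("bar")).show()   # substring match
df.filter(col("foo").like("%bar%")).show()     # SQL LIKE pattern (_ and %)
df.filter(col("foo").rlike("ba[rz]")).show()   # Java regular expression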

Understanding Spark’s caching

It would seem that Option B is required. The reason is related to how persist/cache and unpersist are executed by Spark. Since RDD transformations merely build DAG descriptions without execution, in Option A by the time you call unpersist, you still only have job descriptions and not a running execution. This is relevant because a … Read more
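The lazy-evaluation point can be sketched in PySpark; this is not the asker's exact Option A/B, only the general pattern of unpersisting before versus after an action, assuming an existing SparkContext sc:

rdd = sc.parallelize(range(10)).map(lambda x: x * x)

# Unpersisting before any action: nothing was ever materialized, so the
# cache flag is simply dropped and count() recomputes everything.
rdd.cache()
rdd.unpersist()
rdd.count()

# Unpersisting after an action: the data was actually cached and reused
# until it is explicitly released.
rdd.cache()
rdd.count()      # materializes and caches the RDD
rdd.count()      # served from the cache
rdd.unpersist()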