apache-spark - w3toppers.com

What will spark do if I don’t have enough memory?

I think this question has been well answered in the FAQ panel of Spark website (https://spark.apache.org/faq.html): What happens if my dataset does not fit in memory? Often each partition of data is small and does fit in memory, and these partitions are processed a few at a time. For very large partitions that do not … Read more

Understanding Spark’s caching

It would seem that Option B is required. The reason is related to how persist/cache and unpersist are executed by Spark. Since RDD transformations merely build DAG descriptions without execution, in Option A by the time you call unpersist, you still only have job descriptions and not a running execution. This is relevant because a … Read more

What is an optimized way of joining large tables in Spark SQL

My default advice on how to optimize joins is: Use a broadcast join if you can (see this notebook). From your question it seems your tables are large and a broadcast join is not an option. Consider using a very large cluster (it’s cheaper that you may think). $250 right now (6/2016) buys about 24 … Read more

PySpark error: AttributeError: ‘NoneType’ object has no attribute ‘_jvm’

Mariusz answer didn’t really help me. So if you like me found this because it’s the only result on google and you’re new to pyspark (and spark in general), here’s what worked for me. In my case I was getting that error because I was trying to execute pyspark code before the pyspark environment had … Read more

spark.ml StringIndexer throws ‘Unseen label’ on fit()

Unseen label is a generic message which doesn’t correspond to a specific column. Most likely problem is with a following stage: StringIndexer(inputCol=”lang”, outputCol=”lang_idx”) with pl-PL present in train(“lang”) and not present in test(“lang”). You can correct it using setHandleInvalid with skip: from pyspark.ml.feature import StringIndexer train = sc.parallelize([(1, “foo”), (2, “bar”)]).toDF([“k”, “v”]) test = sc.parallelize([(3, … Read more

How to add a SparkListener from pySpark in Python?

It is possible although it is a bit involved. We can use Py4j callback mechanism to pass message from a SparkListener. First lets create a Scala package with all required classes. Directory structure: . ├── build.sbt └── src └── main └── scala └── net └── zero323 └── spark └── examples └── listener ├── Listener.scala ├── … Read more

What is the concept of application, job, stage and task in spark?

The main function is the application. When you invoke an action on an RDD, a “job” is created. Jobs are work submitted to Spark. Jobs are divided into “stages” based on the shuffle boundary. This can help you understand. Each stage is further divided into tasks based on the number of partitions in the RDD. … Read more

Spark: disk I/O on stage boundaries explanation

It’s a good question in that we hear of in-memory Spark vs. Hadoop, so a little confusing. The docs are terrible, but I ran a few things and verified observations by looking around to find a most excellent source: http://hydronitrogen.com/apache-spark-shuffles-explained-in-depth.html Assuming an Action has been called – so as to avoid the obvious comment if … Read more

Why is the fold action necessary in Spark?

Empty RDD It cannot be substituted when RDD is empty: val rdd = sc.emptyRDD[Int] rdd.reduce(_ + _) // java.lang.UnsupportedOperationException: empty collection at // org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$ … rdd.fold(0)(_ + _) // Int = 0 You can of course combine reduce with condition on isEmpty but it is rather ugly. Mutable buffer Another use case for fold is … Read more

Spark SQL referencing attributes of UDT

You get this errors because schema defined by sqlType is never exposed and is not intended to be accessed directly. It simply provides a way to express a complex data types using native Spark SQL types. You can access individual attributes using UDFs but first lets show that the internal structure is indeed not exposed: … Read more