Stratified sampling with PySpark

The solution I suggested in Stratified sampling in Spark is pretty straightforward to convert from Scala to Python (or even to Java – What's the easiest way to stratify a Spark Dataset?). Nevertheless, I'll rewrite it in Python. Let's start by creating a toy DataFrame:

from pyspark.sql.functions import lit
list = [(2147481832, 23355149, 1), (2147481832, 973010692, 1), (2147481832, 2134870842, 1), (2147481832, 541023347, 1), (2147481832, 1682206630, 1), (2147481832, 1138211459, 1), (2147481832, 852202566, 1), (2147481832, 201375938, 1), (2147481832, 486538879, 1), (2147481832, 919187908, 1), (214748183, 919187908, 1), (214748183, 91187908, 1)]
df … Read more
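In PySpark the usual tool for this is `df.sampleBy(col, fractions, seed)`, which keeps each row with the probability assigned to its stratum. Since a running Spark session isn't assumed here, the same per-stratum fraction logic can be sketched in plain Python (`sample_by`, `rows`, and the fraction values are illustrative, not from the original answer):

```python
import random

def sample_by(rows, key_index, fractions, seed=42):
    # Approximate stratified sampling: keep each row with the
    # probability assigned to its stratum, like DataFrame.sampleBy.
    rng = random.Random(seed)
    return [row for row in rows
            if rng.random() < fractions.get(row[key_index], 0.0)]

rows = [(2147481832, 23355149, 1), (2147481832, 973010692, 1),
        (2147481832, 2134870842, 1), (2147481832, 541023347, 1),
        (214748183, 919187908, 1), (214748183, 91187908, 1)]

# Keep roughly half of the first stratum and all of the second.
sample = sample_by(rows, 0, {2147481832: 0.5, 214748183: 1.0})
```

As with `sampleBy`, the result is approximate per stratum: each row is an independent Bernoulli trial, so the kept fraction only converges to the requested one for large strata.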

Does a join of co-partitioned RDDs cause a shuffle in Apache Spark?

No. If two RDDs have the same partitioner, the join will not cause a shuffle. You can see this in CoGroupedRDD.scala:

override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_ <: Product2[K, _]] =>
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency … Read more
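The reason no data moves can be sketched without Spark: when both sides were partitioned by the same partitioner, matching keys already live in partitions with the same index, so the join only ever combines partition i with partition i. A minimal plain-Python illustration (the helper names are hypothetical):

```python
def hash_partition(pairs, num_partitions):
    # The same "partitioner" applied to both RDD stand-ins:
    # a key's partition index depends only on its hash.
    parts = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        parts[hash(k) % num_partitions].append((k, v))
    return parts

def local_join(left_part, right_part):
    # Join two partitions entirely locally (no cross-partition access).
    right_by_key = {}
    for k, v in right_part:
        right_by_key.setdefault(k, []).append(v)
    return [(k, (lv, rv)) for k, lv in left_part
            for rv in right_by_key.get(k, [])]

NUM = 4
left = hash_partition([("a", 1), ("b", 2), ("c", 3)], NUM)
right = hash_partition([("a", 10), ("b", 20)], NUM)

# Because both sides share the partitioner, the join is purely
# partition-local: partition i of the left only meets partition i
# of the right -- the one-to-one dependency from the Scala snippet.
joined = [pair for i in range(NUM) for pair in local_join(left[i], right[i])]
```

If the partitioners differed, keys for the same value could sit at different partition indexes on the two sides, which is exactly the case where Spark falls back to a shuffle dependency.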

How does Distinct() function work in Spark?

.distinct() is definitely doing a shuffle across partitions. To see more of what's happening, run a .toDebugString on your RDD.

val hashPart = new HashPartitioner(<number of partitions>)
val myRDDPreStep = <load some RDD>
val myRDD = myRDDPreStep.distinct.partitionBy(hashPart).setName("myRDD").persist(StorageLevel.MEMORY_AND_DISK_SER)
myRDD.checkpoint
println(myRDD.toDebugString)

which for an RDD example I have (myRDDPreStep is already hash-partitioned by key, persisted by StorageLevel.MEMORY_AND_DISK_SER, … Read more
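The shuffle is inherent to how distinct works: equal elements may start out in different partitions, so they must first be routed (by hash) to the same partition before local deduplication can see them all. A plain-Python sketch of that two-step shape (the function is illustrative, not Spark's implementation):

```python
def distinct(partitions, num_partitions):
    # Sketch of RDD.distinct: route each element to the partition
    # chosen by its hash (the shuffle), then dedupe locally, since
    # equal elements are now guaranteed to share a partition.
    shuffled = [set() for _ in range(num_partitions)]
    for part in partitions:
        for x in part:
            shuffled[hash(x) % num_partitions].add(x)  # the shuffle step
    return shuffled

# Duplicates spread across input partitions still collapse to one copy.
parts = distinct([[1, 2, 2], [2, 3, 1]], 2)
all_values = sorted(x for p in parts for x in p)
```

This is also why pre-partitioning, as in the snippet above, matters: once the data is hash-partitioned, Spark can sometimes avoid a second repartitioning in later stages.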

How to get Kafka offsets for structured query for manual and reliable offset management?

Spark 2.2 introduced a structured streaming source for Kafka. As I understand it, it relies on the HDFS checkpoint directory to store offsets and guarantee "exactly-once" message delivery. Correct. On every trigger, Spark Structured Streaming saves offsets to the offsets directory in the checkpoint location (defined using the checkpointLocation option or the spark.sql.streaming.checkpointLocation Spark property, or randomly assigned) that is … Read more
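For manual offset management, the files under `<checkpointLocation>/offsets/` can be read directly. Under the assumption (based on Spark 2.x behaviour, worth verifying against your version) that each batch file is a version line followed by JSON lines, one per source, mapping topic to partition to next offset, a parsing sketch looks like this; the file contents and topic name below are made up for illustration:

```python
import json

# Hypothetical contents of <checkpointLocation>/offsets/<batchId>.
# Assumed layout: a version line, a batch-metadata JSON line, then
# one JSON line per source with topic -> partition -> next offset.
offset_file = """v1
{"batchWatermarkMs":0,"batchTimestampMs":1502060730000}
{"my-topic":{"0":23,"1":-1,"2":45}}"""

lines = offset_file.splitlines()
version = lines[0]                     # format version marker
batch_metadata = json.loads(lines[1])  # watermark, trigger timestamp
kafka_offsets = json.loads(lines[2])   # per-topic, per-partition offsets
```

Note that partition ids appear as JSON string keys, and an offset of -1 conventionally means no offset has been committed for that partition yet; reading these files is a workaround, so treat the format as internal and subject to change between Spark versions.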