Multiple conditions for filter in Spark data frames
Instead of: df2 = df1.filter("Status=2" || "Status=3")
Try: df2 = df1.filter($"Status" === 2 || $"Status" === 3)
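In PySpark, a roughly equivalent filter can be written with column expressions; a minimal sketch (the DataFrame and column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Illustrative data; the "Status" column mirrors the example above
df1 = spark.createDataFrame([(1, 2), (2, 3), (3, 4)], ["id", "Status"])

# Combine conditions with | (or) and & (and); each condition needs its own parentheses
df2 = df1.filter((col("Status") == 2) | (col("Status") == 3))
df2.show()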
The “without Hadoop” in Spark’s build name is misleading: it means the build is not tied to a specific Hadoop distribution, not that it is meant to run without Hadoop; the user still has to indicate where to find Hadoop (see https://spark.apache.org/docs/latest/hadoop-provided.html). One clean way to fix this issue is to: Obtain Hadoop Windows binaries. Ideally … Read more
This uses last and ignores nulls. Let’s re-create something similar to the original data:

import sys
from pyspark.sql.window import Window
import pyspark.sql.functions as func

d = [{'session': 1, 'ts': 1}, {'session': 1, 'ts': 2, 'id': 109},
     {'session': 1, 'ts': 3}, {'session': 1, 'ts': 4, 'id': 110},
     {'session': 1, 'ts': 5}, {'session': 1, 'ts': 6}]
df … Read more
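A hedged sketch of how the fill itself might look, assuming a df built from the data above (the output column name is illustrative):

# Window covering all rows from the start of the session up to the current row;
# on recent Spark versions, Window.unboundedPreceding can replace -sys.maxsize
window = Window.partitionBy('session').orderBy('ts').rowsBetween(-sys.maxsize, 0)

# last(..., ignorenulls=True) keeps the most recent non-null id seen so far
filled = func.last(df['id'], ignorenulls=True).over(window)

df.withColumn('id_filled', filled).show()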
The explode function should get that done. pyspark version:

>>> df = spark.createDataFrame([(1, "A", [1,2,3]), (2, "B", [3,5])], ["col1", "col2", "col3"])
>>> from pyspark.sql.functions import explode
>>> df.withColumn("col3", explode(df.col3)).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   A|   1|
|   1|   A|   2|
|   1|   A|   3|
|   2|   B|   3|
|   2|   B|   5|
+----+----+----+

Scala version scala> … Read more
createOrReplaceTempView creates (or replaces if that view name already exists) a lazily evaluated “view” that you can then use like a Hive table in Spark SQL. It does not persist to memory unless you cache the dataset that underpins the view.

scala> val s = Seq(1,2,3).toDF("num")
s: org.apache.spark.sql.DataFrame = [num: int]

scala> s.createOrReplaceTempView("nums")

scala> spark.table("nums") … Read more
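For comparison, a hedged PySpark sketch of the same flow, including the explicit cache step the answer alludes to (names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

s = spark.createDataFrame([(1,), (2,), (3,)], ["num"])
s.createOrReplaceTempView("nums")                      # lazily evaluated view, usable from SQL

spark.sql("SELECT num FROM nums WHERE num > 1").show()

s.cache()   # only after the next action will the data backing the view be kept in memory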
You simply cannot. DataFrames, same as other distributed data structures, are not iterable and can be accessed only through dedicated higher-order functions and/or SQL methods. You can of course collect

for row in df.rdd.collect():
    do_something(row)

or convert toLocalIterator

for row in df.rdd.toLocalIterator():
    do_something(row)

and iterate locally as shown above, but it beats … Read more
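A small sketch of the distributed-friendly alternative the answer hints at, assuming df already exists and do_something is a placeholder:

def do_something(row):
    print(row)

# Preferred: let Spark run the function on the executors instead of looping on the driver
df.rdd.foreach(do_something)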
You can do it the same way SQLContext.createDataFrame does it:

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

val schema = ScalaReflection.schemaFor[TestCase].dataType.asInstanceOf[StructType]
When you create the SparkContext, each worker starts an executor. This is a separate process (JVM), and it loads your jar too. The executors connect back to your driver program. Now the driver can send them commands, like flatMap, map and reduceByKey in your example. When the driver quits, the executors shut down. RDDs are … Read more
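A minimal PySpark sketch of the kind of job being described; the data and operations are purely illustrative:

from pyspark import SparkContext

sc = SparkContext(appName="example")   # each worker starts an executor that loads your code

lines = sc.parallelize(["a b", "b c", "a a"])

# The driver sends these transformations to the executors
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())   # results come back to the driver

sc.stop()                 # when the driver quits, the executors shut down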
By default Hive(Context) uses embedded Derby as a metastore. It is intended mostly for testing and supports only one active user. If you want to support multiple running applications you should configure a standalone metastore. At this moment Hive supports PostgreSQL, MySQL, Oracle and MS SQL Server. Details of configuration depend on a backend and option … Read more
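A hedged sketch of pointing a Spark application at an external metastore service; the URI is hypothetical and the exact setup depends on the chosen backend and deployment mode:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("shared-metastore")
         # hypothetical thrift endpoint of a standalone Hive metastore service
         .config("hive.metastore.uris", "thrift://metastore-host:9083")
         .enableHiveSupport()
         .getOrCreate())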
TL;DR 1) and 2) can usually be avoided but shouldn’t harm you (ignoring the cost of evaluation); 3) is typically a harmful cargo-cult programming practice. Without cache, calling count alone is mostly wasteful. While not always straightforward, logging can be replaced with information retrieved from listeners (here is an example for RDDs), and control … Read more
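A small sketch of why the trailing count after cache is usually redundant; df and the column name are illustrative:

df.cache()          # lazy: nothing is computed or stored yet

# df.count()        # the "cargo cult" step: a full extra pass just to warm the cache

result = df.filter(df["Status"] == 2).count()   # the first real action populates the cache anyway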