NoClassDefFoundError org.apache.hadoop.fs.FSDataInputStream when executing spark-shell

The “without Hadoop” in Spark’s build name is misleading: it means the build is not tied to a specific Hadoop distribution, not that it is meant to run without one. The user must tell Spark where to find Hadoop (see https://spark.apache.org/docs/latest/hadoop-provided.html). One clean way to fix this issue is to obtain Hadoop Windows binaries. Ideally …
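As the linked hadoop-provided page shows, a “without Hadoop” build is pointed at an existing Hadoop installation through SPARK_DIST_CLASSPATH. A minimal conf/spark-env.sh sketch, assuming the hadoop launcher is already on PATH:

# conf/spark-env.sh -- reuse the classpath of an existing Hadoop install
export SPARK_DIST_CLASSPATH=$(hadoop classpath)

On Windows you would typically also need HADOOP_HOME pointing at the directory that contains winutils.exe, which is what the “Hadoop Windows binaries” above refer to.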

Fill nulls with the previously known good value in pyspark

This uses last and ignores nulls. Let’s re-create something similar to the original data:

import sys
from pyspark.sql.window import Window
import pyspark.sql.functions as func

d = [{'session': 1, 'ts': 1}, {'session': 1, 'ts': 2, 'id': 109},
     {'session': 1, 'ts': 3}, {'session': 1, 'ts': 4, 'id': 110},
     {'session': 1, 'ts': 5}, {'session': 1, 'ts': 6}]
df …
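A self-contained sketch of where the truncated snippet is headed, reusing its names where the excerpt shows them; the exact window bounds are my assumption:

import pyspark.sql.functions as func
from pyspark.sql import SparkSession
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 1, None), (1, 2, 109), (1, 3, None),
     (1, 4, 110), (1, 5, None), (1, 6, None)],
    ['session', 'ts', 'id'])

# Look at all rows from the start of the session up to the current row and
# take the most recent non-null id.
w = (Window.partitionBy('session').orderBy('ts')
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))

df.withColumn('id', func.last('id', ignorenulls=True).over(w)) \
  .orderBy('ts').show()

Rows with ts 3, 5 and 6 end up carrying ids 109, 110 and 110 respectively.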

Explode array data into rows in Spark [duplicate]

The explode function should get that done. pyspark version:

>>> df = spark.createDataFrame([(1, "A", [1,2,3]), (2, "B", [3,5])], ["col1", "col2", "col3"])
>>> from pyspark.sql.functions import explode
>>> df.withColumn("col3", explode(df.col3)).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   A|   1|
|   1|   A|   2|
|   1|   A|   3|
|   2|   B|   3|
|   2|   B|   5|
+----+----+----+

Scala version:

scala> …
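One detail the excerpt doesn’t reach: explode drops rows whose array is null or empty. If those rows must survive, explode_outer (Spark 2.2+) keeps them with a null in place of the element. A small sketch with made-up data:

from pyspark.sql.functions import explode_outer

df2 = spark.createDataFrame([(1, "A", [1, 2]), (2, "B", None)],
                            ["col1", "col2", "col3"])

# explode would silently drop row 2; explode_outer keeps it with col3 = null
df2.withColumn("col3", explode_outer(df2.col3)).show()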

How does createOrReplaceTempView work in Spark?

createOrReplaceTempView creates (or replaces, if a view with that name already exists) a lazily evaluated “view” that you can then use like a Hive table in Spark SQL. It does not persist to memory unless you cache the dataset that underpins the view.

scala> val s = Seq(1,2,3).toDF("num")
s: org.apache.spark.sql.DataFrame = [num: int]

scala> s.createOrReplaceTempView("nums")

scala> spark.table("nums")
…
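The same behaviour in pyspark, as a short sketch: the view is only a name bound to a lazy query plan, and nothing is kept in memory until you both cache the underlying DataFrame and run an action on it.

df = spark.createDataFrame([(1,), (2,), (3,)], ["num"])
df.createOrReplaceTempView("nums")

spark.sql("SELECT SUM(num) AS total FROM nums").show()  # recomputed from source each time
df.cache()   # mark the underlying dataset for caching (lazy by itself)
df.count()   # first action: this is when the data actually lands in memory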

How to loop through each row of a DataFrame in pyspark

You simply cannot. DataFrames, like other distributed data structures, are not iterable and can be accessed only through dedicated higher-order functions and/or SQL methods. You can of course collect:

for row in df.rdd.collect():
    do_something(row)

or convert to a local iterator:

for row in df.rdd.toLocalIterator():
    do_something(row)

and iterate locally as shown above, but it beats …
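For completeness, a sketch of the idiomatic route the answer is pointing at: push the per-row work into Spark rather than iterating on the driver. The process() function and the column name "num" are placeholders for illustration:

from pyspark.sql import functions as F

def process(row):
    pass  # runs on the executors, once per Row, in parallel

df.foreach(process)                                    # per-row side effects
doubled = df.withColumn("doubled", F.col("num") * 2)   # column-wise transform, stays distributed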