Spark Equivalent of IF Then ELSE

Correct structure is either:

(when(col("iris_class") == 'Iris-setosa', 0)
    .when(col("iris_class") == 'Iris-versicolor', 1)
    .otherwise(2))

which is equivalent to

CASE
    WHEN (iris_class = "Iris-setosa") THEN 0
    WHEN (iris_class = "Iris-versicolor") THEN 1
    ELSE 2
END

or:

(when(col("iris_class") == 'Iris-setosa', 0)
    .otherwise(when(col("iris_class") == 'Iris-versicolor', 1)
    .otherwise(2)))

which is equivalent to:

CASE WHEN (iris_class = "Iris-setosa") THEN 0
    ELSE CASE WHEN (iris_class = "Iris-versicolor") THEN 1
    ELSE … Read more
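A minimal runnable sketch of the chained form, assuming a small DataFrame with an iris_class column (the sample rows and the label column name are illustrative, not from the excerpt):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data with the iris_class column used above
df = spark.createDataFrame(
    [("Iris-setosa",), ("Iris-versicolor",), ("Iris-virginica",)],
    ["iris_class"],
)

# Chained when/otherwise is evaluated top to bottom, like CASE WHEN ... END
df.withColumn(
    "label",
    when(col("iris_class") == "Iris-setosa", 0)
    .when(col("iris_class") == "Iris-versicolor", 1)
    .otherwise(2),
).show()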

Using a column value as a parameter to a Spark DataFrame function

One option is to use pyspark.sql.functions.expr, which allows you to use column values as inputs to Spark SQL functions. Based on @user8371915's comment I have found that the following works:

from pyspark.sql.functions import expr

df.select(
    '*',
    expr('posexplode(split(repeat(",", rpt), ","))').alias("index", "col")
).where('index > 0').drop("col").sort('letter', 'index').show()
#+------+---+-----+
#|letter|rpt|index|
#+------+---+-----+
#|     X|  3|    1|
#|     X|  3|    2|
#| … Read more
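A self-contained sketch under the same assumptions (a DataFrame with letter and rpt columns, as in the excerpt), showing how expr lets the rpt column's value drive how many rows are generated:

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: repeat each letter rpt times
df = spark.createDataFrame([("X", 3), ("Y", 1), ("Z", 2)], ["letter", "rpt"])

# repeat(",", rpt) builds a string of rpt commas; splitting it yields rpt+1
# elements, and posexplode turns them into rows with a positional index,
# so keeping index > 0 leaves exactly rpt copies of each row
result = (
    df.select("*", expr('posexplode(split(repeat(",", rpt), ","))').alias("index", "col"))
      .where("index > 0")
      .drop("col")
)
result.sort("letter", "index").show()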

Convert pyspark string to date format

Update (1/10/2018): For Spark 2.2+ the best way to do this is probably using the to_date or to_timestamp functions, which both support the format argument. From the docs:

>>> from pyspark.sql.functions import to_timestamp
>>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
>>> df.select(to_timestamp(df.t, 'yyyy-MM-dd HH:mm:ss').alias('dt')).collect()
[Row(dt=datetime.datetime(1997, 2, 28, 10, 30))]

Original Answer (for Spark < 2.2)

It … Read more
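A short sketch of the same idea with to_date, assuming a string column named date_str in MM/dd/yyyy form (the column name, sample value, and format pattern are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

spark = SparkSession.builder.getOrCreate()

# Hypothetical string dates to be parsed with an explicit format pattern
df = spark.createDataFrame([("06/25/2018",), ("not a date",)], ["date_str"])

# Values that do not match the pattern come back as null
df.select(to_date("date_str", "MM/dd/yyyy").alias("d")).show()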

Spark functions vs UDF performance?

when would a udf be faster

If you ask about Python UDFs, the answer is probably never*. Since SQL functions are relatively simple and are not designed for complex tasks, it is pretty much impossible to compensate for the cost of repeated serialization, deserialization and data movement between the Python interpreter and the JVM.

Does anyone know why this … Read more
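A minimal illustration of the trade-off: the same uppercase transformation written once with the built-in upper function (which stays inside the JVM) and once as a Python UDF (where every value crosses the Python/JVM boundary). The column name and data are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import upper, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Built-in SQL function: evaluated entirely inside the JVM
df.select(upper("name").alias("upper_builtin")).show()

# Equivalent Python UDF: each value is serialized to a Python worker and back
upper_udf = udf(lambda s: s.upper() if s is not None else None, StringType())
df.select(upper_udf("name").alias("upper_udf")).show()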

How to pivot Spark DataFrame?

As mentioned by David Anderson, Spark provides a pivot function since version 1.6. General syntax looks as follows:

df
    .groupBy(grouping_columns)
    .pivot(pivot_column, [values])
    .agg(aggregate_expressions)

Usage examples using nycflights13 and csv format:

Python:

from pyspark.sql.functions import avg

flights = (sqlContext
    .read
    .format("csv")
    .options(inferSchema="true", header="true")
    .load("flights.csv")
    .na.drop())

flights.registerTempTable("flights")
sqlContext.cacheTable("flights")

gexprs = ("origin", "dest", "carrier")
aggexpr = avg("arr_delay")

flights.count()
## … Read more
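A self-contained sketch of the pivot call itself, using a small made-up flights-like DataFrame (the rows and values are illustrative, not taken from nycflights13):

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: (origin, carrier, arr_delay)
flights = spark.createDataFrame(
    [("JFK", "AA", 10.0), ("JFK", "DL", 5.0), ("LGA", "AA", 2.0), ("LGA", "DL", 7.0)],
    ["origin", "carrier", "arr_delay"],
)

# One row per origin, one column per carrier, mean arrival delay in the cells.
# Passing the pivot values explicitly avoids an extra pass to discover them.
(flights
    .groupBy("origin")
    .pivot("carrier", ["AA", "DL"])
    .agg(avg("arr_delay"))
    .show())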

Finding duplicates from large data set using Apache Spark

Load the data into Spark and apply a group by on the email column. After that, examine each group (the bag of records sharing an email) and apply any distance algorithm on the first name and last name columns. This should be pretty straightforward in Spark:

val df = sc.textFile("hdfs path of data");
df.mapToPair("email", <whole_record>)
    .groupBy(//will be done based on key)
    .map(//will … Read more
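A hedged DataFrame sketch of the same idea in PySpark (the column names, the sample records, and the choice of a string distance such as Levenshtein for comparing names inside each group are assumptions, not from the excerpt):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, collect_list, size

spark = SparkSession.builder.getOrCreate()

# Hypothetical records: (first_name, last_name, email)
people = spark.createDataFrame(
    [("Jon", "Smith", "a@x.com"), ("John", "Smith", "a@x.com"), ("Ann", "Lee", "b@x.com")],
    ["first_name", "last_name", "email"],
)

# Group by the email key and collect the names that share it (the "bag");
# groups with more than one entry are duplicate candidates, and a string
# distance (e.g. Levenshtein) can then be applied to the names inside each bag.
(people
    .groupBy("email")
    .agg(collect_list("first_name").alias("first_names"),
         collect_list("last_name").alias("last_names"))
    .where(size(col("first_names")) > 1)
    .show(truncate=False))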