Spark Equivalent of IF Then ELSE

Correct structure is either:

(when(col("iris_class") == 'Iris-setosa', 0)
    .when(col("iris_class") == 'Iris-versicolor', 1)
    .otherwise(2))

which is equivalent to

CASE
    WHEN (iris_class = "Iris-setosa") THEN 0
    WHEN (iris_class = "Iris-versicolor") THEN 1
    ELSE 2
END

or:

(when(col("iris_class") == 'Iris-setosa', 0)
    .otherwise(when(col("iris_class") == 'Iris-versicolor', 1)
    .otherwise(2)))

which is equivalent to:

CASE WHEN (iris_class = "Iris-setosa") THEN 0
    ELSE CASE WHEN (iris_class = "Iris-versicolor") THEN 1
    ELSE … Read more
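A minimal runnable sketch of the chained form, assuming a small DataFrame with an iris_class column (the sample rows and the label column name are illustrative, not from the excerpt):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data with the iris_class column used above
df = spark.createDataFrame(
    [("Iris-setosa",), ("Iris-versicolor",), ("Iris-virginica",)],
    ["iris_class"],
)

# Chained when/otherwise is evaluated top to bottom, like CASE WHEN ... END
df.withColumn(
    "label",
    when(col("iris_class") == "Iris-setosa", 0)
    .when(col("iris_class") == "Iris-versicolor", 1)
    .otherwise(2),
).show()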

Using a column value as a parameter to a Spark DataFrame function

One option is to use pyspark.sql.functions.expr, which allows you to use column values as inputs to Spark SQL functions. Based on @user8371915's comment I have found that the following works:

from pyspark.sql.functions import expr

df.select(
    '*',
    expr('posexplode(split(repeat(",", rpt), ","))').alias("index", "col")
).where('index > 0').drop("col").sort('letter', 'index').show()
#+------+---+-----+
#|letter|rpt|index|
#+------+---+-----+
#|     X|  3|    1|
#|     X|  3|    2|
#| … Read more
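A self-contained sketch under the same assumptions (a DataFrame with letter and rpt columns, as in the excerpt), showing how expr lets the rpt column's value drive how many rows are generated:

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: repeat each letter rpt times
df = spark.createDataFrame([("X", 3), ("Y", 1), ("Z", 2)], ["letter", "rpt"])

# repeat(",", rpt) builds a string of rpt commas; splitting it yields rpt+1
# elements, and posexplode turns them into rows with a positional index,
# so keeping index > 0 leaves exactly rpt copies of each row
result = (
    df.select("*", expr('posexplode(split(repeat(",", rpt), ","))').alias("index", "col"))
      .where("index > 0")
      .drop("col")
)
result.sort("letter", "index").show()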

Convert pyspark string to date format

Update (1/10/2018): For Spark 2.2+ the best way to do this is probably using the to_date or to_timestamp functions, which both support the format argument. From the docs:

>>> from pyspark.sql.functions import to_timestamp
>>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
>>> df.select(to_timestamp(df.t, 'yyyy-MM-dd HH:mm:ss').alias('dt')).collect()
[Row(dt=datetime.datetime(1997, 2, 28, 10, 30))]

Original Answer (for Spark < 2.2)

It … Read more
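A short sketch of the same idea with to_date, assuming a string column named date_str in MM/dd/yyyy form (the column name, sample value, and format pattern are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

spark = SparkSession.builder.getOrCreate()

# Hypothetical string dates to be parsed with an explicit format pattern
df = spark.createDataFrame([("06/25/2018",), ("not a date",)], ["date_str"])

# Values that do not match the pattern come back as null
df.select(to_date("date_str", "MM/dd/yyyy").alias("d")).show()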

Spark functions vs UDF performance?

when would a udf be faster

If you ask about Python UDFs, the answer is probably never*. Since SQL functions are relatively simple and are not designed for complex tasks, it is pretty much impossible to compensate for the cost of repeated serialization, deserialization and data movement between the Python interpreter and the JVM.

Does anyone know why this … Read more
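A minimal illustration of the trade-off: the same uppercase transformation written once with the built-in upper function (which stays inside the JVM) and once as a Python UDF (where every value crosses the Python/JVM boundary). The column name and data are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import upper, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Built-in SQL function: evaluated entirely inside the JVM
df.select(upper("name").alias("upper_builtin")).show()

# Equivalent Python UDF: each value is serialized to a Python worker and back
upper_udf = udf(lambda s: s.upper() if s is not None else None, StringType())
df.select(upper_udf("name").alias("upper_udf")).show()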

How to pivot Spark DataFrame?

As mentioned by David Anderson, Spark provides a pivot function since version 1.6. General syntax looks as follows:

df
    .groupBy(grouping_columns)
    .pivot(pivot_column, [values])
    .agg(aggregate_expressions)

Usage examples using nycflights13 and csv format:

Python:

from pyspark.sql.functions import avg

flights = (sqlContext
    .read
    .format("csv")
    .options(inferSchema="true", header="true")
    .load("flights.csv")
    .na.drop())

flights.registerTempTable("flights")
sqlContext.cacheTable("flights")

gexprs = ("origin", "dest", "carrier")
aggexpr = avg("arr_delay")

flights.count()
## … Read more
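A self-contained sketch of the pivot call itself, using a small made-up flights-like DataFrame (the rows and values are illustrative, not taken from nycflights13):

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: (origin, carrier, arr_delay)
flights = spark.createDataFrame(
    [("JFK", "AA", 10.0), ("JFK", "DL", 5.0), ("LGA", "AA", 2.0), ("LGA", "DL", 7.0)],
    ["origin", "carrier", "arr_delay"],
)

# One row per origin, one column per carrier, mean arrival delay in the cells.
# Passing the pivot values explicitly avoids an extra pass to discover them.
(flights
    .groupBy("origin")
    .pivot("carrier", ["AA", "DL"])
    .agg(avg("arr_delay"))
    .show())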

Finding duplicates from large data set using Apache Spark

Load the data into Spark and apply a group by on the email column. After that, examine each group (the bag of records sharing an email) and apply any distance algorithm on the first name and last name columns. This should be pretty straightforward in Spark:

val df = sc.textFile("hdfs path of data");
df.mapToPair("email", <whole_record>)
    .groupBy(//will be done based on key)
    .map(//will … Read more
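A hedged DataFrame sketch of the same idea in PySpark (the column names, the sample records, and the choice of a string distance such as Levenshtein for comparing names inside each group are assumptions, not from the excerpt):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, collect_list, size

spark = SparkSession.builder.getOrCreate()

# Hypothetical records: (first_name, last_name, email)
people = spark.createDataFrame(
    [("Jon", "Smith", "a@x.com"), ("John", "Smith", "a@x.com"), ("Ann", "Lee", "b@x.com")],
    ["first_name", "last_name", "email"],
)

# Group by the email key and collect the names that share it (the "bag");
# groups with more than one entry are duplicate candidates, and a string
# distance (e.g. Levenshtein) can then be applied to the names inside each bag.
(people
    .groupBy("email")
    .agg(collect_list("first_name").alias("first_names"),
         collect_list("last_name").alias("last_names"))
    .where(size(col("first_names")) > 1)
    .show(truncate=False))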