PySpark: mapping multiple columns

From my understanding, you can create a map based on columns from reference_df (I assumed this is not a very big dataframe):

    map_key = concat_ws('\0', PrimaryLookupAttributeName, PrimaryLookupAttributeValue)
    map_value = OutputItemNameByValue

and then use this mapping to get the corresponding values in df1:

    from itertools import chain
    from pyspark.sql.functions import collect_set, array, concat_ws, lit, col, create_map

… Read more
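The excerpt cuts off before the mapping is actually applied, so here is a minimal sketch of how the pattern could be completed, assuming reference_df is small enough to collect to the driver; the df1 column names AttributeName and AttributeValue are hypothetical stand-ins for whatever columns you match on:

    from itertools import chain
    from pyspark.sql.functions import concat_ws, create_map, lit, col

    # Collect the (small) reference dataframe and build composite keys
    # using the same '\0' separator as above.
    rows = reference_df.select(
        concat_ws('\0', 'PrimaryLookupAttributeName',
                  'PrimaryLookupAttributeValue').alias('k'),
        col('OutputItemNameByValue').alias('v')
    ).collect()

    # create_map takes a flat, alternating sequence of key/value columns,
    # so interleave literal keys and values.
    lookup = create_map(*chain.from_iterable((lit(r.k), lit(r.v)) for r in rows))

    # Probe the map with the matching composite key built from df1's columns
    # (AttributeName / AttributeValue are assumed names).
    result = df1.withColumn(
        'OutputItemNameByValue',
        lookup[concat_ws('\0', col('AttributeName'), col('AttributeValue'))]
    )

Keys with no match come back as null, which you can handle afterwards with coalesce or fillna.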

How to perform union on two DataFrames with different numbers of columns in Spark?

In Scala you just have to append all missing columns as nulls:

    import org.apache.spark.sql.functions._

    // let df1 and df2 be the DataFrames to merge
    val df1 = sc.parallelize(List(
      (50, 2),
      (34, 4)
    )).toDF("age", "children")

    val df2 = sc.parallelize(List(
      (26, true, 60000.00),
      (32, false, 35000.00)
    )).toDF("age", "education", "income")

    val cols1 = df1.columns.toSet
    val cols2 = df2.columns.toSet

val … Read more
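Since the Scala snippet is also truncated, here is a PySpark sketch of the same idea under the same assumptions (the example frames mirror the ones above; unionByName requires Spark 2.3+): pad each side with typed nulls for the columns it lacks, then union by name:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    spark = SparkSession.builder.getOrCreate()

    df1 = spark.createDataFrame([(50, 2), (34, 4)], ["age", "children"])
    df2 = spark.createDataFrame(
        [(26, True, 60000.00), (32, False, 35000.00)],
        ["age", "education", "income"])

    def pad_missing(df, other):
        # Append every column the other frame has but df lacks,
        # as nulls cast to the other frame's type so the schemas align.
        for c in set(other.columns) - set(df.columns):
            df = df.withColumn(c, lit(None).cast(other.schema[c].dataType))
        return df

    # unionByName matches columns by name, so column order no longer matters.
    merged = pad_missing(df1, df2).unionByName(pad_missing(df2, df1))
    merged.show()

On Spark 3.1+ you can skip the manual padding entirely by passing allowMissingColumns=True to unionByName.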