Split Spark Dataframe string column into multiple columns

pyspark.sql.functions.split() is the right approach here – you simply need to flatten the nested ArrayType column into multiple top-level columns. In this case, where each array only contains 2 items, it’s very easy. You simply use Column.getItem() to retrieve each part of the array as a column itself:

split_col = pyspark.sql.functions.split(df['my_str_col'], '-')
df = df.withColumn('NAME1', split_col.getItem(0))
df = df.withColumn('NAME2', split_col.getItem(1))

The result will be:

col1 | my_str_col | NAME1 | NAME2
-----+------------+-------+------
  18 |  856-yygrm |   856 | yygrm
 201 |  777-psgdg |   777 | psgdg

I am not sure how I would solve this in a general case where the nested arrays were not the same size from Row to Row.

Leave a Comment