It depends on the type of your “list”:
-
If it is of type
ArrayType()
:df = hc.createDataFrame(sc.parallelize([['a', [1,2,3]], ['b', [2,3,4]]]), ["key", "value"]) df.printSchema() df.show() root |-- key: string (nullable = true) |-- value: array (nullable = true) | |-- element: long (containsNull = true)
you can access the values like you would with python using
[]
:df.select("key", df.value[0], df.value[1], df.value[2]).show() +---+--------+--------+--------+ |key|value[0]|value[1]|value[2]| +---+--------+--------+--------+ | a| 1| 2| 3| | b| 2| 3| 4| +---+--------+--------+--------+ +---+-------+ |key| value| +---+-------+ | a|[1,2,3]| | b|[2,3,4]| +---+-------+
-
If it is of type
StructType()
: (maybe you built your dataframe by reading a JSON)df2 = df.select("key", psf.struct( df.value[0].alias("value1"), df.value[1].alias("value2"), df.value[2].alias("value3") ).alias("value")) df2.printSchema() df2.show() root |-- key: string (nullable = true) |-- value: struct (nullable = false) | |-- value1: long (nullable = true) | |-- value2: long (nullable = true) | |-- value3: long (nullable = true) +---+-------+ |key| value| +---+-------+ | a|[1,2,3]| | b|[2,3,4]| +---+-------+
you can directly ‘split’ the column using
*
:df2.select('key', 'value.*').show() +---+------+------+------+ |key|value1|value2|value3| +---+------+------+------+ | a| 1| 2| 3| | b| 2| 3| 4| +---+------+------+------+