Spark Scala: How to convert Dataframe[vector] to DataFrame[f1:Double, …, fn: Double)]

Spark >= 3.0.0

Since Spark 3.0 you can use vector_to_array

import org.apache.spark.ml.functions.vector_to_array

testDF.select(vector_to_array($"scaledFeatures").alias("_tmp")).select(exprs:_*)

Spark < 3.0.0

One possible approach is something similar to this

import org.apache.spark.sql.functions.udf

// In Spark 1.x you'll will have to replace ML Vector with MLLib one
// import org.apache.spark.mllib.linalg.Vector
// In 2.x the below is usually the right choice
import org.apache.spark.ml.linalg.Vector

// Get size of the vector
val n = testDF.first.getAs[Vector](0).size

// Simple helper to convert vector to array<double> 
// asNondeterministic is available in Spark 2.3 or befor
// It can be removed, but at the cost of decreased performance
val vecToSeq = udf((v: Vector) => v.toArray).asNondeterministic

// Prepare a list of columns to create
val exprs = (0 until n).map(i => $"_tmp".getItem(i).alias(s"f$i"))

testDF.select(vecToSeq($"scaledFeatures").alias("_tmp")).select(exprs:_*)

If you know a list of columns upfront you can simplify this a little:

val cols: Seq[String] = ???
val exprs = cols.zipWithIndex.map{ case (c, i) => $"_tmp".getItem(i).alias(c) }

For Python equivalent see How to split Vector into columns – using PySpark.

Leave a Comment