- DataFrame.rdd returns RDD[Row], not RDD[(T, U)]. You have to pattern match the Row or directly extract the interesting parts.
- The ml Vector used with Datasets since Spark 2.0 is not the same as the mllib Vector used by the old API. You have to convert it before using it with IndexedRowMatrix.
- The index has to be a Long, not a string.
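import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val irm = new IndexedRowMatrix(inClusters.rdd.map {
  // Match the Row, ignore the first column, and convert the ml Vector to an mllib Vector
  case Row(_, v: org.apache.spark.ml.linalg.Vector) =>
    org.apache.spark.mllib.linalg.Vectors.fromML(v)
}.zipWithIndex.map { case (v, i) => IndexedRow(i, v) }) // zipWithIndex supplies the Long index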
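
The first point also mentions extracting the parts directly instead of pattern matching. A minimal sketch of that variant follows, written script-style for spark-shell or any setup with Spark 2.0+ on the classpath. The sample data and the column names label and features are assumptions made for illustration, standing in for the real inClusters.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Reuses the existing session in spark-shell, otherwise starts a local one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("indexed-row-matrix-sketch")
  .getOrCreate()

// Hypothetical stand-in for inClusters: a string column plus an ml Vector column.
val inClusters = spark.createDataFrame(Seq(
  ("a", MLVectors.dense(1.0, 2.0)),
  ("b", MLVectors.dense(3.0, 4.0))
)).toDF("label", "features")

val indexedRows = inClusters.rdd
  // Extract the vector by column name, then convert ml -> mllib.
  .map(row => OldVectors.fromML(row.getAs[MLVector]("features")))
  // zipWithIndex yields a Long index, which is what IndexedRow expects.
  .zipWithIndex
  .map { case (v, i) => IndexedRow(i, v) }

val irm = new IndexedRowMatrix(indexedRows)
```

Accessing the column by name avoids depending on the column order, at the cost of a runtime error if the name is wrong. Note that zipWithIndex assigns indices by RDD order, so the string labels are not carried into the matrix.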