Calculate Cosine Similarity Spark Dataframe

DataFrame.rdd returns RDD[Row] not RDD[(T, U)]. You have to pattern match the Row or directly extract interesting parts.
ml Vector used with Datasets since Spark 2.0 is not the same as mllib Vector use by old API. You have to convert it to use with IndexedRowMatrix.
Index has to be Long not string.

import org.apache.spark.sql.Row

val irm = new IndexedRowMatrix(inClusters.rdd.map {
  Row(_, v: org.apache.spark.ml.linalg.Vector) => 
    org.apache.spark.mllib.linalg.Vectors.fromML(v)
}.zipWithIndex.map { case (v, i) => IndexedRow(i, v) })

More Related Contents:

Leave a Comment Cancel reply