Let's start with some dummy data:
```scala
val transactions = Seq((1, 2), (1, 4), (2, 3)).toDF("user_id", "category_id")

val transactions_with_counts = transactions
  .groupBy($"user_id", $"category_id")
  .count
```
```scala
transactions_with_counts.printSchema

// root
// |-- user_id: integer (nullable = false)
// |-- category_id: integer (nullable = false)
// |-- count: long (nullable = false)
```
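The examples below construct a `Rating` from each row. `Rating` isn't defined in this snippet; `org.apache.spark.mllib.recommendation.Rating` would work (its `Double` rating field accepts a widened `Long`), or you can define a minimal stand-in — a hypothetical sketch:

```scala
// Hypothetical stand-in for the Rating type used in the examples below.
// mllib's Rating(user: Int, product: Int, rating: Double) is an alternative.
case class Rating(user_id: Int, category_id: Int, rating: Long)
```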
There are a few ways to access `Row` values and keep the expected types:
- Pattern matching:

  ```scala
  import org.apache.spark.sql.Row

  transactions_with_counts.map {
    case Row(user_id: Int, category_id: Int, rating: Long) =>
      Rating(user_id, category_id, rating)
  }
  ```
- Typed `get*` methods like `getInt`, `getLong`:

  ```scala
  transactions_with_counts.map(
    r => Rating(r.getInt(0), r.getInt(1), r.getLong(2))
  )
  ```
- The `getAs` method, which can use both names and indices:

  ```scala
  transactions_with_counts.map(r => Rating(
    r.getAs[Int]("user_id"), r.getAs[Int]("category_id"), r.getAs[Long](2)
  ))
  ```

  It can be used to properly extract user-defined types, including `mllib.linalg.Vector`. Obviously, accessing by name requires a schema.
- Converting to a statically typed `Dataset` (Spark 1.6+ / 2.0+):

  ```scala
  transactions_with_counts.as[(Int, Int, Long)]
  ```
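Instead of a tuple, `.as` can also target a case class whose field names and types match the columns, which keeps the fields named downstream. A sketch with a hypothetical `TransactionCount` class (case classes used with encoders should be defined at the top level, outside the enclosing method):

```scala
// Hypothetical case class; field names and types must match the
// DataFrame columns (user_id, category_id, count).
case class TransactionCount(user_id: Int, category_id: Int, count: Long)

transactions_with_counts.as[TransactionCount]
```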