Spark Datasets require Encoders for the data type that is about to be stored. For common types (atomics, product types) a number of predefined encoders are available, but you have to import them first from SparkSession.implicits to make it work:
val sparkSession: SparkSession = ???
import sparkSession.implicits._
val dataset = sparkSession.createDataset(dataList)
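The snippet above leaves sparkSession and dataList undefined, so here is a minimal self-contained sketch of the same approach. The SimpleTuple fields, the sample values, and the local master setting are assumptions made for illustration only:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case class (field names are assumed). It is declared at the
// top level so that Spark can derive a product encoder for it.
case class SimpleTuple(id: Int, desc: String)

object ImplicitEncoderSketch {
  // Assumed local session, for illustration only.
  val sparkSession: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("implicit-encoder-sketch")
    .getOrCreate()

  // Brings the predefined implicit encoders into scope.
  import sparkSession.implicits._

  // Assumed sample data.
  val dataList: Seq[SimpleTuple] = Seq(SimpleTuple(5, "abc"), SimpleTuple(6, "bcd"))

  // The Encoder[SimpleTuple] is resolved implicitly from the import above.
  val dataset: Dataset[SimpleTuple] = sparkSession.createDataset(dataList)
}
```

Without the implicits import, the createDataset call would not compile because no Encoder[SimpleTuple] could be found.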
Alternatively, you can provide an Encoder for the stored type directly, either explicitly:
import org.apache.spark.sql.{Encoder, Encoders}
val dataset = sparkSession.createDataset(dataList)(Encoders.product[SimpleTuple])
or implicitly:
implicit val enc: Encoder[SimpleTuple] = Encoders.product[SimpleTuple]
val dataset = sparkSession.createDataset(dataList)
Note that Encoders also provides a number of predefined Encoders for atomic types, and Encoders for complex ones can be derived with ExpressionEncoder.
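As a rough sketch of both options: the predefined atomic encoders live on the Encoders object, while ExpressionEncoder sits in the catalyst package, which is an internal API and may change between Spark versions. SimpleTuple and its fields are assumed here, as is the local session:

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Hypothetical case class (field names are assumed), declared at the top level.
case class SimpleTuple(id: Int, desc: String)

object EncoderSketch {
  // Assumed local session, for illustration only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("encoder-sketch")
    .getOrCreate()

  // Predefined encoders for atomic types.
  val intEnc: Encoder[Int] = Encoders.scalaInt
  val strEnc: Encoder[String] = Encoders.STRING

  // Encoder for a complex (product) type, derived with ExpressionEncoder.
  val tupleEnc: Encoder[SimpleTuple] = ExpressionEncoder[SimpleTuple]()

  // Encoders passed explicitly, so no implicits import is needed.
  val ints: Dataset[Int] = spark.createDataset(Seq(1, 2, 3))(intEnc)
  val tuples: Dataset[SimpleTuple] =
    spark.createDataset(Seq(SimpleTuple(1, "a")))(tupleEnc)
}
```

For case classes, Encoders.product is the public route and is usually preferable to calling ExpressionEncoder directly.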
Further reading:
- For custom objects which are not covered by built-in encoders see How to store custom objects in Dataset?
- For Row objects you have to provide an Encoder explicitly, as shown in Encoder error while trying to map dataframe row to updated row
- For debug cases, the case class must be defined outside of Main: https://stackoverflow.com/a/34715827/3535853
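
The Row case from the list above can be sketched roughly as follows. This assumes a Spark version (2.x through 3.4) in which RowEncoder(schema) is available in the catalyst package; it is an internal API, and newer releases may expose a different entry point. The schema, sample rows, and object name are made up for illustration:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoder, Row, SparkSession}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object RowEncoderSketch {
  // Assumed local session, for illustration only.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("row-encoder-sketch")
    .getOrCreate()

  // Assumed schema describing the rows produced by the map below.
  val schema: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("desc", StringType, nullable = true)
  ))

  val df: DataFrame =
    spark.createDataFrame(java.util.Arrays.asList(Row(1, "a"), Row(2, "b")), schema)

  // Row has no implicitly derivable encoder, so one is built from the schema.
  implicit val rowEnc: Encoder[Row] = RowEncoder(schema)

  // map picks up rowEnc; without it this line would not compile.
  val updated: Dataset[Row] =
    df.map(row => Row(row.getInt(0) + 1, row.getString(1)))
}
```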