Based on @Derek Kaknes‘s answer, here’s the solution I came up with for comparing schemas, being concerned only about column name, datatype & nullability and indifferent to metadata
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, StructField}
def getCleanedSchema(df: DataFrame): Map[String, (DataType, Boolean)] = {
df.schema.map { (structField: StructField) =>
structField.name.toLowerCase -> (structField.dataType, structField.nullable)
}.toMap
}
// Compare relevant information
def getSchemaDifference(schema1: Map[String, (DataType, Boolean)],
schema2: Map[String, (DataType, Boolean)]
): Map[String, (Option[(DataType, Boolean)], Option[(DataType, Boolean)])] = {
(schema1.keys ++ schema2.keys).
map(_.toLowerCase).
toList.distinct.
flatMap { (columnName: String) =>
val schema1FieldOpt: Option[(DataType, Boolean)] = schema1.get(columnName)
val schema2FieldOpt: Option[(DataType, Boolean)] = schema2.get(columnName)
if (schema1FieldOpt == schema2FieldOpt) None
else Some(columnName -> (schema1FieldOpt, schema2FieldOpt))
}.toMap
}
-
getCleanedSchema
method extracts information of interest – column datatype & nullability and returns amap
of column name totuple
-
getSchemaDifference
method returns amap
containing only those columns that differ in the two schemas. If a column is absent in one of the two schemas, then it’s corresponding properties would beNone