How can I change column types in Spark SQL’s DataFrame?

Edit: Newest version

Since Spark 2.x you should use the Dataset API instead when using Scala [1]. Check the docs here:

https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#withColumn(colName:String,col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame
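
For example, a minimal Scala sketch of a type change with that method (the DataFrame df and its string column "year" are assumptions for illustration):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// withColumn replaces an existing column when the name matches,
// so this turns the string column "year" into an integer column.
val casted = df.withColumn("year", col("year").cast(IntegerType))
casted.printSchema()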

If you are working with Python it is even easier; I leave the link here since this is a very highly voted question:

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.withColumn.html

>>> df.withColumn('age2', df.age + 2).collect()
[Row(age=2, name='Alice', age2=4), Row(age=5, name='Bob', age2=7)]

[1] https://spark.apache.org/docs/latest/sql-programming-guide.html:

In the Scala API, DataFrame is simply a type alias of Dataset[Row]. While, in Java API, users need to use Dataset<Row> to represent a DataFrame.

Edit: Older version

Since Spark 2.x you can use .withColumn. Check the docs here:

https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.Dataset@withColumn(colName:String,col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame

Oldest answer

Since Spark version 1.4 you can apply the cast method with a DataType on the column:

import org.apache.spark.sql.types.IntegerType

// Cast into a temporary column, drop the original, then rename
// the temporary column back to the original name.
val df2 = df.withColumn("yearTmp", df("year").cast(IntegerType))
  .drop("year")
  .withColumnRenamed("yearTmp", "year")
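
Put together as a self-contained sketch (the SparkSession setup and the sample rows are my own illustration, using the Spark 2.x entry point; they are not part of the original answer):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.IntegerType

val spark = SparkSession.builder().appName("cast-example").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data: "year" starts out as a string column.
val df = Seq(
  ("2012", "Tesla", "S", "No comment", ""),
  ("1997", "Ford", "E350", "Go get one now", "")
).toDF("year", "make", "model", "comment", "blank")

val df2 = df.withColumn("yearTmp", df("year").cast(IntegerType))
  .drop("year")
  .withColumnRenamed("yearTmp", "year")

df2.printSchema()  // "year" is now reported as integer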

If you are using SQL expressions, you can also do:

val df2 = df.selectExpr("cast(year as int) year", 
                        "make", 
                        "model", 
                        "comment", 
                        "blank")

For more info, check the docs:
http://spark.apache.org/docs/1.6.0/api/scala/#org.apache.spark.sql.DataFrame
