Replace missing values with mean – Spark Dataframe

Spark >= 2.2

You can use org.apache.spark.ml.feature.Imputer (which supports both mean and median strategies).

Scala:

    import org.apache.spark.ml.feature.Imputer

    val imputer = new Imputer()
      .setInputCols(df.columns)
      .setOutputCols(df.columns.map(c => s"${c}_imputed"))
      .setStrategy("mean")

    imputer.fit(df).transform(df)

Python:

    from pyspark.ml.feature import Imputer

    imputer = Imputer(
        inputCols=df.columns,
        outputCols=["{}_imputed".format(c) for c in df.columns]
    )

    imputer.fit(df).transform(df)

Spark < 2.2

Here you are:

    import org.apache.spark.sql.functions.mean

    df.na.fill(df.columns.zip(
      df.select(df.columns.map(mean(_)): …
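To make the effect of mean imputation concrete, here is a minimal plain-Python sketch (no Spark required) of what happens per column: the mean is taken over the observed values only, and each missing entry (None or NaN) is replaced by it. The helper name `impute_mean` and the sample rows are hypothetical and exist only for illustration; this is not Spark's implementation, just the same idea on a small list of dicts.

```python
import math


def impute_mean(rows, columns):
    """Replace None/NaN in each named column with that column's mean.

    Hypothetical helper for illustration only. The mean is computed over
    the non-missing values of each column.
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    means = {}
    for col in columns:
        present = [row[col] for row in rows if not is_missing(row[col])]
        # A column with no observed values has no mean; leave it untouched.
        if present:
            means[col] = sum(present) / len(present)

    return [
        {
            key: (means[key] if key in means and is_missing(value) else value)
            for key, value in row.items()
        }
        for row in rows
    ]


# Assumed sample data: two numeric columns, each with one missing value.
rows = [
    {"a": 1.0, "b": None},
    {"a": None, "b": 4.0},
    {"a": 3.0, "b": 6.0},
]
filled = impute_mean(rows, ["a", "b"])
print(filled)
```

In this sample, the missing value in column `a` is filled with the mean of 1.0 and 3.0, and the missing value in column `b` with the mean of 4.0 and 6.0; rows that were already complete are returned unchanged.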