Scala-Spark Dynamically call groupby and agg with parameter values

Your code is almost correct – with two issues:

  1. The declared return type of your function is DataFrame, but the last line is aggregated.show(), which returns Unit. Remove the call to show so the function returns aggregated itself, or simply return the result of agg directly

  2. DataFrame.groupBy expects its arguments as (col1: String, cols: String*), so you need to pass the first column separately and the remaining columns as varargs. You can do that as follows: df.groupBy(cols.head, cols.tail: _*)

Altogether, your function would be:

import org.apache.spark.sql.DataFrame

def groupAndAggregate(df: DataFrame, aggregateFun: Map[String, String], cols: List[String]): DataFrame = {
  val grouped = df.groupBy(cols.head, cols.tail: _*)
  val aggregated = grouped.agg(aggregateFun)
  aggregated
}

Or, an equivalent shorter version:

def groupAndAggregate(df: DataFrame, aggregateFun: Map[String, String], cols: List[String]): DataFrame = {
  df.groupBy(cols.head, cols.tail: _*).agg(aggregateFun)
}

If you do want to call show within your function:

def groupAndAggregate(df: DataFrame, aggregateFun: Map[String, String], cols: List[String]): DataFrame = {
  val grouped = df.groupBy(cols.head, cols.tail: _*)
  val aggregated = grouped.agg(aggregateFun)
  aggregated.show()
  aggregated
}
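
To illustrate the call site, here is a minimal, self-contained sketch. The SparkSession setup, sample data, and column names ("region", "product", "amount") are assumptions for the example, not taken from your question:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object GroupAndAggregateExample {

  def groupAndAggregate(df: DataFrame, aggregateFun: Map[String, String], cols: List[String]): DataFrame = {
    df.groupBy(cols.head, cols.tail: _*).agg(aggregateFun)
  }

  def main(args: Array[String]): Unit = {
    // Local session just for this sketch
    val spark = SparkSession.builder()
      .appName("groupAndAggregate-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data
    val df = Seq(
      ("EU", "book", 10),
      ("EU", "book", 20),
      ("US", "pen", 5)
    ).toDF("region", "product", "amount")

    // Group by two columns, sum the "amount" column
    val result = groupAndAggregate(df, Map("amount" -> "sum"), List("region", "product"))
    result.show()

    spark.stop()
  }
}
```

Note that the aggregated column comes back named by Spark's convention for map-based agg (here, "sum(amount)"); you can rename it afterwards with withColumnRenamed if needed.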
