Applying an aggregate function to every column of a given type

Asked: 2015-09-09 20:08:32

Tags: scala apache-spark apache-spark-sql

So I wrote the skeleton of how to average every FloatType column in a DataFrame (it doesn't work):

import scala.collection.mutable.ListBuffer

val descript = df.dtypes

// Collect the names of all FloatType columns
var decimalArr = new ListBuffer[String]()
for (i <- 0 to (descript.length - 1)) {
  if (descript(i)._2 == "FloatType") {
    decimalArr += descript(i)._1
  }
}

// Build statistical arguments for the DataFrame pass
var averageList = new ListBuffer[String]()
for (i <- 0 to (decimalArr.length - 1)) {
  averageList += "avg(" + '"' + decimalArr(i) + '"' + ")"
}

// Sample statistical call -- this is the line that does not compile
val sampAvg = df.agg(averageList).show

An example of the generated averageList is:

ListBuffer(avg("offer_id"), avg("decision_id"), avg("offer_type_cd"), avg("promo_id"), avg("pymt_method_type_cd"), avg("cs_result_id"), avg("cs_result_usage_type_cd"), avg("rate_index_type_cd"), avg("sub_product_id"))

The obvious problem is that val sampAvg = df.agg(averageList).show does not accept a ListBuffer as input. Converting it with .toString doesn't help either; agg wants org.apache.spark.sql.Column*. Does anyone know a way to do something along the lines of what I'm attempting?

Side note: I'm on Spark 1.3.

1 Answer:

Answer 0 (score: 3)

You can first build the list of aggregation expressions (the example filters on "DoubleType"; for your FloatType columns, substitute "FloatType"):

import org.apache.spark.sql.functions.{col, avg, lit}

val exprs = df.dtypes
  .filter(_._2 == "DoubleType")
  .map(ct => avg(col(ct._1))).toList

and then either pattern match:

exprs match {
  case h::t => df.agg(h, t:_*)
  case _ => sqlContext.emptyDataFrame
}
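The case _ branch covers a DataFrame with no columns of the requested type: exprs is then empty, there is nothing to aggregate, and an empty DataFrame is returned instead.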

or use a dummy column:

df.agg(lit(1).alias("_dummy"), exprs: _*).drop("_dummy")
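This trick works because agg's signature is agg(expr: Column, exprs: Column*): it requires at least one fixed Column as the first argument. The constant _dummy column fills that slot even when exprs is empty, and is dropped again afterwards.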

If you want to apply multiple functions, you can flatMap explicitly:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{avg, min, max}

val funs: List[(String => Column)] = List(min, max, avg)

val exprs: Array[Column] = df.dtypes 
   .filter(_._2 == "DoubleType")
   .flatMap(ct => funs.map(fun => fun(ct._1)))

or use a for comprehension:

val exprs: Array[Column] = for {
  cname <- df.dtypes.filter(_._2 == "DoubleType").map(_._1)
  fun <- funs
} yield fun(cname)
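Either way, exprs can then be consumed with the dummy-column trick shown above. A sketch (assuming the lit import from earlier; the call is the same regardless of how exprs was built):

// Aggregate every collected expression in one pass, then remove the helper column
df.agg(lit(1).alias("_dummy"), exprs: _*).drop("_dummy").show()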

If you want to use the pattern matching approach instead, convert exprs to a List first, as sketched below.
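A minimal sketch, reusing the match from earlier:

exprs.toList match {
  case h :: t => df.agg(h, t: _*)      // at least one expression: aggregate normally
  case _ => sqlContext.emptyDataFrame  // no matching columns: nothing to aggregate
}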