Question

I have the following Dataset:

    val myDS = List(("a",1,1.1), ("b",2,1.2), ("a",3,3.1), ("b",4,1.4), ("a",5,5.1)).toDS
    // and aggregation
    // myDS.groupByKey(t2 => t2._1).agg(myAvg).collect()

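I want to write a custom aggregate function myAvg that takes a Tuple3 as input and returns sum(_._2)/sum(_._3). I know this can be computed in other ways, but I specifically want to write a custom aggregator.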
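As a reference point, the target value for each key can be worked out with plain Scala collections. This is only a sketch for checking results, not the custom aggregator itself; the names rows and expected are illustrative and do not appear in the original code.

```scala
// Same rows as myDS, held in a plain Scala List (no Spark required).
val rows = List(("a", 1, 1.1), ("b", 2, 1.2), ("a", 3, 3.1), ("b", 4, 1.4), ("a", 5, 5.1))

// For each key: sum of the Int field divided by sum of the Double field.
val expected: Map[String, Double] =
  rows.groupBy(_._1).map { case (key, group) =>
    key -> (group.map(_._2).sum / group.map(_._3).sum)
  }

// "a": (1 + 3 + 5) / (1.1 + 3.1 + 5.1) = 9 / 9.3, roughly 0.968
// "b": (2 + 4) / (1.2 + 1.4) = 6 / 2.6, roughly 2.308
expected.toSeq.sortBy(_._1).foreach { case (k, v) => println(s"$k -> $v") }
```

A working myAvg should produce values close to these for the two keys.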

I wrote something like this:

    import org.apache.spark.sql.expressions.Aggregator
    import org.apache.spark.sql.{Encoder, Encoders}

    val myAvg =  new Aggregator[Tuple3[String, Integer, Double], 
                                Tuple2[Integer,Double], 
                                Double] {
      def zero: Tuple2[Integer,Double] = Tuple2(0,0.0)
      def reduce(agg: Tuple2[Integer,Double], 
                 a: Tuple3[String, Integer,Double]): Tuple2[Integer,Double] = 
                              Tuple2(agg._1 + a._2, agg._2 + a._3)
      def merge(agg1: Tuple2[Integer,Double], 
                agg2: Tuple2[Integer,Double]): Tuple2[Integer,Double] = 
                              Tuple2(agg1._1 + agg2._1, agg1._2 + agg2._2) 
      def finish(res: Tuple2[Integer,Double]): Double = res._1/res._2
      def bufferEncoder: Encoder[(Integer, Double)] =
                              Encoders.tuple(Encoders.INT, Encoders.scalaDouble)
      def outputEncoder: Encoder[Double] = Encoders.scalaDouble
    }.toColumn()

Unfortunately, I get the following error:

    java.lang.RuntimeException: Unsupported literal type class scala.runtime.BoxedUnit ()
        at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:75)
        at org.apache.spark.sql.functions$.lit(functions.scala:101)
        at org.apache.spark.sql.Column.apply(Column.scala:217)

What is wrong?

On my local Spark 2.1 I also get a warning:

    warning: there was one deprecation warning; re-run with -deprecation for details

What is deprecated in my code?

Thanks for any advice.

Answer 1

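The problem here seems to be that you are using Java's Integer rather than Scala's Int. If you replace every use of Integer in the aggregator implementation with Int (and replace Encoders.INT with Encoders.scalaInt), it works as expected: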

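    import org.apache.spark.sql.expressions.Aggregator
    import org.apache.spark.sql.{Encoder, Encoders, TypedColumn}

    val myAvg: TypedColumn[(String, Int, Double), Double] =
      new Aggregator[(String, Int, Double), (Int, Double), Double] {
        // Buffer is (running sum of _2, running sum of _3); both start at zero.
        def zero: (Int, Double) = (0, 0.0)
        // Fold one input row into the buffer.
        def reduce(agg: (Int, Double), a: (String, Int, Double)): (Int, Double) =
          (agg._1 + a._2, agg._2 + a._3)
        // Combine two partial buffers.
        def merge(agg1: (Int, Double), agg2: (Int, Double)): (Int, Double) =
          (agg1._1 + agg2._1, agg1._2 + agg2._2)
        // Int / Double yields a Double: sum(_._2) / sum(_._3).
        def finish(res: (Int, Double)): Double = res._1 / res._2
        def bufferEncoder: Encoder[(Int, Double)] =
          Encoders.tuple(Encoders.scalaInt, Encoders.scalaDouble)
        def outputEncoder: Encoder[Double] = Encoders.scalaDouble
      }.toColumn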

(I also applied some syntactic sugar, removing the explicit Tuple references.)
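One more difference between the two versions is worth noting: the question's code ends with .toColumn() while the code above ends with .toColumn. The stack trace in the question passes through Column.apply and fails on a BoxedUnit literal, which is consistent with .toColumn() being read as toColumn.apply(()), that is, apply called on the resulting column with the Unit value. The compiler inserting () into an empty argument list is also a plausible source of the deprecation warning, though that is an assumption; the warning text shown does not confirm it. A minimal sketch of the mechanism, using hypothetical stand-in classes (FakeColumn and FakeAggregator are not Spark's API):

```scala
// Hypothetical stand-ins for Spark's Column and Aggregator, only to show the parsing.
class FakeColumn {
  // Accepts any value, like the Column.apply frame seen in the stack trace.
  def apply(extraction: Any): String = s"apply called with: $extraction"
}

class FakeAggregator {
  // Declared without a parameter list, matching how toColumn is called in the working code.
  def toColumn: FakeColumn = new FakeColumn
}

val agg = new FakeAggregator

// What the working code does: toColumn returns the column itself.
val column: FakeColumn = agg.toColumn

// What .toColumn() amounts to once the Unit argument is spelled out:
// toColumn is evaluated first, then apply is invoked on its result with ().
val applied: String = agg.toColumn.apply(())
println(applied) // prints "apply called with: ()"
```

In the sketch the Unit value is simply turned into a string; in Spark, Column.apply tries to turn its argument into a literal, and Unit is not a supported literal type, hence the exception.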

Spark custom Aggregator >= 2.0 (Scala)
