Spark `reduceGroups`: overloaded method error and alternatives

Asked: 2016-11-06 15:48:43

Tags: scala apache-spark

Running spark-shell with Spark version 2.0.1 and Scala version 2.11.8.

The following code fails to type-check:

val is = sc.parallelize(0 until 100).toDS() // toDS() so that groupByKey/reduceGroups below come from the Dataset API
val ds = is.map{i => (s"${i%10}", i)}
val gs = ds.groupByKey(r => r._1)
gs.reduceGroups((r: ((String, Int), (String, Int))) => (r._1._1, r._1._2 + r._2._2))

The error message is:

<console>:32: error: overloaded method value reduceGroups with alternatives:
  (f: org.apache.spark.api.java.function.ReduceFunction[(String, Int)])org.apache.spark.sql.Dataset[(String, (String, Int))] <and>
  (f: ((String, Int), (String, Int)) => (String, Int))org.apache.spark.sql.Dataset[(String, (String, Int))]
 cannot be applied to ((((String, Int), (String, Int))) => (String, Int))
       gs.reduceGroups((r : ((String, Int), (String, Int))) => (r._1._1, r._1._2 + r._2._2))

As far as I can tell, the lambda passed to reduceGroups exactly matches the signature required by the second alternative.

1 Answer:

Answer 0 (score: 3):

reduceGroups expects a function that takes two arguments, while the function you pass takes a single argument (a pair). Compare the signature you passed:

((V, V)) ⇒ V

while the expected signature is:

(V, V) ⇒ V

where V is (String, Int).
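
To see the mismatch outside Spark, here is a minimal plain-Scala sketch (my illustration, not part of the original answer) contrasting the two shapes; the standard library's tupled and Function.untupled convert between them:

// Two arguments: the shape reduceGroups expects, (V, V) => V
val twoArgs: (Int, Int) => Int = (a, b) => a + b

// One tuple argument: the shape that was actually passed, ((V, V)) => V
val oneTupleArg: ((Int, Int)) => Int = { case (a, b) => a + b }

// Converting between the two shapes:
val asTupled: ((Int, Int)) => Int = twoArgs.tupled
val asTwoArgs: (Int, Int) => Int = Function.untupled(oneTupleArg)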

You can use:

gs.reduceGroups(
  (v1: (String, Int), v2: (String, Int)) => (v1._1, v1._2 + v2._2)
)
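
Note that this returns a Dataset[(String, (String, Int))], so the key appears twice. If you want a flat (key, sum) pair, one option (a sketch, assuming the spark-shell implicits are in scope for the tuple encoder) is a final map:

gs.reduceGroups((v1: (String, Int), v2: (String, Int)) => (v1._1, v1._2 + v2._2))
  .map { case (key, (_, sum)) => (key, sum) } // drop the duplicated key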

A more concise solution, which doesn't duplicate the key:

spark.range(0, 100)
  .groupByKey(i => s"${i % 10}")
  .reduceGroups(_ + _)
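
As a quick sanity check in spark-shell (my own verification, not part of the original answer): each digit key k groups the numbers k, k + 10, ..., k + 90, so its sum should be 450 + 10 * k:

spark.range(0, 100)
  .groupByKey(i => s"${i % 10}")
  .reduceGroups(_ + _)
  .collect()
  .sortBy(_._1)
  .foreach(println)
// (0,450)
// (1,460)
// ...
// (9,540)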