Spark reduceByKey with generic types (Scala)

Asked: 2018-05-13 17:11:27

Tags: scala apache-spark types scala-generics

I am trying to create some simple custom aggregation operators in Spark using Scala.

I created a simple hierarchy of operators with the following superclass:

sealed abstract class Aggregator(val name: String) {
  type Key = Row  // org.apache.spark.sql.Row
  type Value

  ...
}

I also have a companion object that constructs the appropriate aggregator each time. Note that each operator can specify the Value type it wants; a simplified sketch of this setup follows.
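(For illustration only, such a companion object and one concrete operator could look roughly like the sketch below; the SumAggregator name and the mapper/reducer signatures are placeholders inferred from the usage further down, not the actual code.)

import org.apache.spark.sql.Row

// Hypothetical concrete operator; it has to live in the same file as the sealed superclass.
// mapper and reducer are assumed to be declared abstract in the "..." part of Aggregator above.
class SumAggregator extends Aggregator("SUM") {
  type Value = Double
  val mapper: Row => (Key, Value) = row => (row, row.getDouble(0)) // placeholder key/value extraction
  val reducer: (Value, Value) => Value = _ + _
}

object Aggregator {
  // Factory that picks the concrete operator by name.
  def apply(name: String): Aggregator = name match {
    case "SUM" => new SumAggregator
    // ... other operators
  }
}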

Now my problem comes when I try to call reduceByKey:

val agg = Aggregator("SUM")
val res = rdd
    .map(agg.mapper)
    .reduceByKey(agg.reducer(_: agg.Value, _: agg.Value))

The error is:

value reduceByKey is not a member of org.apache.spark.rdd.RDD[(agg.Key, agg.Value)]

Depending on my needs, Value can be a numeric type or a tuple, so I did not put any bound on it. If I replace the Value type declaration with:

type Value = Double

in the Aggregator class, everything works fine. Therefore, I suppose the error has to do with reduceByKey not knowing the exact Value type at compile time.

Any ideas on how to work around this?

1 Answer:

Answer 0 (score: 2):

Your RDD cannot be implicitly converted into PairRDDFunctions, because all the implicit ClassTags for keys and values are missing.
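For reference, the conversion that adds reduceByKey lives in the RDD companion object; its signature is roughly the following (abbreviated, shown only for illustration):

// Approximate signature from object org.apache.spark.rdd.RDD:
implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] =
  new PairRDDFunctions(rdd)

Because agg.Value is an abstract type member, the compiler cannot materialize a ClassTag[agg.Value] to fill in vt, so the conversion is never applied and reduceByKey is not found on the RDD.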

You might want to include the class tags as implicit parameters in your Aggregator:

import scala.reflect.ClassTag

sealed abstract class Aggregator[K: ClassTag, V: ClassTag](name: String) {
  implicit val keyClassTag: ClassTag[K] = implicitly
  implicit val valueClassTag: ClassTag[V] = implicitly
}

or maybe:

sealed abstract class Aggregator[K, V](name: String)(implicit kt: ClassTag[K], vt: ClassTag[V]) {
  implicit val keyClassTag: ClassTag[K] = kt
  implicit val valueClassTag: ClassTag[V] = vt
}

or maybe even:

sealed abstract class Aggregator(name: String) {
  type K
  type V
  implicit def keyClassTag: ClassTag[K]
  implicit def valueClassTag: ClassTag[V]
}

The last variant would shift the responsibility for providing the ClassTags to the implementor of the abstract class.
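With that last variant, a concrete operator might provide the tags itself, for example (a hypothetical subclass, sketched only to illustrate the pattern; as a subclass of a sealed class it must sit in the same file):

import scala.reflect.ClassTag
import org.apache.spark.sql.Row

class SumAggregator extends Aggregator("SUM") {
  type K = Row
  type V = Double
  // The types are concrete here, so the tags can be constructed explicitly:
  implicit val keyClassTag: ClassTag[Row] = ClassTag(classOf[Row])
  implicit val valueClassTag: ClassTag[Double] = ClassTag.Double
  // ... mapper / reducer as in the question
}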

Now, when using such an aggregator agg in a reduceByKey, you would have to make sure that the implicitly provided class tags are visible in the implicit scope at the call site:

val agg = Aggregator("SUM")
import agg._ // now the implicits should be visible
val res = rdd
    .map(agg.mapper)
    .reduceByKey(agg.reducer(_: agg.Value, _: agg.Value))
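The import matters because the tags are path-dependent: after import agg._, a ClassTag for agg's key and value types (ClassTag[agg.Value] in the question's naming) is in implicit scope, which is exactly what the rddToPairRDDFunctions conversion needs to turn the RDD[(agg.Key, agg.Value)] into a PairRDDFunctions, so reduceByKey resolves again.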