I'm trying to create some simple custom aggregation operators in Spark using Scala.
I created a simple hierarchy of operators, with the following superclass:
sealed abstract class Aggregator(val name: String) {
type Key = Row // org.apache.spark.sql.Row
type Value
...
}
I also have a companion object that constructs the appropriate aggregator on each call. Note that each operator can specify the Value type it wants.
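For concreteness, here is a minimal sketch of what such a companion object might look like. The mapper and reducer members and the "SUM" case are illustrative assumptions standing in for the elided parts of the class, not the original code:
object Aggregator {
  // Hypothetical factory: builds the aggregator matching the operator name.
  // Assumes the superclass also declares abstract mapper and reducer members.
  def apply(name: String): Aggregator = name match {
    case "SUM" => new Aggregator("SUM") {
      type Value = Double
      // Hypothetical extraction: key on the whole row, sum the first column.
      def mapper(row: Key): (Key, Value) = (row, row.getDouble(0))
      def reducer(a: Value, b: Value): Value = a + b
    }
    case other => throw new IllegalArgumentException(s"unknown aggregator: $other")
  }
}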
Now my problem comes when I try to call reduceByKey:
val agg = Aggregator("SUM")
val res = rdd
.map(agg.mapper)
.reduceByKey(agg.reducer(_: agg.Value, _: agg.Value))
The error is:
value reduceByKey is not a member of org.apache.spark.rdd.RDD[(agg.Key, agg.Value)]
For my needs, Value can be a numeric type or a tuple, so I did not put any bounds on it. If I replace the Value type declaration with:
type Value = Double
in the Aggregator class, everything works fine. So I suspect the error is related to reduceByKey not knowing the exact Value type at compile time.
Any ideas on how to work around this?
Answer 0 (score: 2)
Your RDD cannot be implicitly converted into PairRDDFunctions, because all the implicit ClassTags for keys and values are missing.
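For reference, the implicit conversion that adds reduceByKey lives on the RDD companion object; its signature looks roughly like this (paraphrased, not copied verbatim from Spark):
implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V]
Since agg.Value is abstract, the compiler cannot materialize a ClassTag for it, so the conversion is never applied and reduceByKey is reported as not being a member.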
You might want to include the class tags as implicit parameters in your Aggregator:
import scala.reflect.ClassTag

sealed abstract class Aggregator[K: ClassTag, V: ClassTag](name: String) {
  implicit val keyClassTag: ClassTag[K] = implicitly
  implicit val valueClassTag: ClassTag[V] = implicitly
}
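With this variant, a concrete subclass fixes K and V, and the context bounds pick up the tags automatically at the definition site. A hypothetical example (SumAggregator is an assumed name; it must live in the same file, because the class is sealed):
import org.apache.spark.sql.Row

// ClassTag[Row] and ClassTag[Double] are materialized by the compiler
// to satisfy the context bounds of the superclass.
class SumAggregator extends Aggregator[Row, Double]("SUM")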
or maybe:
sealed abstract class Aggregator[K, V](name: String)(implicit kt: ClassTag[K], vt: ClassTag[V]) {
implicit val keyClassTag: ClassTag[K] = kt
implicit val valueClassTag: ClassTag[V] = vt
}
or maybe even:
sealed abstract class Aggregator(name: String) {
type K
type V
implicit def keyClassTag: ClassTag[K]
implicit def valueClassTag: ClassTag[V]
}
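With this last variant, a concrete implementation supplies the tags itself. A hypothetical sketch (again in the same file, since the class is sealed):
import scala.reflect.ClassTag
import org.apache.spark.sql.Row

class SumAggregator extends Aggregator("SUM") {
  type K = Row
  type V = Double
  // Provide the concrete tags promised by the superclass; constructed
  // explicitly rather than via implicitly, to avoid a self-referential
  // implicit lookup.
  implicit val keyClassTag: ClassTag[K] = ClassTag(classOf[Row])
  implicit val valueClassTag: ClassTag[V] = ClassTag.Double
}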
The last variant would shift the responsibility for providing the ClassTags to the implementor of the abstract class.
Now, when using an aggregator a of type Aggregator[K, V] in a reduceByKey, you would have to make sure that those implicitly provided class tags are in the current implicit scope:
val agg = Aggregator("SUM")
import agg._ // now the implicits should be visible
val res = rdd
.map(agg.mapper)
.reduceByKey(agg.reducer(_: agg.Value, _: agg.Value))
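Putting it together, a minimal end-to-end sketch. It assumes the original Aggregator (with its Key/Value type members and the hypothetical mapper, reducer and companion object from above) has been extended with the abstract implicit tag members of the last variant; the SparkContext setup and the sample rows are likewise assumptions:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row

val sc = new SparkContext(new SparkConf().setAppName("agg-demo").setMaster("local[*]"))
val rdd = sc.parallelize(Seq(Row(1.0), Row(2.0), Row(2.0)))

val agg = Aggregator("SUM")
import agg._ // keyClassTag and valueClassTag are now in implicit scope

val res = rdd
  .map(agg.mapper)
  .reduceByKey(agg.reducer(_: agg.Value, _: agg.Value))
res.collect().foreach(println)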