I am trying to accumulate a set in Apache Spark using a custom accumulator. The result should have type Set[String]. To do this, I created a custom accumulator:
object SetAccumulatorParam extends AccumulatorParam[Set[String]] {
  def addInPlace(r1: mutable.Set[String], r2: mutable.Set[String]): mutable.Set[String] = {
    r1 ++= r2
  }
  def zero(initialValue: mutable.Set[String]): mutable.Set[String] = {
    Set()
  }
}
However, I cannot instantiate a variable of this type.
val tags = sc.accumulator(Set(""))(SetAccumulatorParam)
This results in the following error. Please help.
required: org.apache.spark.AccumulatorParam[Set[String]]
Answer 0 (score: 2)
Adding to Traian's answer, here is a general-case SetAccumulator for Spark 2.x.
import org.apache.spark.util.AccumulatorV2
class SetAccumulator[T](var value: Set[T]) extends AccumulatorV2[T, Set[T]] {
  def this() = this(Set.empty[T])
  override def isZero: Boolean = value.isEmpty
  override def copy(): AccumulatorV2[T, Set[T]] = new SetAccumulator[T](value)
  override def reset(): Unit = value = Set.empty[T]
  override def add(v: T): Unit = value = value + v
  override def merge(other: AccumulatorV2[T, Set[T]]): Unit = value = value ++ other.value
}
You can use it like this:
val accum = new SetAccumulator[String]()
spark.sparkContext.register(accum, "My Accum") // Optional, name it for SparkUI
spark.sparkContext.parallelize(Seq("a", "b", "a", "b", "c")).foreach(s => accum.add(s))
accum.value
Which outputs:
Set[String] = Set(a, b, c)
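To illustrate what `add` and `merge` do when Spark combines per-task copies, here is a minimal sketch that simulates two tasks by hand, with no SparkContext. It assumes spark-core is on the classpath. `MergeSketch` and `run` are hypothetical names used only for illustration, and the class is repeated so the sketch compiles on its own.

```scala
import org.apache.spark.util.AccumulatorV2

// Same shape as the SetAccumulator in this answer, repeated so the sketch is self-contained.
class SetAccumulator[T](var value: Set[T]) extends AccumulatorV2[T, Set[T]] {
  def this() = this(Set.empty[T])
  override def isZero: Boolean = value.isEmpty
  override def copy(): AccumulatorV2[T, Set[T]] = new SetAccumulator[T](value)
  override def reset(): Unit = value = Set.empty[T]
  override def add(v: T): Unit = value = value + v
  override def merge(other: AccumulatorV2[T, Set[T]]): Unit = value = value ++ other.value
}

// Hypothetical driver object: mimics how per-task copies are filled and merged back.
object MergeSketch {
  def run(): Set[String] = {
    val driverAcc = new SetAccumulator[String]()

    // Each simulated task starts from an empty copy of the accumulator.
    val task1 = driverAcc.copyAndReset()
    val task2 = driverAcc.copyAndReset()

    // Tasks add elements independently; "b" is seen by both.
    Seq("a", "b").foreach(s => task1.add(s))
    Seq("b", "c").foreach(s => task2.add(s))

    // Merging the task copies back uses set union, so duplicates collapse.
    driverAcc.merge(task1)
    driverAcc.merge(task2)
    driverAcc.value
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Because `merge` is a set union, the order in which task results arrive does not affect the final value.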
Answer 1 (score: 1)
Update for 1.6:
object StringSetAccumulatorParam extends AccumulatorParam[Set[String]] {
  def zero(initialValue: Set[String]): Set[String] = { Set() }
  def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = { s1 ++ s2 }
}
val stringSetAccum = sc.accumulator(Set[String]())(StringSetAccumulatorParam)
sc.parallelize(Array("1", "2", "3", "1")).foreach(s => stringSetAccum += Set(s))
stringSetAccum.value.toString
res0: String = Set(2, 3, 1)
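In Spark 2.0, the built-in collectionAccumulator may be good enough (if you care about distinct values, you can check for a value and add it only if it is not already present):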
val collAcc = spark.sparkContext.collectionAccumulator[String]("myCollAcc")
collAcc: org.apache.spark.util.CollectionAccumulator[String] = CollectionAccumulator(id: 32154, name: Some(myCollAcc), value: [])
spark.sparkContext.parallelize(Array("1", "2", "3")).foreach(s => collAcc.add(s))
collAcc.value.toString
res0: String = [3, 2, 1]
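If distinct values are needed, one simple alternative to checking before each add is to deduplicate on the driver when reading the result, since each task works on its own copy of the accumulator and a check inside a task only sees that task's values. The following is a minimal sketch, assuming a local SparkSession with spark-sql on the classpath. `DistinctCollAcc` and `run` are hypothetical names used only for illustration.

```scala
import org.apache.spark.sql.SparkSession
// On Scala 2.13+ scala.jdk.CollectionConverters._ is the preferred import.
import scala.collection.JavaConverters._

// Hypothetical example object: collects strings, then deduplicates on the driver.
object DistinctCollAcc {
  def run(): Set[String] = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("distinct-coll-acc")
      .getOrCreate()
    try {
      val collAcc = spark.sparkContext.collectionAccumulator[String]("myCollAcc")

      // "1" appears twice, so the raw accumulator value holds a duplicate.
      spark.sparkContext
        .parallelize(Seq("1", "2", "3", "1"))
        .foreach(s => collAcc.add(s))

      // collAcc.value is a java.util.List[String]; converting to a Set drops duplicates.
      collAcc.value.asScala.toSet
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

This keeps every added element in memory until the conversion, so for many duplicates a set-based accumulator may be the better fit.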
More info: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.AccumulatorV2