I need my own metrics, driven by business-object attributes. For example, for four keys, I want to aggregate counts during the processing stage. So whenever an executor encounters a new aggregation label while processing, a new metric accumulator should be created. Like this:
val ints = sparkSession.sparkContext.parallelize(0 to 9, 3)
// count odd and even numbers
ints.foreach { n =>
  println(s"int: $n")
  // getOrCreateAccumulator is pseudocode for what I'd like to do
  val metricAccumulator = getOrCreateAccumulator((n % 2).toString)
  metricAccumulator.add(1)
}
Is it possible to create accumulators from executors? I ask because I don't know the exact aggregation buckets in advance. Or is there a better approach?
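For context, Spark accumulators are created and registered on the driver before the job runs, so rather than creating one accumulator per label on an executor, a common workaround is a single accumulator whose value is a map keyed by label. The core "get or create the bucket, then increment" step can be sketched in plain Scala (the `increment` helper is hypothetical, for illustration only; it is not a Spark API):

```scala
import scala.collection.mutable

// One map holds all buckets; a bucket springs into existence the
// first time its label is seen, so no labels need to be known up front.
val buckets = mutable.HashMap.empty[String, Int]

def increment(label: String): Unit =
  buckets.update(label, buckets.getOrElse(label, 0) + 1)

(0 to 9).foreach(n => increment((n % 2).toString))
// buckets now holds "0" -> 5 and "1" -> 5
```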
**Update 1.** I have created my own map accumulator based on the following:
import scala.collection.mutable.{HashMap => MutableHashMap}
import org.apache.spark.util.AccumulatorV2

class HashMapAccumulator(var value: MutableHashMap[String, Int])
    extends AccumulatorV2[(String, Int), MutableHashMap[String, Int]] {

  def this() = this(MutableHashMap.empty)

  override def isZero: Boolean = value.isEmpty

  // copy() must carry over the current value; returning an empty
  // accumulator here would silently drop partial counts
  override def copy(): AccumulatorV2[(String, Int), MutableHashMap[String, Int]] =
    new HashMapAccumulator(value.clone())

  override def reset(): Unit = value = MutableHashMap.empty

  override def add(v: (String, Int)): Unit = value.get(v._1) match {
    case Some(e) => value.update(v._1, e + v._2)
    case None    => value += v
  }

  override def merge(other: AccumulatorV2[(String, Int), MutableHashMap[String, Int]]): Unit =
    other match {
      case map: HashMapAccumulator =>
        map.value.foreach(v => this.add(v))
      case _ =>
        throw new UnsupportedOperationException(
          s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
    }
}
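As a sanity check, the add/merge semantics of the class above can be exercised without Spark by simulating two per-partition maps in plain Scala (the `add` and `merge` helpers below simply mirror the accumulator's methods; this is a sketch, not the Spark API):

```scala
import scala.collection.mutable.{HashMap => MutableHashMap}

// Mirror of HashMapAccumulator.add: bump the count for a label
def add(m: MutableHashMap[String, Int], v: (String, Int)): Unit =
  m.update(v._1, m.getOrElse(v._1, 0) + v._2)

// Mirror of HashMapAccumulator.merge: fold one partition's map into another
def merge(a: MutableHashMap[String, Int], b: MutableHashMap[String, Int]): Unit =
  b.foreach(add(a, _))

val partition1 = MutableHashMap("0" -> 2, "1" -> 1)
val partition2 = MutableHashMap("1" -> 3)
merge(partition1, partition2)
// partition1 now holds "0" -> 2 and "1" -> 4
```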
and used it in a test like this:
val ints = sparkSession.sparkContext.parallelize(0 to 9, 3)
val accum = new HashMapAccumulator()
sparkSession.sparkContext.register(accum, "My Accum")
var counter = 0
ints.foreach { n =>
  println(s"int: $n")
  counter = counter + 1 // note: mutates an executor-side copy; the driver's counter stays 0
  accum.add((n % 2).toString -> 1)
}
This seems to work fine. Any suggestions?
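One suggestion worth weighing: accumulators are intended for side metrics, and updates made inside transformations can be re-applied on task retries (updates inside actions such as `foreach` are applied exactly once). If the bucket counts are actually the job's output, the same result can be computed as a regular aggregation, e.g. `rdd.map(n => (n % 2).toString).countByValue()`. The plain-Scala equivalent of that aggregation for `0 to 9`:

```scala
// Batch aggregation instead of an accumulator: group by bucket label,
// then count the members of each bucket.
val counts: Map[String, Int] =
  (0 to 9).groupBy(n => (n % 2).toString).map { case (k, v) => k -> v.size }
// counts holds "0" -> 5 and "1" -> 5
```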