Create or access Spark accumulators from executors

Asked: 2019-01-29 10:31:03

Tags: scala apache-spark metrics

I need my own metrics driven by business-object attributes, e.g. four keys that I want to aggregate by during the processing stage. So whenever an executor encounters a new aggregation tag during processing, it should create a new metric accumulator, like this:

val ints = sparkSession.sparkContext.parallelize(0 to 9, 3)
// calculate odd and even numbers
ints.foreach { n =>
  println(s"int: $n")
  // getOrCreateAccumulator is the helper I would like to have: look up the
  // accumulator for this bucket, or create it on the fly if it does not exist yet
  val metricAccumulator = getOrCreateAccumulator((n % 2).toString)
  metricAccumulator.add(1)
}

Is it possible to create accumulators from executors? I don't know the exact set of aggregation buckets up front. Or is there a better approach?
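For comparison, here is a minimal sketch of what works today when the buckets are known in advance (the keys "0" and "1" are hard-coded purely for illustration): accumulators are created and registered through SparkContext on the driver, and the task closure only uses the already-registered instances, since executors work with serialized per-task copies that Spark merges back after each task.

import org.apache.spark.util.LongAccumulator

// driver side: create and register one accumulator per known bucket
val buckets = Seq("0", "1") // assumed to be known before the job starts
val perBucket: Map[String, LongAccumulator] =
  buckets.map(k => k -> sparkSession.sparkContext.longAccumulator(s"metric-$k")).toMap

ints.foreach { n =>
  // executor side: only touches the registered accumulators captured in the closure
  perBucket((n % 2).toString).add(1)
}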

Update 1: I have created my own MapAccumulator based on the following:

import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable.{HashMap => MutableHashMap}

class HashMapAccumulator(var value: MutableHashMap[String, Int]) extends AccumulatorV2[(String, Int), MutableHashMap[String, Int]] {
  def this() = this(MutableHashMap.empty)

  override def isZero: Boolean = value.isEmpty

  // copy() should preserve the current value; returning an empty accumulator here
  // would not match Spark's copy/copyAndReset contract
  override def copy(): AccumulatorV2[(String, Int), MutableHashMap[String, Int]] =
    new HashMapAccumulator(value.clone())

  override def reset(): Unit = value = MutableHashMap.empty

  // increment the count for the key, inserting it if it is not present yet
  override def add(v: (String, Int)): Unit = value.get(v._1) match {
    case Some(e) => value.update(v._1, e + v._2)
    case None => value += v
  }

  // fold another HashMapAccumulator into this one, key by key
  override def merge(other: AccumulatorV2[(String, Int), MutableHashMap[String, Int]]): Unit = other match {
    case map: HashMapAccumulator =>
      map.value.foreach(v => this.add(v))
    case _ =>
      throw new UnsupportedOperationException(
        s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
  }
}
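A quick local sanity check of the add/merge logic (no Spark job involved; the values are arbitrary):

val a = new HashMapAccumulator()
val b = new HashMapAccumulator()
a.add("0" -> 2)
b.add("0" -> 1)
b.add("1" -> 3)
a.merge(b)
println(a.value) // expected: Map("0" -> 3, "1" -> 3)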

and use it in a test like this:

val ints = sparkSession.sparkContext.parallelize(0 to 9, 3)
val accum = new HashMapAccumulator()
sparkSession.sparkContext.register(accum, "My Accum")
var counter = 0
ints.foreach { n =>
  println(s"int: $n")
  counter = counter + 1 // local var captured in the closure; not updated on the driver
  accum.add((n % 2).toString -> 1)
}
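Once the foreach action has finished, the merged map is available on the driver through accum.value, while counter is only incremented inside the serialized task closures and so stays unchanged on the driver. A minimal follow-up check (the expected numbers assume the 0 to 9 input above):

println(accum.value) // expected: Map("0" -> 5, "1" -> 5)
println(counter)     // stays 0 on the driver; each task incremented its own copy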

This seems to work fine. Any suggestions?

0 Answers