Question

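I am trying to use an accumulator inside a SparkListener to record the number of records processed in a DataFrame that is created from a Spark SQL query.

When I create a DataFrame from a SQL query, the DataFrame itself contains the correct number of records. However, when I rerun the code several times, the value inside the accumulator is sometimes incorrect.

The code is as follows:

Attaching the listener to the SparkContext: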

_sparkContext.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    OutputLogger.incrementRecord(taskEnd)
  }
})

Accumulator code:

  private var recordAccumulator = ContextHelper.getSparkContext.accumulator(0, "Record Count")
  private var stageAccumulator = ContextHelper.getSparkContext().accumulator(0, "Stage Id")

  def incrementRecord(taskEnd: SparkListenerTaskEnd) = {
    if (taskEnd.taskInfo.accumulables.size > 0 && taskEnd.taskType.toLowerCase().contains("shuffle")) {
      val extractedValue = taskEnd.taskMetrics.shuffleWriteMetrics.get.shuffleRecordsWritten.toInt
      if (stageAccumulator.value == taskEnd.stageId.toInt) {
        recordAccumulator += extractedValue
      } else {
        stageAccumulator.setValue(taskEnd.stageId.toInt)
        recordAccumulator.setValue(extractedValue)
      }
    }
  }
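
To make the reset-on-stage-change logic above easier to reason about, here is a minimal sketch that does not depend on Spark. `StageRecordCounter`, `StageRecordCounterDemo`, `replay`, and the event values are all hypothetical and chosen only for illustration; the two plain fields merely stand in for `stageAccumulator` and `recordAccumulator`. It shows that the count is kept only for the most recently seen stage, so the result depends on the order in which task-end events arrive. Whether that ordering effect is what happens in the run described in this question is not established here.

```scala
// Hypothetical, Spark-free model of the incrementRecord logic above.
// currentStage / recordCount stand in for stageAccumulator / recordAccumulator.
final class StageRecordCounter {
  private var currentStage: Int = 0
  private var recordCount: Long = 0L

  // Add to the count while the stage is unchanged, otherwise reset it.
  def onTaskEnd(stageId: Int, recordsWritten: Long): Unit = synchronized {
    if (stageId == currentStage) {
      recordCount += recordsWritten
    } else {
      currentStage = stageId
      recordCount = recordsWritten
    }
  }

  def value: Long = synchronized(recordCount)
}

object StageRecordCounterDemo {
  // Replays (stageId, recordsWritten) events in order and returns the final count.
  def replay(events: Seq[(Int, Long)]): Long = {
    val counter = new StageRecordCounter
    events.foreach { case (stage, n) => counter.onTaskEnd(stage, n) }
    counter.value
  }

  def main(args: Array[String]): Unit = {
    // Both tasks of stage 1 arrive back to back: prints 30
    println(replay(Seq((1, 10L), (1, 20L))))
    // A task from stage 2 arrives in between, so the first 10 is dropped: prints 20
    println(replay(Seq((1, 10L), (2, 5L), (1, 20L))))
  }
}
```

Replaying recorded `(stageId, recordsWritten)` pairs through a model like this is one way to check whether event ordering alone can explain a given accumulator value.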


hiveContext.sql("select * from (select *, 'a' d from tempo_one union all select * from tempo_one join ( select 'b' d) a union all select * from tempo_one join ( select 'b' d) a union all select * from tempo_one join ( select 'b' d) a) b ").registerTempTable("df1") // 180 records

Code used to execute the SQL queries:

val hiveContext = new HiveContext(sparkContext)
hiveContext.sql("select * from df1 where d = 'b'").registerTempTable("df2") //135 records

hiveContext.sql("select * from df1 where d = 'a'").registerTempTable("df3") //45 records

recordAccumulator.value // Value should be 45; however, it is sometimes 34.

I ran this in local mode with Spark version 1.6.0.

Any clues or explanations as to why the accumulator value is off would be appreciated.

Thanks.

Accumulator used in Spark Listener results in incorrect values

0 answers: