I want to get the number of records written by a writeStream operation. For that, I have this code:
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// inputRecords, recordsWritten, bytesWritten and numTasks are mutable
// counters assumed to be defined in the enclosing scope.
spark.sparkContext.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    // In Spark 2.x, inputMetrics/outputMetrics are plain objects rather
    // than Options, so these != None checks are always true.
    if (metrics.inputMetrics != None) {
      inputRecords += metrics.inputMetrics.recordsRead
    }
    if (metrics.outputMetrics != None) {
      println("OUTPUTMETRICS IS NOT NONE")
      recordsWritten += metrics.outputMetrics.recordsWritten
      bytesWritten += metrics.outputMetrics.bytesWritten
    }
    numTasks += 1
    println("recordsWritten = " + recordsWritten)
    println("bytesWritten = " + bytesWritten)
    println("numTasks = " + numTasks)
  }
})
The code enters the blocks, but the values of recordsWritten, bytesWritten and inputRecords are always 0.
Edit: upgraded to 2.3.1 since there was a fix for this. Still 0.
Streaming query made progress: {
  "id" : "9c345af0-042c-4eeb-80db-828c5f69e442",
  "runId" : "d309f7cf-624a-42e5-bb54-dfb4fa939228",
  "name" : "WriteToSource",
  "timestamp" : "2018-07-30T14:20:33.486Z",
  "batchId" : 3,
  "numInputRows" : 3511,
  "inputRowsPerSecond" : 2113.786875376279,
  "processedRowsPerSecond" : 3013.733905579399,
  "durationMs" : {
    "addBatch" : 1044,
    "getBatch" : 29,
    "getOffset" : 23,
    "queryPlanning" : 25,
    "triggerExecution" : 1165,
    "walCommit" : 44
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[proto2-events-identification-carrier]]",
    "startOffset" : {
      "proto2-events-identification-carrier" : {
        "2" : 22400403,
        "1" : 22313194,
        "0" : 22381260
      }
    },
    "endOffset" : {
      "proto2-events-identification-carrier" : {
        "2" : 22403914,
        "1" : 22313194,
        "0" : 22381260
      }
    },
    "numInputRows" : 3511,
    "inputRowsPerSecond" : 2113.786875376279,
    "processedRowsPerSecond" : 3013.733905579399
  } ],
  "sink" : {
    "description" : "org.apache.spark.sql.execution.streaming.ConsoleSinkProvider@1350f304"
  }
}
This is displayed, but I cannot get at it from the code.
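Incidentally, the progress report shown above is also exposed programmatically: a StreamingQueryListener receives a StreamingQueryProgress per batch. A minimal sketch (the println body is only illustrative) that logs numInputRows from code:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // event.progress carries the same data as the JSON above
    val p = event.progress
    println(s"batch ${p.batchId}: numInputRows = ${p.numInputRows}")
  }
})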
Answer (score: 1)
There was a bug in Spark Structured Streaming's FileStreamSink, which has been fixed in version 2.3.1.
As a workaround, you can use accumulators to count the records yourself before writing them to the sink.
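A minimal sketch of that workaround, assuming a Kafka source like the one in the question (the bootstrap servers are a placeholder) and a console sink:

import org.apache.spark.sql.SparkSession
import org.apache.spark.util.LongAccumulator

val spark = SparkSession.builder.appName("CountWrittenRows").getOrCreate()
import spark.implicits._

// Driver-side counter; executors add to it as rows flow past.
val written: LongAccumulator = spark.sparkContext.longAccumulator("recordsWritten")

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder
  .option("subscribe", "proto2-events-identification-carrier")
  .load()
  .selectExpr("CAST(value AS STRING)")
  .as[String]
  .map { v => written.add(1); v }  // count each row just before the sink

val query = events.writeStream
  .format("console")
  .start()

// written.value is readable on the driver, e.g. between batches.
// Caveat: accumulator updates made inside transformations can be
// re-applied on task retries, so this count is an upper bound.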