When I use a StreamingQueryListener to monitor Structured Streaming, I see duplicate events in onQueryProgress:
override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
  if (queryProgress.progress.numInputRows != 0) {
    println("Query made progress: " + queryProgress.progress)
  }
}
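For context, a listener like this is registered on the SparkSession and receives progress events from every streaming query started on that session. A minimal sketch of the full registration (the class name and app name are illustrative, not from the original post):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Hypothetical listener that only logs batches with input rows.
class ProgressListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    if (event.progress.numInputRows != 0) {
      println("Query made progress: " + event.progress)
    }
  }
}

val spark = SparkSession.builder().appName("listener-demo").getOrCreate()
// The listener is session-wide: every query started on `spark` reports here.
spark.streams.addListener(new ProgressListener)
```

Because the listener is session-wide, if two queries are running against the same Kafka topic, each query emits its own progress event for the same message.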
The output is:
Query made progress: {
"id" : "e76a8789-738c-49f6-b7f4-d85356c28600",
"runId" : "d8ce0fad-db38-4566-9198-90169efeb2d8",
"name" : null,
"timestamp" : "2017-08-15T07:28:27.077Z",
"numInputRows" : 1,
"processedRowsPerSecond" : 0.3050640634533252,
"durationMs" : {
"addBatch" : 2452,
"getBatch" : 461,
"queryPlanning" : 276,
"triggerExecution" : 3278
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "KafkaSource[Subscribe[test1]]",
"startOffset" : {
"test1" : {
"0" : 19
}
},
"endOffset" : {
"test1" : {
"0" : 20
}
},
"numInputRows" : 1,
"processedRowsPerSecond" : 0.3050640634533252
} ],
"sink" : {
"description" : "org.apache.spark.sql.execution.streaming.ForeachSink@3ec8a100"
}
}
Query made progress: {
"id" : "a5b1f905-5575-43a7-afe9-dead0e4de2a7",
"runId" : "8caea640-8772-4aab-ab13-84c1e952fb77",
"name" : null,
"timestamp" : "2017-08-15T07:28:27.075Z",
"numInputRows" : 1,
"processedRowsPerSecond" : 0.272108843537415,
"durationMs" : {
"addBatch" : 2844,
"getBatch" : 445,
"queryPlanning" : 293,
"triggerExecution" : 3672
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "KafkaSource[Subscribe[test1]]",
"startOffset" : {
"test1" : {
"0" : 19
}
},
"endOffset" : {
"test1" : {
"0" : 20
}
},
"numInputRows" : 1,
"processedRowsPerSecond" : 0.272108843537415
} ],
"sink" : {
"description" : "org.apache.spark.sql.execution.streaming.ForeachSink@6953f971"
}
}
Why do I send one message but get two different results?
My underlying problem: I need Spark to aggregate data every 5 minutes, e.g. 00:00-00:05, 00:05-00:10, and so on, which gives 288 points per day to calculate.
So my idea is to use Structured Streaming to filter out the specific data, store the filtered results in a database, and next time read the database together with the stream.
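For the 5-minute buckets, Structured Streaming's built-in window function can do the bucketing directly on event time, without a separate filter step. A hedged sketch, assuming a Kafka source on topic test1 and using the record's Kafka timestamp as event time (broker address and column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{window, col, count}

val spark = SparkSession.builder().appName("window-demo").getOrCreate()

// Assumed input: Kafka records; `timestamp` is the Kafka message timestamp.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // illustrative address
  .option("subscribe", "test1")
  .load()
  .selectExpr("CAST(value AS STRING) AS value", "timestamp AS ts")

// Tumbling 5-minute windows: 00:00-00:05, 00:05-00:10, ... (288 per day).
val counts = input
  .groupBy(window(col("ts"), "5 minutes"))
  .agg(count("*").as("cnt"))

// Write each updated window count out; console sink shown for illustration.
val query = counts.writeStream
  .outputMode("update")
  .format("console")
  .start()
```

Each row in `counts` then corresponds to one of the 288 daily windows, which may remove the need to round-trip intermediate results through a database.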
Answer 0 (score: 0)
The two events come from different queries: you can see that the id and runId are different, so you have two streaming queries running, and each processed the same Kafka message.
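If only one query should be monitored, one option is to give that query a name and filter events by it inside the listener. A sketch, assuming the query is started with .queryName("my-query") (the name is illustrative; note that "name" : null in the output above indicates no query name was set):

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener.QueryProgressEvent

override def onQueryProgress(event: QueryProgressEvent): Unit = {
  // `name` matches the value passed to .queryName(...) when starting the query.
  if (event.progress.name == "my-query" && event.progress.numInputRows != 0) {
    println("Query made progress: " + event.progress)
  }
}
```

Alternatively, filtering on event.progress.id works without naming the query, since the id stays stable across restarts of the same checkpointed query.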