Spark onQueryProgress fires twice

Time: 2017-08-15 07:53:19

Tags: scala apache-spark apache-kafka streaming spark-structured-streaming

When I use a StreamingQueryListener to monitor Structured Streaming, I see duplicate events in onQueryProgress:

    override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
      // Only log batches that actually consumed input rows.
      if (queryProgress.progress.numInputRows != 0) {
        println("Query made progress: " + queryProgress.progress)
      }
    }
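
For context, a listener like this has to be registered on the SparkSession, and it then receives events for every streaming query running on that session. A minimal sketch of the surrounding class and registration (the class name MyListener is made up here):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    // Hypothetical wrapper class around the override shown above.
    class MyListener extends StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        if (event.progress.numInputRows != 0) {
          println("Query made progress: " + event.progress)
        }
      }
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
    }

    val spark = SparkSession.builder().appName("listener-demo").getOrCreate()
    // The listener is session-wide: every query started on this
    // session reports its progress here.
    spark.streams.addListener(new MyListener)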

The result is:

Query made progress: {
  "id" : "e76a8789-738c-49f6-b7f4-d85356c28600",
  "runId" : "d8ce0fad-db38-4566-9198-90169efeb2d8",
  "name" : null,
  "timestamp" : "2017-08-15T07:28:27.077Z",
  "numInputRows" : 1,
  "processedRowsPerSecond" : 0.3050640634533252,
  "durationMs" : {
    "addBatch" : 2452,
    "getBatch" : 461,
    "queryPlanning" : 276,
    "triggerExecution" : 3278
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[test1]]",
    "startOffset" : {
      "test1" : {
        "0" : 19
      }
    },
    "endOffset" : {
      "test1" : {
        "0" : 20
      }
    },
    "numInputRows" : 1,
    "processedRowsPerSecond" : 0.3050640634533252
  } ],
  "sink" : {
    "description" : "org.apache.spark.sql.execution.streaming.ForeachSink@3ec8a100"
  }
}
Query made progress: {
  "id" : "a5b1f905-5575-43a7-afe9-dead0e4de2a7",
  "runId" : "8caea640-8772-4aab-ab13-84c1e952fb77",
  "name" : null,
  "timestamp" : "2017-08-15T07:28:27.075Z",
  "numInputRows" : 1,
  "processedRowsPerSecond" : 0.272108843537415,
  "durationMs" : {
    "addBatch" : 2844,
    "getBatch" : 445,
    "queryPlanning" : 293,
    "triggerExecution" : 3672
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[test1]]",
    "startOffset" : {
      "test1" : {
        "0" : 19
      }
    },
    "endOffset" : {
      "test1" : {
        "0" : 20
      }
    },
    "numInputRows" : 1,
    "processedRowsPerSecond" : 0.272108843537415
  } ],
  "sink" : {
    "description" : "org.apache.spark.sql.execution.streaming.ForeachSink@6953f971"
  }
}

Why do I get two different results after sending just one message?

  1. My main requirement is to use Spark to compute data every 5 minutes, e.g. 00:00-00:05, 00:05-00:10, and so on; one day has 288 windows to compute (see the sketch after this list).

  2. So my idea is to use Structured Streaming to filter for the specific data, store the data that falls outside the filter into a database, and next time read the database together with the Structured Streaming results.

  3. Therefore I should listen on every batch so I can update the time at which I read the database.
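
For step 1, the five-minute buckets are usually expressed as an event-time window aggregation rather than bookkeeping in the listener. A minimal sketch, assuming messages come from the test1 topic seen in the logs and carry their Kafka timestamp (the bootstrap server address and the parsing step are assumptions):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("five-minute-windows").getOrCreate()
    import spark.implicits._

    // Read from the same topic that appears in the progress logs.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed address
      .option("subscribe", "test1")
      .load()

    // Use the Kafka message timestamp as event time; real parsing of
    // the value column is application-specific.
    val events = raw.selectExpr("CAST(value AS STRING) AS value", "timestamp AS ts")

    // 288 five-minute buckets per day: 00:00-00:05, 00:05-00:10, ...
    val counts = events
      .withWatermark("ts", "10 minutes")
      .groupBy(window($"ts", "5 minutes"))
      .count()

    val query = counts.writeStream
      .outputMode("update")
      .format("console")
      .start()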

1 Answer:

Answer 0 (score: 0)

These two events come from different queries; you can see that the id and the runId are different.
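
Since the listener is attached to the whole session, one way to tell the events apart is to give each query a distinct queryName when starting it; the name then shows up in the progress event instead of null. A sketch with made-up query names and a rate source standing in for real data:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    val spark = SparkSession.builder().appName("named-queries").getOrCreate()

    // Print the query name with each event so the streams are distinguishable.
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryProgress(event: QueryProgressEvent): Unit =
        if (event.progress.numInputRows != 0)
          println(s"[${event.progress.name}] " + event.progress)
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
    })

    // "name" was null in the logs above because no queryName was set.
    val rate = spark.readStream.format("rate").load()
    val q1 = rate.writeStream.queryName("query-a").format("console").start()
    val q2 = rate.writeStream.queryName("query-b").format("console").start()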