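I am trying to follow this Flink guide [1] to handle errors in ElasticSearchSink by re-adding the failed messages to the queue.

The error scenarios I have run into and want to retry are: (i) a conflict in the document version of an UpdateRequest, and (ii) a lost connection to ElasticSearch. Neither error is expected to persist: (i) should be resolved by changing the version, and (ii) should go away after a few seconds.

What I expect is that the failed messages are retried successfully.

What I actually get is this: Flink seems to get stuck on the (first) retry, my flow queues up (backpressure is 1 everywhere), and all processing hangs.

Here is my error handling code: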
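import org.apache.flink.streaming.connectors.elasticsearch.{ActionRequestFailureHandler, RequestIndexer}
import org.apache.flink.util.ExceptionUtils
import org.elasticsearch.action.ActionRequest
import org.elasticsearch.action.index.IndexRequest
import org.elasticsearch.action.update.UpdateRequest

// LOG is the logger defined in the enclosing sink class.
private object MyElasticSearchFailureHandler extends ActionRequestFailureHandler {
  override def onFailure(actionRequest: ActionRequest, failure: Throwable, restStatusCode: Int, indexer: RequestIndexer): Unit = {
    if (ExceptionUtils.findThrowableWithMessage(failure, "version_conflict_engine_exception").isPresent) {
      // Case (i): version conflict. Bump the version and re-queue the update.
      actionRequest match {
        case s: UpdateRequest =>
          LOG.warn(s"Failed inserting record to ElasticSearch due to version conflict (${s.version()}). Retrying")
          LOG.warn(actionRequest.toString)
          indexer.add(s.version(s.version() + 1))
        case _ =>
          LOG.error("Failed inserting record to ElasticSearch due to version conflict. However, this is not an Update-Request. Don't know why.")
          LOG.error(actionRequest.toString)
          throw failure
      }
    } else if (restStatusCode == -1 && Option(failure.getMessage).exists(_.contains("Connection closed"))) {
      // Case (ii): connection lost. Re-queue the request unchanged.
      LOG.warn(s"Retrying record: ${actionRequest.toString}")
      actionRequest match {
        case s: UpdateRequest => indexer.add(s)
        case s: IndexRequest => indexer.add(s)
        // Any other request type is not retried (avoids a MatchError).
        case _ => throw failure
      }
    } else {
      // Anything else is treated as fatal.
      LOG.error(s"ELASTICSEARCH FAILED:\n statusCode $restStatusCode\n message: ${failure.getMessage}\n${failure.getStackTrace.mkString("\n")}")
      LOG.error(s" DATA:\n ${actionRequest.toString}")
      throw failure
    }
  }
}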
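The retry rules in the handler above can also be restated as a small pure function, which makes them easy to check without a running Flink job. This is only a sketch: `Decision`, `FailureClassifier` and `classify` are hypothetical names invented for illustration, not part of the Flink or Elasticsearch APIs. Unlike the handler, it inspects a single message string rather than walking the exception's cause chain.

```scala
// Hypothetical types for illustration only; not part of Flink or Elasticsearch.
sealed trait Decision
case object RetryWithBumpedVersion extends Decision // case (i): version conflict on an update
case object RetryAsIs extends Decision              // case (ii): connection closed
case object Fail extends Decision                   // everything else

object FailureClassifier {
  // `message` may be null because Throwable.getMessage can return null.
  def classify(message: String, restStatusCode: Int, isUpdate: Boolean): Decision = {
    val msg = Option(message).getOrElse("")
    if (msg.contains("version_conflict_engine_exception")) {
      // Only update requests are retried on a version conflict.
      if (isUpdate) RetryWithBumpedVersion else Fail
    } else if (restStatusCode == -1 && msg.contains("Connection closed")) {
      RetryAsIs
    } else {
      Fail
    }
  }
}
```

For example, `FailureClassifier.classify("Connection closed", -1, isUpdate = true)` yields `RetryAsIs`, which corresponds to the "Retrying record" path in the handler.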
Here is an excerpt from my task manager logs:

2019-02-09 04:12:35.676 [I/O dispatcher 25] ERROR o.a.f.s.connectors.elasticsearch.ElasticsearchSinkBase - Failed Elasticsearch bulk request: Connection closed
2019-02-09 04:12:35.678 [I/O dispatcher 25] WARN c.n.c.....sink.MyElasticSearchSink$ - Retrying record: update {[idx-20190208][_doc][doc_id_1549622700000], doc_as_upsert[true], doc[index {[null][null][null], source[{...}]}], scripted_upsert[false], detect_noop[true]}
2019-02-09 04:12:54.242 [Sink: S3 - Historical (1/4)] INFO o.a.flink.streaming.api.functions.sink.filesystem.Buckets - Subtask 0 checkpointing for checkpoint with id=24 (max part counter=26).
And the job manager logs:

2019-02-09 03:59:37.880 [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 23 for job 1a1438ca23387c4ef9a59ff9da6dafa1 (430392865 bytes in 307078 ms).
2019-02-09 04:09:30.970 [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 24 @ 1549685370776 for job 1a1438ca23387c4ef9a59ff9da6dafa1.
2019-02-09 04:17:00.970 [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 24 of job 1a1438ca23387c4ef9a59ff9da6dafa1 expired before completing.
2019-02-09 04:24:31.035 [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 25 @ 1549686270776 for job 1a1438ca23387c4ef9a59ff9da6dafa1.
2019-02-09 04:32:01.035 [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 25 of job 1a1438ca23387c4ef9a59ff9da6dafa1 expired before completing.
2019-02-09 04:39:30.961 [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 26 @ 1549687170776 for job 1a1438ca23387c4ef9a59ff9da6dafa1.
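The timestamps in the job manager log above can be checked with a little arithmetic. The sketch below (the object and method names are made up for illustration) computes how long checkpoints 24 and 25 ran before expiring and how far apart the trigger timestamps are. Reading the results as the job's checkpoint timeout and checkpoint interval is an inference from the log, not something the log states.

```scala
import java.time.{Duration, LocalDateTime}
import java.time.format.DateTimeFormatter

// Hypothetical helper for reading the log timestamps; not part of Flink.
object CheckpointTimings {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")

  // Whole seconds between two log timestamps in the format used above.
  def secondsBetween(start: String, end: String): Long =
    Duration.between(LocalDateTime.parse(start, fmt), LocalDateTime.parse(end, fmt)).getSeconds

  // Gap in milliseconds between two "@ <epoch millis>" trigger timestamps.
  def triggerGapMillis(earlier: Long, later: Long): Long = later - earlier
}

object CheckpointTimingsDemo extends App {
  // Checkpoint 24: triggered 04:09:30.970, expired 04:17:00.970.
  println(CheckpointTimings.secondsBetween("2019-02-09 04:09:30.970", "2019-02-09 04:17:00.970"))
  // Checkpoints 24 and 25: trigger timestamps taken from the log lines.
  println(CheckpointTimings.triggerGapMillis(1549685370776L, 1549686270776L))
}
```

Both expired checkpoints ran for 450 seconds (7.5 minutes), and consecutive triggers are 900000 ms (15 minutes) apart. Checkpoint 24 was triggered at 04:09:30, about three minutes before the "Connection closed" error at 04:12:35 in the task manager log, and no checkpoint completes after that point in the excerpt.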
Thanks and best regards,
Averell