Flink ElasticSearch Sink - Error Handling

Time: 2019-02-09 07:31:01

Tags: elasticsearch error-handling apache-flink

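I am trying to follow this Flink guide [1] to handle errors in ElasticSearchSink by re-adding the failed messages to the queue.

The error scenarios I want to retry are: (i) a conflict in the UpdateRequest document version, and (ii) a lost connection to ElasticSearch. These errors are not expected to persist: (i) should be resolved by changing the version, and (ii) should go away after a few seconds.

What I expect: the messages are retried successfully.

What I actually get: Flink seems to get stuck on the (first) retry, my flow queues up (backpressure is 1 everywhere), and all processing hangs.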

Here is my error-handling code:

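import org.apache.flink.streaming.connectors.elasticsearch.{ActionRequestFailureHandler, RequestIndexer}
import org.apache.flink.util.ExceptionUtils
import org.elasticsearch.action.ActionRequest
import org.elasticsearch.action.index.IndexRequest
import org.elasticsearch.action.update.UpdateRequest

// LOG is assumed to be a logger defined in the enclosing scope.
private object MyElasticSearchFailureHandler extends ActionRequestFailureHandler {
    override def onFailure(actionRequest: ActionRequest, failure: Throwable, restStatusCode: Int, indexer: RequestIndexer): Unit = {
        if (ExceptionUtils.findThrowableWithMessage(failure, "version_conflict_engine_exception").isPresent) {
            // Scenario (i): version conflict. Retry update requests with the next version.
            actionRequest match {
                case s: UpdateRequest =>
                    LOG.warn(s"Failed inserting record to ElasticSearch due to version conflict (${s.version()}). Retrying")
                    LOG.warn(actionRequest.toString)
                    indexer.add(s.version(s.version() + 1))
                case _ =>
                    LOG.error("Failed inserting record to ElasticSearch due to version conflict. However, this is not an Update-Request. Don't know why.")
                    LOG.error(actionRequest.toString)
                    throw failure
            }
        } else if (restStatusCode == -1 && Option(failure.getMessage).exists(_.contains("Connection closed"))) {
            // Scenario (ii): connection lost. Re-add the request unchanged.
            LOG.warn(s"Retrying record: ${actionRequest.toString}")
            actionRequest match {
                case s: UpdateRequest => indexer.add(s)
                case s: IndexRequest => indexer.add(s)
                case _ => throw failure // other request types are not retried (avoids a MatchError)
            }
        } else {
            // Anything else is treated as fatal.
            LOG.error(s"ELASTICSEARCH FAILED:\n    statusCode $restStatusCode\n    message: ${failure.getMessage}\n${failure.getStackTrace.mkString("\n")}")
            LOG.error(s"    DATA:\n    ${actionRequest.toString}")
            throw failure
        }
    }
}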
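To make the retry rules above easier to check in isolation, the decision could be pulled out of the handler into a pure function. The following is only a minimal sketch under that assumption: `RetryDecision`, `FailureClassifier`, and `classify` are hypothetical names that are not part of Flink or Elasticsearch, and the cause chain is walked by hand instead of with `ExceptionUtils` so that the snippet needs no dependencies.

```scala
// Hypothetical helper types; nothing here comes from Flink or Elasticsearch.
sealed trait RetryDecision
case object RetryWithNextVersion extends RetryDecision // scenario (i): version conflict on an update
case object RetryUnchanged extends RetryDecision       // scenario (ii): connection closed
case object Fail extends RetryDecision                 // anything else: rethrow

object FailureClassifier {
  // Collects the non-null messages of the failure and all of its causes.
  private def messages(t: Throwable): List[String] =
    Iterator.iterate(t)(_.getCause)
      .takeWhile(_ != null)
      .flatMap(e => Option(e.getMessage))
      .toList

  // Mirrors the three branches of the handler: version conflict,
  // closed connection (status code -1), and everything else.
  def classify(failure: Throwable, restStatusCode: Int, isUpdate: Boolean): RetryDecision =
    if (messages(failure).exists(_.contains("version_conflict_engine_exception"))) {
      if (isUpdate) RetryWithNextVersion else Fail
    } else if (restStatusCode == -1 &&
               Option(failure.getMessage).exists(_.contains("Connection closed"))) {
      RetryUnchanged
    } else {
      Fail
    }
}
```

With such a helper, `onFailure` would only call `indexer.add` for the two retry outcomes and rethrow on `Fail`. This does not change the hanging behaviour described in the question; it only makes the classification testable without a running cluster.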

Here is an excerpt from my TaskManager log:

2019-02-09 04:12:35.676 [I/O dispatcher 25] ERROR o.a.f.s.connectors.elasticsearch.ElasticsearchSinkBase - Failed Elasticsearch bulk request: Connection closed
2019-02-09 04:12:35.678 [I/O dispatcher 25] WARN  cnc.....sink.MyElasticSearchSink$ - Retrying record: update {[idx-20190208][_doc][doc_id_1549622700000], doc_as_upsert[true], doc[index {[null][null][null], source[{...}]}], scripted_upsert[false], detect_noop[true]}
2019-02-09 04:12:54.242 [Sink: S3 - Historical (1/4)] INFO  o.a.flink.streaming.api.functions.sink.filesystem.Buckets - Subtask 0 checkpointing for checkpoint with id=24 (max part counter=26).

And the JobManager log:

2019-02-09 03:59:37.880 [flink-akka.actor.default-dispatcher-4] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 23 for job 1a1438ca23387c4ef9a59ff9da6dafa1 (430392865 bytes in 307078 ms).
2019-02-09 04:09:30.970 [Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 24 @ 1549685370776 for job 1a1438ca23387c4ef9a59ff9da6dafa1.
2019-02-09 04:17:00.970 [Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 24 of job 1a1438ca23387c4ef9a59ff9da6dafa1 expired before completing.
2019-02-09 04:24:31.035 [Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 25 @ 1549686270776 for job 1a1438ca23387c4ef9a59ff9da6dafa1.
2019-02-09 04:32:01.035 [Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 25 of job 1a1438ca23387c4ef9a59ff9da6dafa1 expired before completing.
2019-02-09 04:39:30.961 [Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 26 @ 1549687170776 for job 1a1438ca23387c4ef9a59ff9da6dafa1.
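
Lining up the timestamps from the two logs: checkpoints 24 and 25 each expire exactly 450 seconds after being triggered, and checkpoint 24 expires roughly 265 seconds after the retry was logged. This suggests a checkpoint timeout of about 7.5 minutes, although the configured value is not shown here. A small sketch of that arithmetic, using only `java.time` (`CheckpointTimings` and `between` are hypothetical names):

```scala
import java.time.{Duration, LocalDateTime}
import java.time.format.DateTimeFormatter

object CheckpointTimings {
  // Timestamp layout used by both log excerpts.
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")

  // Elapsed time between two log timestamps.
  def between(from: String, to: String): Duration =
    Duration.between(LocalDateTime.parse(from, fmt), LocalDateTime.parse(to, fmt))

  def main(args: Array[String]): Unit = {
    // Checkpoint 24: triggered, then expired.
    println(between("2019-02-09 04:09:30.970", "2019-02-09 04:17:00.970").getSeconds)
    // Checkpoint 25: triggered, then expired.
    println(between("2019-02-09 04:24:31.035", "2019-02-09 04:32:01.035").getSeconds)
    // Retry logged by the failure handler, then checkpoint 24 expired.
    println(between("2019-02-09 04:12:35.678", "2019-02-09 04:17:00.970").toMillis)
  }
}
```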

Thanks and regards,
Averell

[1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/elasticsearch.html#handling-failing-elasticsearch-requests

0 answers:

No answers