Flink loses its leader and crashes

Time: 2018-02-09 13:29:26

Tags: java apache akka apache-zookeeper apache-flink

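I am running a stream processing application in a LocalStreamEnvironment (an embedded Flink cluster). I successfully processed a particular data set with my code. Yesterday, after making some modifications to the processing logic, I wanted to rerun the application, but about 3/4 of the way through the data the Flink cluster seemed to crash for no apparent reason. See the condensed log below; my comments are inserted in angle brackets <>: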

2018-02-09 12:04:05,146 [INFO] from a.b.l.f.MultiS3FileSource in Source: General source (1/1) - inserting 266574 events
2018-02-09 12:10:55,094 [ERROR] from o.a.f.r.c.JobSubmissionClientActor in flink-akka.actor.default-dispatcher-11020 - class org.apache.flink.runtime.client.JobSubmissionClientActor received unknown message: 
2018-02-09 12:10:55,245 [WARN] from o.a.f.r.c.JobSubmissionClientActor in flink-akka.actor.default-dispatcher-11019 - Discard message LeaderSessionMessage(7240d925-8573-44e8-996c-fa4658ab0463,02/09/2018 12:10:55 Process -> Detection(7/8) switched to CANCELED ) because there is currently no valid leader id known.
2018-02-09 12:10:55,268 [WARN] from o.a.f.r.c.JobSubmissionClientActor in flink-akka.actor.default-dispatcher-11019 - Discard message LeaderSessionMessage(7240d925-8573-44e8-996c-fa4658ab0463,02/09/2018 12:10:55 Enrichment-> Flat Map(7/8) switched to CANCELED ) because there is currently no valid leader id known.
... <similar messages for all the processing steps>
2018-02-09 12:10:55,509 [ERROR] from o.a.f.s.r.t.StreamTask in PartialAggregations-> Sink: CassandraSink (1/8) - Error during disposal of stream operator.
java.lang.InterruptedException: null <because its interrupting a future>
... <for all of my sinks - these are custom, not the flink cassandra connectors>

The first message comes from my source, which reads data from S3 and collects it into Flink.

After that, the first error is produced at: https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/client/JobSubmissionClientActor.java#L137

and the warnings are generated by: https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/akka/FlinkUntypedActor.java#L115

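The last error occurs in my code, but it is caused by Flink trying to tear down the job, so it should not be the cause of the crash.

I can provide additional information, but I am not sure what is relevant.

The first error seems to be what sets off the whole cascade. How can JobSubmissionClientActor have a null getLeaderSessionID? What kind of messages can reach JobSubmissionClientActor when Flink is running embedded? As far as I can tell, all the messages it can receive are about job submission. Can that even happen in embedded mode? How can I prevent this crash?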

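Update: I think I misread the error log. When I ran the job again, I got a slightly different order of events. In the previous run I only saw the stream processing errors, with no apparent reason for the stream to end, because the final error was apparently not written to my log file (although it was printed to stdout). That error is shown below; the errors preceding it are similar to those from the previous run (stream processing errors).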

[error] Exception in thread "main" org.apache.flink.runtime.client.JobExecutionException: JobClientActor seems to have died before the JobExecutionResult could be retrieved.
[error]         at org.apache.flink.runtime.client.JobClient.awaitJobResult(JobClient.java:285)
[error]         at org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:387)
[error]         at org.apache.flink.runtime.minicluster.FlinkMiniCluster.submitJobAndWait(FlinkMiniCluster.scala:565)
[error]         at org.apache.flink.runtime.minicluster.FlinkMiniCluster.submitJobAndWait(FlinkMiniCluster.scala:539)
[error]         at org.apache.flink.streaming.api.environment.LocalStreamEnvironment.execute(LocalStreamEnvironment.java:108)
[error]         at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1501)
[error]         at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:629)
[error]         at a.b.l.flink.FlinkIngestPrototype$.run(FlinkIngestPrototype.scala:90)
[error]         at a.b.l.flink.FlinkIngestPrototype$.main(FlinkIngestPrototype.scala:43)
[error]         at a.b.l.flink.FlinkIngestPrototype.main(FlinkIngestPrototype.scala)
[error] Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
[error]         at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
[error]         at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
[error]         at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
[error]         at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
[error]         at scala.concurrent.Await$.result(package.scala:190)
[error]         at scala.concurrent.Await.result(package.scala)
[error]         at org.apache.flink.runtime.client.JobClient.awaitJobResult(JobClient.java:273)
[error]         ... 9 more

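I have traced the execution failure to the following:

  1. The JobClient object checks whether the job client actor has finished, and if it has not, it simply pings the actor to see whether it is still alive. The liveness ping is here: https://github.com/apache/flink/blob/62a777bc8ddfb4e34d7beaf7091a90b0bcc70c51/flink-runtime/src/main/java/org/apache/flink/runtime/client/JobClient.java#L273

  2. This ping times out and a poison pill is sent to the job actor, which causes all the different processing errors.

  3. I have had problems with futures before: they were interrupted non-deterministically by much shorter timeouts. I debugged that to some extent, and I think it was caused by some very long GC pauses (or something similar). This illustrates how the timeouts line up with the GC pauses: https://imgur.com/a/9mMvN. I think the same thing could be causing this timeout. Here is my GC configuration: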

    "-XX:-UseParallelGC",
    "-XX:-UseConcMarkSweepGC",
    "-XX:+UseG1GC",
    

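    According to most sources, this should result in very short GC pauses (under a second). Does anyone have experience with very long GC pauses in Flink? Could this somehow be related to the hardware? I am running the application on an AWS EC2 instance.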
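One way to check the GC-pause hypothesis from inside the JVM is to read the collector counters that the standard `java.lang.management` API exposes. The following is only a minimal sketch: the class name `GcPauseProbe` and the allocation loop are made up for illustration, and `getCollectionTime()` reports approximate cumulative collection time, not the length of individual pauses, so it can indicate that GC is expensive but cannot prove that a particular timeout was caused by it.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flink: sums the JVM's own GC counters.
public class GcPauseProbe {

    /** Approximate cumulative time (ms) all collectors have spent collecting. */
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Cumulative number of collections across all collectors. */
    public static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long c = gc.getCollectionCount(); // -1 when the collector does not report it
            if (c > 0) {
                total += c;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long millisBefore = totalGcMillis();
        long countBefore = totalGcCount();

        // Stand-in workload: allocate short-lived garbage so the collector has work to do.
        List<byte[]> junk = new ArrayList<>();
        for (int i = 0; i < 20_000; i++) {
            junk.add(new byte[64 * 1024]);
            if (junk.size() > 100) {
                junk.clear();
            }
        }

        System.out.println("collections during workload: " + (totalGcCount() - countBefore));
        System.out.println("GC time during workload (ms): " + (totalGcMillis() - millisBefore));
    }
}
```

Sampling these two counters periodically from a background thread and logging the deltas next to the application's timeout errors would show whether the two actually coincide. For the length of individual pauses, the JVM's own GC log is the more precise source.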

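1 Answer:

Answer 0 (score: 1)

As you said, this is a GC pause problem. What I have tried in order to solve this kind of issue is:

  1. Lowering the job's memory requirements
  2. Increasing the memory available to the system
  3. Increasing the heartbeat timeout so that it does not crash after a long pause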
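The reasoning behind the third point can be illustrated with a small stand-alone simulation. This is not Flink or Akka code: `PingTimeoutDemo` and `pingSurvives` are hypothetical names, and the responder's `Thread.sleep` merely stands in for a stop-the-world pause. In Flink itself the relevant settings would be Akka timeout options such as `akka.ask.timeout` or `akka.watch.heartbeat.pause`; whether any of them governs the 10000 ms ping in the stack trace above is an assumption that would need to be checked against the Flink version in use.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical simulation: a "ping" only succeeds if its timeout outlasts the pause.
public class PingTimeoutDemo {

    /**
     * Sends a ping to a responder that is stalled for pauseMillis (standing in
     * for a long GC pause) and waits at most timeoutMillis for the reply.
     * Returns true if the reply arrived in time.
     */
    public static boolean pingSurvives(long pauseMillis, long timeoutMillis) {
        ExecutorService responder = Executors.newSingleThreadExecutor();
        try {
            Future<String> pong = responder.submit(() -> {
                Thread.sleep(pauseMillis); // simulated pause before the reply
                return "alive";
            });
            pong.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            return false; // the caller would now treat the responder as dead
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            responder.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("pause 300 ms, timeout 50 ms   -> " + pingSurvives(300, 50));
        System.out.println("pause 50 ms,  timeout 2000 ms -> " + pingSurvives(50, 2000));
    }
}
```

The responder is healthy in both calls; only the relationship between pause length and timeout decides whether it is declared dead. Raising the timeout therefore hides long pauses but does not shorten them, which is why the first two points (less memory pressure, more available memory) address the cause itself.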