Flink无法访问领导者jobmanager并且无法开始工作

时间:2017-03-10 09:52:21

标签: apache-zookeeper apache-flink

我们的Flink集群有两名工作管理员。最近,每当jobmanager领导者切换时,工作通常都会失败,并且flink在切换后无法恢复之前的工作。当我重新启动flink集群时,作业也无法自动启动。所以我必须手动启动这项工作。根据日志,似乎只要选出新的职位领导者,就会拒绝与新领导人的联系,这导致未能开始这项工作。在我们的jobmanager服务器上,我找不到一个活动端口58088打开。我想知道flink和zookeeper之间的谈话是否有问题。我们正在使用flink-1.0.3。 可能的原因是什么?它是一个flink bug吗?感谢。

记录:

2017-03-09 15:32:41,211 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
2017-03-09 15:32:41,243 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader reachable under akka.tcp://flink@172.27.163.235:58088/user/jobmanager:36e428ba-0af3-4e39-90d4-106b7779f94a.
2017-03-09 15:32:41,318 WARN  Remoting                                                      - Tried to associate with unreachable remote address [akka.tcp://flink@172.27.163.235:58088]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /172.27.163.235:58088
2017-03-09 15:32:41,325 WARN  org.apache.flink.runtime.webmonitor.JobManagerRetriever       - Failed to retrieve leader gateway and port.
akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://flink@172.27.163.235:58088/), Path(/user/jobmanager)]
    at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
    at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
    at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
    at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
    at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
    at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
    at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267)
    at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:508)
    at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:541)
    at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:531)
    at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:87)
    at akka.remote.EndpointWriter.postStop(Endpoint.scala:561)
    at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)
    at akka.remote.EndpointActor.aroundPostStop(Endpoint.scala:415)
    at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
    at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
    at akka.actor.ActorCell.terminate(ActorCell.scala:369)
    at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
    at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
    at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-03-09 15:32:48,029 INFO  org.apache.flink.runtime.jobmanager.JobManager                - JobManager akka.tcp://flink@172.27.163.227:36876/user/jobmanager was granted leadership with leader session ID Some(ff50dc37-048e-4d95-93f5-df788c06725c).
2017-03-09 15:32:48,037 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Delaying recovery of all jobs by 10000 milliseconds.
2017-03-09 15:32:48,038 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader reachable under akka.tcp://flink@172.27.163.227:36876/user/jobmanager:ff50dc37-048e-4d95-93f5-df788c06725c.
2017-03-09 15:32:49,038 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at app87 (akka.tcp://flink@172.27.165.66:32781/user/taskmanager) as e3a9364cd8deeebf8a757d3979c5ae55. Current number of registered hosts is 1. Current number of alive task slots is 4.
2017-03-09 15:32:49,044 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at app83 (akka.tcp://flink@172.27.165.58:40972/user/taskmanager) as 0c980a4c64189d975aa71cb97b1ecb7c. Current number of registered hosts is 2. Current number of alive task slots is 8.
2017-03-09 15:32:49,427 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at app27 (akka.tcp://flink@172.27.164.5:50762/user/taskmanager) as 7dcc90275bd63cbcda8361bfe00cb6e8. Current number of registered hosts is 3. Current number of alive task slots is 12.
2017-03-09 15:32:49,676 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at app26 (akka.tcp://flink@172.27.163.245:41734/user/taskmanager) as ab62098118261dcaa2d218ea17aa8117. Current number of registered hosts is 4. Current number of alive task slots is 16.
2017-03-09 15:32:49,916 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at app84 (akka.tcp://flink@172.27.165.60:53871/user/taskmanager) as 012f186f437e7ba95111ff61d206dae6. Current number of registered hosts is 5. Current number of alive task slots is 20.
2017-03-09 15:32:49,930 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at app85 (akka.tcp://flink@172.27.165.62:50068/user/taskmanager) as 68506e37647dfbff11ae193f20a7b624. Current number of registered hosts is 6. Current number of alive task slots is 24.
2017-03-09 15:32:50,658 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at app86 (akka.tcp://flink@172.27.165.64:57339/user/taskmanager) as c1e922599fae53e6edc78a2add4edb61. Current number of registered hosts is 7. Current number of alive task slots is 28.
2017-03-09 15:32:50,780 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at app25 (akka.tcp://flink@172.27.163.241:45878/user/taskmanager) as 3ee2f5d3cb8003df5531d444bd11890c. Current number of registered hosts is 8. Current number of alive task slots is 32.
2017-03-09 15:32:58,054 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Attempting to recover all jobs.
2017-03-09 15:32:58,083 ERROR org.apache.flink.runtime.jobmanager.JobManager                - Fatal error: Failed to recover jobs.
java.io.FileNotFoundException: /apps/flink/recovery/submittedJobGraphb6357063f81b (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:52)
    at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:143)
    at org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:51)
    at org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:35)
    at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraphs(ZooKeeperSubmittedJobGraphStore.java:173)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2$$anonfun$apply$mcV$sp$2.apply$mcV$sp(JobManager.scala:433)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2$$anonfun$apply$mcV$sp$2.apply(JobManager.scala:429)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2$$anonfun$apply$mcV$sp$2.apply(JobManager.scala:429)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2.apply$mcV$sp(JobManager.scala:429)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2.apply(JobManager.scala:425)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2.apply(JobManager.scala:425)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

0 个答案:

没有答案