Rolling restart of zetcd causes Flink processes to terminate

Time: 2019-12-12 00:57:53

Tags: apache-flink apache-curator

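I am running zetcd and Flink in containers on AWS Fargate. The zetcd cluster has three nodes. The deployment strategy replaces one node at a time to maintain quorum. Deploying to the zetcd cluster causes the Flink processes to die because they can no longer connect to ZooKeeper.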

I observe the following:

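  • Starting condition: a healthy three-node zetcd cluster and a healthy Flink cluster.
  • When the first zetcd node is deployed, any Flink instances that happen to be talking to that particular zetcd node may lose their ZooKeeper connection, but they recover by connecting to one of the other healthy zetcd nodes.
  • When the second zetcd node is deployed, the same thing happens. In addition, I observe that Flink never attempts to connect to the newly provisioned zetcd node.
  • After the last zetcd node is deployed, Flink cannot re-establish a connection to zetcd, and the Flink processes terminate.
  • After all Flink nodes are re-provisioned, the system returns to a healthy state.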

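I believe Flink caches the addresses of the zetcd nodes at startup and is unaware when those nodes are replaced. Once all of the initial zetcd nodes have been replaced, Flink can no longer connect to ZooKeeper and dies.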

Flink uses Apache Curator; perhaps this is an artifact of how Curator manages its connections to ZooKeeper?
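To make the suspected behaviour concrete, here is a minimal sketch of the difference between a client that resolves the service name once at startup and one that re-resolves it on every attempt. This is not Flink, Curator, or ZooKeeper code: the class names, the injected resolver function, and the IP addresses are all hypothetical, and only the host name zetcd-service.local comes from my configuration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class ResolveOnceSketch {

    /** Hypothetical client that resolves the service name once, at construction. */
    static final class CachingEndpoints {
        private final List<String> addresses;

        CachingEndpoints(String host, Function<String, List<String>> resolver) {
            // Copy the result so later DNS changes are never seen.
            this.addresses = new ArrayList<>(resolver.apply(host));
        }

        List<String> current() {
            return addresses;
        }
    }

    /** Hypothetical client that resolves the service name on every call. */
    static final class ReResolvingEndpoints {
        private final String host;
        private final Function<String, List<String>> resolver;

        ReResolvingEndpoints(String host, Function<String, List<String>> resolver) {
            this.host = host;
            this.resolver = resolver;
        }

        List<String> current() {
            return new ArrayList<>(resolver.apply(host));
        }
    }

    /** True if at least one address the client knows about is still live. */
    static boolean canConnect(List<String> known, List<String> live) {
        for (String address : known) {
            if (live.contains(address)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in for DNS: the live addresses behind the service name (made up).
        List<String> dns = new ArrayList<>(Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3"));
        Function<String, List<String>> resolver = host -> dns;

        CachingEndpoints cached = new CachingEndpoints("zetcd-service.local", resolver);
        ReResolvingEndpoints fresh = new ReResolvingEndpoints("zetcd-service.local", resolver);

        // Rolling deployment: each node is replaced by one with a new address.
        for (int i = 0; i < dns.size(); i++) {
            dns.set(i, "10.0.1." + (i + 1));
            System.out.println("after replacing node " + (i + 1)
                    + ": cached=" + canConnect(cached.current(), dns)
                    + ", re-resolving=" + canConnect(fresh.current(), dns));
        }
    }
}
```

In this sketch the caching client stays connectable only while at least one of the original three addresses is still live, which matches what I see: Flink survives the first two replacements and dies on the third.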

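I would greatly appreciate any guidance on how to keep Flink up to date with the current list of zetcd nodes, or on whether I am completely off base to begin with :)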


Relevant flink-conf.yaml

high-availability: zookeeper
high-availability.zookeeper.quorum: zetcd-service.local:2181
high-availability.storageDir: s3://flink-state/ha
high-availability.jobmanager.port: 6123

Flink loses its connection to ZK and attempts to reconnect.

00:42:07.788 [main-SendThread(ip-10-0-59-233.us-west-2.compute.internal:2181)] INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x79526ef2595a9606, likely server has closed socket, closing socket connection and attempting reconnect
00:42:07.888 [main-EventThread] INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
00:42:07.888 [Curator-ConnectionStateManager-0] WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@10.0.38.41:6123/user/dispatcher no longer participates in the leader election.
00:42:07.888 [Curator-ConnectionStateManager-0] WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
00:42:07.888 [Curator-ConnectionStateManager-0] WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
00:42:07.888 [Curator-ConnectionStateManager-0] WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender http://10.0.38.41:8081 no longer participates in the leader election.
00:42:07.888 [Curator-ConnectionStateManager-0] WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@10.0.38.41:6123/user/resourcemanager no longer participates in the leader election.
00:42:07.889 [Curator-PathChildrenCache-0] DEBUG org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Received CONNECTION_SUSPENDED event
00:42:07.889 [Curator-PathChildrenCache-0] WARN  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not monitored (temporarily).
00:42:08.820 [main-SendThread(ip-10-0-160-244.us-west-2.compute.internal:2181)] INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server ip-10-0-160-244.us-west-2.compute.internal/10.0.160.244:2181

Flink fails to connect to any ZK node and dies.

00:42:22.892 [Curator-Framework-0] ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Connection timed out for connection string (zetcd-service.local:2181) and timeout (15000) / elapsed (15004)
org.apache.flink.shaded.curator.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.flink.shaded.curator.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225) [flink-dist_2.11-1.8.1.jar:1.8.1]
    at org.apache.flink.shaded.curator.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94) [flink-dist_2.11-1.8.1.jar:1.8.1]
    at org.apache.flink.shaded.curator.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117) [flink-dist_2.11-1.8.1.jar:1.8.1]
    at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:835) [flink-dist_2.11-1.8.1.jar:1.8.1]
    at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [flink-dist_2.11-1.8.1.jar:1.8.1]
    at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [flink-dist_2.11-1.8.1.jar:1.8.1]
    at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [flink-dist_2.11-1.8.1.jar:1.8.1]

0 answers:

No answers yet