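I am running zetcd and Flink in containers on AWS Fargate. The zetcd cluster has three nodes. The deployment strategy replaces one node at a time in order to maintain quorum. Deploying to the zetcd cluster causes the Flink processes to die because they cannot connect to ZooKeeper.
I've observed the following (log excerpts are further down):
My guess is that Flink caches the zetcd nodes at startup and is never made aware when those nodes are replaced. Once all of the original zetcd nodes have been replaced, Flink can no longer connect to ZooKeeper and dies.
Flink uses Apache Curator; perhaps this is an artifact of how Curator manages its connections to ZooKeeper?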
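To make the hypothesis concrete, here is a minimal Python sketch of the behaviour I suspect. It is a toy model only, not Flink, Curator, or ZooKeeper code: the class names, the `resolve` callable standing in for a DNS lookup of zetcd-service.local, and the IP addresses are all made up for illustration.

```python
from typing import Callable, List

Resolver = Callable[[], List[str]]


class CachingClient:
    """Resolves the service name once, at construction, and never again."""

    def __init__(self, resolve: Resolver) -> None:
        self.hosts = list(resolve())  # snapshot of the node list at startup

    def can_connect(self, live: List[str]) -> bool:
        # Succeeds only if at least one cached host is still running.
        return any(host in live for host in self.hosts)


class ReResolvingClient(CachingClient):
    """Falls back to a fresh lookup when every cached host is gone."""

    def __init__(self, resolve: Resolver) -> None:
        super().__init__(resolve)
        self._resolve = resolve

    def can_connect(self, live: List[str]) -> bool:
        if super().can_connect(live):
            return True
        self.hosts = list(self._resolve())  # refresh the cached node list
        return super().can_connect(live)


# Hypothetical addresses; this list plays the role of the DNS record.
cluster = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
resolve = lambda: cluster

caching = CachingClient(resolve)
re_resolving = ReResolvingClient(resolve)

caching_results = []
re_resolving_results = []
# Rolling deployment: replace one node per step.
for index, new_ip in enumerate(["10.0.1.1", "10.0.1.2", "10.0.1.3"]):
    cluster[index] = new_ip
    caching_results.append(caching.can_connect(cluster))
    re_resolving_results.append(re_resolving.can_connect(cluster))

print(caching_results)       # [True, True, False]
print(re_resolving_results)  # [True, True, True]
```

In this model the caching client survives the first two replacements and fails only once the last original node is gone, which is the pattern I seem to be seeing. Whether the real client behaves like `CachingClient` is exactly what I am unsure about.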
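I'd greatly appreciate any guidance on how to keep Flink up to date with the current list of zetcd nodes, or a correction if I'm completely off base to begin with :)
Relevant flink-conf.yaml: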
high-availability: zookeeper
high-availability.zookeeper.quorum: zetcd-service.local:2181
high-availability.storageDir: s3://flink-state/ha
high-availability.jobmanager.port: 6123
Flink loses its connection to ZK and tries to reconnect.
00:42:07.788 [main-SendThread(ip-10-0-59-233.us-west-2.compute.internal:2181)] INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x79526ef2595a9606, likely server has closed socket, closing socket connection and attempting reconnect
00:42:07.888 [main-EventThread] INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
00:42:07.888 [Curator-ConnectionStateManager-0] WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@10.0.38.41:6123/user/dispatcher no longer participates in the leader election.
00:42:07.888 [Curator-ConnectionStateManager-0] WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
00:42:07.888 [Curator-ConnectionStateManager-0] WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
00:42:07.888 [Curator-ConnectionStateManager-0] WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender http://10.0.38.41:8081 no longer participates in the leader election.
00:42:07.888 [Curator-ConnectionStateManager-0] WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@10.0.38.41:6123/user/resourcemanager no longer participates in the leader election.
00:42:07.889 [Curator-PathChildrenCache-0] DEBUG org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Received CONNECTION_SUSPENDED event
00:42:07.889 [Curator-PathChildrenCache-0] WARN org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not monitored (temporarily).
00:42:08.820 [main-SendThread(ip-10-0-160-244.us-west-2.compute.internal:2181)] INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server ip-10-0-160-244.us-west-2.compute.internal/10.0.160.244:2181
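As a sanity check on the timing, the gap between the SUSPENDED state change above (00:42:07.888) and the connection timeout error in the next excerpt (00:42:22.892) can be computed directly from the two timestamps. This is only arithmetic on the log lines, and the variable names are my own:

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S.%f"

# Timestamps copied from the log excerpts in this post.
suspended = datetime.strptime("00:42:07.888", FMT)
timed_out = datetime.strptime("00:42:22.892", FMT)

# Integer division by one millisecond avoids floating-point rounding.
elapsed_ms = (timed_out - suspended) // timedelta(milliseconds=1)
print(elapsed_ms)  # 15004
```

The result is consistent with the "timeout (15000) / elapsed (15004)" values in the error message, so it looks like Flink gave up after a single connection timeout period without ever reaching a live node.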
Flink fails to connect to a ZK node and dies.
00:42:22.892 [Curator-Framework-0] ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Connection timed out for connection string (zetcd-service.local:2181) and timeout (15000) / elapsed (15004)
org.apache.flink.shaded.curator.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.flink.shaded.curator.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225) [flink-dist_2.11-1.8.1.jar:1.8.1]
at org.apache.flink.shaded.curator.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94) [flink-dist_2.11-1.8.1.jar:1.8.1]
at org.apache.flink.shaded.curator.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117) [flink-dist_2.11-1.8.1.jar:1.8.1]
at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:835) [flink-dist_2.11-1.8.1.jar:1.8.1]
at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [flink-dist_2.11-1.8.1.jar:1.8.1]
at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [flink-dist_2.11-1.8.1.jar:1.8.1]
at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [flink-dist_2.11-1.8.1.jar:1.8.1]