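Our process runs on a Linux system with almost a terabyte of RAM and no swap enabled.
What happens is that our process freezes for a while, for a reason I cannot figure out, so ZooKeeper expires our session. The process then comes back to life, and the logs show no events being triggered.
We have faced a similar situation before, but back then the connection loss and session expired events were triggered once our process came back to life, so we could handle it by recreating the process's associated ephemeral node on ZooKeeper. We believe that case was caused by a full GC cycle.
What is new now is that the process freezes, but no events are triggered after it resumes! As a result, we have no way to detect that our session has expired.
I am thinking of simply watching our ephemeral node for deletion and recreating it when that happens. But I wonder whether this is the right choice, since I still do not know why the process freezes in the first place.
Increasing the session timeout is not an option, because it is already too high for us. And we are trying to handle session timeouts anyway.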
So my question is simple:
Edit: After increasing ZooKeeper's logging verbosity, I found something very interesting:
DEBUG: [07:05:57] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:06:31] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:07:04] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:07:37] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:08:11] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:08:44] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:09:17] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:09:51] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:10:24] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:10:57] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:11:31] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:12:04] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:12:38] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:13:11] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:13:44] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
DEBUG: [07:14:18] [demo | HA | Manager] Got ping response for sessionid: 0x3000da76fa904b6 after 0ms [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
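If you look closely, you will see that the time difference between consecutive log entries is about 33 seconds. When I run it on my own machine, the log message appears about once per second. Could this be caused by network latency?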
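To check the interval across the whole log rather than by eye, something like the following could be used. It is only a rough sketch: the `ping_gaps` helper and the `TIMESTAMP` pattern are names made up for this post, and it assumes every line starts with `DEBUG: [HH:MM:SS]` as in the excerpt above and that the log does not cross midnight.

```python
import re
from datetime import datetime

# Matches the "[HH:MM:SS]" timestamp at the start of each DEBUG line.
# Assumption: the log format is exactly the one shown in the excerpt.
TIMESTAMP = re.compile(r"^DEBUG: \[(\d{2}:\d{2}:\d{2})\]")


def ping_gaps(lines):
    """Return the gaps in whole seconds between consecutive timestamped lines."""
    times = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%H:%M:%S"))
    # Pair each timestamp with the next one; midnight rollover is not handled.
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


# The first four timestamps from the excerpt, with the rest of each line
# copied from the log.
TEMPLATE = (
    "DEBUG: [{}] [demo | HA | Manager] Got ping response for sessionid: "
    "0x3000da76fa904b6 after 0ms "
    "[org.apache.zookeeper.ClientCnxn$SendThread.readResponse]"
)
sample = [TEMPLATE.format(t) for t in ("07:05:57", "07:06:31", "07:07:04", "07:07:37")]

print(ping_gaps(sample))  # [34, 33, 33]
```

A gap well above the usual value would point at the moment the process stalled, which can then be matched against GC logs or system metrics for the same timestamp.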
Edit
Running the mntr command returned the following stats:
zk_version 3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03, built on 06/29/2018 04:05 GMT
zk_avg_latency 0
zk_max_latency 17657
zk_min_latency 0
zk_packets_received 1427134
zk_packets_sent 1596974
zk_num_alive_connections 64
zk_outstanding_requests 0
zk_server_state follower
zk_znode_count 1394
zk_watch_count 592
zk_ephemerals_count 192
zk_approximate_data_size 181257
zk_open_file_descriptor_count 94
zk_max_file_descriptor_count 1048576
zk_fsync_threshold_exceed_count 1
I find the zk_max_latency value very high. What kind of latency is this, and how can I debug what causes it?
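To see when zk_max_latency jumps, one option would be to poll mntr periodically and record the values alongside the client logs. Below is a minimal sketch of that idea. `parse_mntr` and `fetch_mntr` are hypothetical helper names, and the code assumes that each output line is a key followed by whitespace and a value (as in the output above), that the server listens on the default client port 2181, and that it closes the connection after answering the four-letter command.

```python
import socket


def parse_mntr(text):
    """Parse mntr output into a dict, converting numeric values to int."""
    stats = {}
    for line in text.splitlines():
        # Split on the first run of whitespace only: zk_version contains spaces.
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0], parts[1].strip()
        stats[key] = int(value) if value.lstrip("-").isdigit() else value
    return stats


def fetch_mntr(host, port=2181, timeout=5.0):
    """Send the 'mntr' four-letter command and return the raw reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"mntr")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # assumed: the server closes the socket when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


# A few lines copied from the output above, used here instead of a live server.
sample = """zk_version 3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03, built on 06/29/2018 04:05 GMT
zk_avg_latency 0
zk_max_latency 17657
zk_server_state follower
zk_fsync_threshold_exceed_count 1
"""

stats = parse_mntr(sample)
print(stats["zk_max_latency"])  # 17657
```

Against a live server this would be called as `parse_mntr(fetch_mntr("zk-host"))`, where `zk-host` is a placeholder. Logging the result every few seconds should show at what time the maximum changes, which narrows down what to look for in the server logs.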