Question

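I am trying to upgrade a JGroups cluster from version 3.2.9 to 4.0.8. After the upgrade I am facing two issues:

Entries added to the JGROUPSPING table are not cleared even when a node leaves the cluster (terminated with kill -9). As a result, the number of entries in the table grows every time we restart for maintenance. The extra entries also make the cluster take a very long time to come up.
With more entries in the JGROUPSPING table, the cluster does not form correctly, which results in partitions, and no merge takes place either. Because the nodes do not know about each other, this causes functional problems. I am also seeing intermittent FLUSH timeouts on a few nodes.

The TCP stack details are: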

<TCP_NIO2 bind_port="7800" recv_buf_size="${tcp.recv_buf_size:20M}" send_buf_size="${tcp.send_buf_size:640K}" max_bundle_size="64K" sock_conn_timeout="300" thread_pool.enabled="false" thread_pool.min_threads="0" thread_pool.max_threads="0" thread_pool.keep_alive_time="5000"/>

<JDBC_PING connection_username="root" connection_driver="com.mysql.jdbc.Driver" connection_password="" connection_url="jdbc:mysql://localhost/db" />

<MERGE3 min_interval="5000" max_interval="10000"/>

<FD_SOCK suspect_msg_interval="10000" start_port="7900" port_range="10"/>

<FD timeout="20000" max_tries="3" />

<VERIFY_SUSPECT timeout="15000" num_msgs="3"/>

<pbcast.NAKACK2 use_mcast_xmit="false"
                xmit_interval="1000"
                log_not_found_msgs="true"
                discard_delivered_msgs="true"/>

<UNICAST3 log_not_found_msgs="true"  xmit_interval="1000"/>

<pbcast.STABLE stability_delay="100" desired_avg_gossip="60000" max_bytes="5M"/>

<pbcast.GMS print_local_addr="true"
            join_timeout="5000"
            leave_timeout="1000"/>

<FRAG2 frag_size="60K"  />

Answer 1

<pbcast.GMS print_local_addr="true"
                merge_timeout="15000"
                join_timeout="5000"
                max_join_attempts="5"
                leave_timeout="1000"/>

<pbcast.FLUSH start_flush_timeout="5000"/>

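A view merge happens in two steps: a flush operation, followed by collecting merge responses from all members of each candidate view.

So the actual time a merge can take is the sum of the timeouts of these two operations.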

actual merge time = start_flush_timeout + merge_response_timeout
// merge_response_timeout is calculated internally:
merge_response_timeout = merge_timeout / 2

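Therefore, start_flush_timeout must also be taken into account when configuring the merge timeout.

For example, my configuration was:

merge_timeout = 15000 and start_flush_timeout = 10000
So the actual merge operation could take up to 10000 + (15000/2) = 17500 ms, and it timed out after the 15000 ms configured with merge_timeout.

Changing my configuration to merge_timeout = 15000 and start_flush_timeout = 5000 brought the total down to 5000 + (15000/2) = 12500 ms, which is less than the 15000 ms limit, and that solved my problem.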
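The arithmetic above can be checked with a small script. This is only a sketch of the formula as described in this answer (flush timeout plus half of merge_timeout); it has not been verified against the JGroups source, and the function names worst_case_merge_ms and fits_in_merge_timeout are hypothetical helpers, not part of any JGroups API.

```python
def worst_case_merge_ms(start_flush_timeout: int, merge_timeout: int) -> int:
    """Worst-case merge duration in ms, per the formula stated in this answer."""
    # Assumption taken from the answer: the merge response phase gets merge_timeout / 2.
    merge_response_timeout = merge_timeout // 2
    return start_flush_timeout + merge_response_timeout


def fits_in_merge_timeout(start_flush_timeout: int, merge_timeout: int) -> bool:
    """True if the worst-case merge duration stays below the configured merge_timeout."""
    return worst_case_merge_ms(start_flush_timeout, merge_timeout) < merge_timeout


if __name__ == "__main__":
    merge_timeout = 15000
    # Compare the original (10000) and the changed (5000) start_flush_timeout.
    for start_flush_timeout in (10000, 5000):
        total = worst_case_merge_ms(start_flush_timeout, merge_timeout)
        ok = fits_in_merge_timeout(start_flush_timeout, merge_timeout)
        print(f"start_flush_timeout={start_flush_timeout}: worst case {total} ms, fits={ok}")
```

With merge_timeout = 15000, the first configuration gives 17500 ms (does not fit) and the second gives 12500 ms (fits), matching the numbers above.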
