org.infinispan.util.concurrent.TimeoutException: Replication timeout for "node-name"

Date: 2017-08-22 07:44:38

Tags: infinispan, jgroups

We have three services that must run in a cluster, so we use Infinispan to cluster the nodes and share data between those services. After a successful restart I sometimes get the exception below, and my other nodes receive a "view changed" event, even though all the nodes are actually running. I cannot figure out the cause.

I am using Infinispan 8.1.3 with a distributed cache and jgroups-3.4.

org.infinispan.util.concurrent.TimeoutException: Replication timeout for sipproxy-16964
            at org.infinispan.remoting.transport.jgroups.JGroupsTransport.checkRsp(JGroupsTransport.java:765)
            at org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$invokeRemotelyAsync$80(JGroupsTransport.java:599)
            at org.infinispan.remoting.transport.jgroups.JGroupsTransport$$Lambda$9/1547262581.apply(Unknown Source)
            at java.util.concurrent.CompletableFuture$ThenApply.run(CompletableFuture.java:717)
            at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:193)
            at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2345)
            at org.infinispan.remoting.transport.jgroups.SingleResponseFuture.call(SingleResponseFuture.java:46)
            at org.infinispan.remoting.transport.jgroups.SingleResponseFuture.call(SingleResponseFuture.java:17)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    2017-08-22 04:44:52,902 INFO  [JGroupsTransport] (ViewHandler,ISPN,transport_manager-48870) ISPN000094: Received new cluster view for channel ISPN: [transport_manager-48870|3] (2) [transport_manager-48870, mediaproxy-47178]
    2017-08-22 04:44:52,949 WARN  [PreferAvailabilityStrategy] (transport-thread-transport_manager-p4-t24) ISPN000313: Cache mediaProxyResponseCache lost data because of abrupt leavers [sipproxy-16964]
    2017-08-22 04:44:52,951 WARN  [ClusterTopologyManagerImpl] (transport-thread-transport_manager-p4-t24) ISPN000197: Error updating cluster member list
    java.lang.IllegalArgumentException: There must be at least one node with a non-zero capacity factor
            at org.infinispan.distribution.ch.impl.DefaultConsistentHashFactory.checkCapacityFactors(DefaultConsistentHashFactory.java:57)
            at org.infinispan.distribution.ch.impl.DefaultConsistentHashFactory.updateMembers(DefaultConsistentHashFactory.java:74)
            at org.infinispan.distribution.ch.impl.DefaultConsistentHashFactory.updateMembers(DefaultConsistentHashFactory.java:26)
            at org.infinispan.topology.ClusterCacheStatus.updateCurrentTopology(ClusterCacheStatus.java:431)
            at org.infinispan.partitionhandling.impl.PreferAvailabilityStrategy.onClusterViewChange(PreferAvailabilityStrategy.java:56)
            at org.infinispan.topology.ClusterCacheStatus.doHandleClusterView(ClusterCacheStatus.java:337)
            at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheMembers(ClusterTopologyManagerImpl.java:397)
            at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:314)
            at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener$1.run(ClusterTopologyManagerImpl.java:571)
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)

jgroups.xml:

<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.4.xsd">
    <TCP bind_addr="131.10.20.16"
         bind_port="8010" port_range="10"
         recv_buf_size="20000000"
         send_buf_size="640000"
         loopback="false"
         max_bundle_size="64k"
         bundler_type="old"
         enable_diagnostics="true"
         thread_naming_pattern="cl"
         timer_type="new"
         timer.min_threads="4"
         timer.max_threads="30"
         timer.keep_alive_time="3000"
         timer.queue_max_size="100"
         timer.wheel_size="200"
         timer.tick_time="50"
         thread_pool.enabled="true"
         thread_pool.min_threads="2"
         thread_pool.max_threads="30"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="true"
         thread_pool.queue_max_size="100"
         thread_pool.rejection_policy="discard"

         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="2"
         oob_thread_pool.max_threads="30"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="100"
         oob_thread_pool.rejection_policy="discard"/>
        <TCPPING initial_hosts="131.10.20.16[8010],131.10.20.17[8010],131.10.20.182[8010]" port_range="2"
         timeout="3000" num_initial_members="3" />

    <MERGE3 max_interval="30000"
            min_interval="10000"/>

    <FD_SOCK/>
    <FD_ALL interval="3000" timeout="10000" />
    <VERIFY_SUSPECT timeout="500"  />
    <BARRIER />
    <pbcast.NAKACK use_mcast_xmit="false"
                   retransmit_timeout="100,300,600,1200"
                   discard_delivered_msgs="true" />
    <UNICAST3 conn_expiry_timeout="0"/>

    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="10m"/>
    <pbcast.GMS print_local_addr="true" join_timeout="5000"
                max_bundling_time="30"
                view_bundling="true"/>
    <UFC max_credits="2M"
         min_threshold="0.4"/>
    <MFC max_credits="2M"
         min_threshold="0.4"/>
    <FRAG2 frag_size="60000"  />
    <pbcast.STATE_TRANSFER/>
</config>

2 Answers:

Answer 0 (score: 3)

A TimeoutException only means that a response to an RPC was not received within the timeout, nothing more. That can happen when a server is under heavy load, but it is probably not the case here: the log you posted shows that the node was "suspected", meaning it most likely failed to respond for more than 10 seconds (the limit set in your configuration, see FD_ALL).
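For context (this snippet is not from the original answer): the "Replication timeout" in the exception is Infinispan's own RPC timeout, configured per cache and independent of the JGroups failure-detection settings. A minimal sketch, assuming the Infinispan 8.x XML schema and using the cache name from the log above; 15000 is an illustrative value:

    <!-- Sketch only: remote-timeout is the time (ms) Infinispan waits for remote replies
         before throwing the "Replication timeout" exception. -->
    <distributed-cache name="mediaProxyResponseCache" mode="SYNC" remote-timeout="15000"/>

Raising it only hides the symptom if a peer is genuinely unresponsive for longer than FD_ALL's 10-second limit.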

First check the logs on that server for errors, and check its GC logs for any stop-the-world pauses.
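One way to produce those GC logs on a Java 8 JVM (hedged example; the log path and service jar are placeholders, the flags are standard HotSpot options):

    # Enable GC and safepoint-pause logging for each node
    java -Xloggc:/var/log/myservice/gc.log \
         -XX:+PrintGCDetails \
         -XX:+PrintGCDateStamps \
         -XX:+PrintGCApplicationStoppedTime \
         -jar your-service.jar   # placeholder for however the node is actually started

-XX:+PrintGCApplicationStoppedTime records all safepoint pauses, not just GC, which is exactly the kind of stall FD_ALL reacts to.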

Answer 1 (score: 2)

As @flavius suggested, the root cause is that one of your nodes stalled for some reason and failed to reply to the RPC in time.

I suggest raising the JGroups log level so you can see why the node was suspected (that suspicion is raised by the FD_SOCK or FD_ALL protocols) and why it was removed from the view (which most likely happens via the VERIFY_SUSPECT protocol).
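How you raise that log level depends on your logging backend; as an illustration (not from the original answer), with Logback it could look like the following, using the protocol class names mentioned above:

    <!-- logback.xml sketch; with Log4j the equivalent logger entries apply -->
    <logger name="org.jgroups.protocols.FD_SOCK" level="TRACE"/>
    <logger name="org.jgroups.protocols.FD_ALL" level="TRACE"/>
    <logger name="org.jgroups.protocols.VERIFY_SUSPECT" level="TRACE"/>
    <logger name="org.jgroups.protocols.pbcast.GMS" level="DEBUG"/>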

You should also check why that happened. In most cases it is caused by a long GC pause, but the VM can be paused for other reasons as well, for example by the host it runs on. I suggest running jHiccup on both machines and attaching it as a Java agent to your processes; that way you can tell whether the pause was caused by a JVM stop-the-world event or by the operating system.
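As a sketch (paths and the start command are placeholders, not from the original answer), jHiccup is attached as a Java agent at JVM startup:

    # Record hiccups (JVM stop-the-world and OS stalls) alongside each service
    java -javaagent:/path/to/jHiccup.jar \
         -jar your-infinispan-service.jar   # placeholder start command

Comparing the jHiccup log around the time of the view change with the FD_ALL suspicion should show whether the stall happened inside the JVM or at the OS level.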