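Unstable topology between Ignite nodes connected through a LAN switch

Time: 2018-04-25 15:15:03

Tags: ignite

I am setting up an Apache Ignite cluster and am having trouble keeping the topology alive once more than two nodes, connected through a LAN switch, have joined. The logs report many warnings and problems, but what are the right first steps for isolating the issue? Ping works in both directions, and the nodes do connect after 30 seconds or 1 minute, but they also lose each other frequently. Sometimes a third node trying to join causes the whole cluster to fail.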

[20:41:34,761][WARNING][tcp-disco-msg-worker-#2][TcpDiscoverySpi] Node is out of topology (probably, due to short-time network problems).
[20:41:34,761][INFO][tcp-disco-sock-reader-#28][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/192.168.10.161:34361, rmtPort=34361
[20:41:34,762][WARNING][disco-event-worker-#161][GridDiscoveryManager] Local node SEGMENTED: TcpDiscoveryNode [id=dd44ea86-5302-47a0-b3c0-86acdcf7e771, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 192.168.10.162], sockAddrs=[/172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, node_2/192.168.10.162:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1524656494760, loc=true, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
[20:41:34,764][INFO][tcp-disco-sock-reader-#14][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/192.168.10.1:55641, rmtPort=55641
[20:41:34,766][WARNING][disco-event-worker-#161][GridDiscoveryManager] Stopping local node according to configured segmentation policy.
[20:41:34,767][WARNING][disco-event-worker-#161][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=379eb246-e111-4510-a3f6-09554667d769, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 192.168.10.161], sockAddrs=[/172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.10.161:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1524656073909, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
[20:41:34,768][INFO][disco-event-worker-#161][GridDiscoveryManager] Topology snapshot [ver=6, servers=2, clients=0, CPUs=60, heap=2.0GB]
[20:41:34,770][WARNING][disco-event-worker-#161][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=dd64661b-0679-4a14-9440-d876e5c35bd5, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 192.168.0.4, 192.168.10.3], sockAddrs=[/192.168.0.4:47500, /172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.10.3:47500], discPort=47500, order=5, intOrder=4, lastExchangeTime=1524656176508, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false]
[20:41:34,770][INFO][disco-event-worker-#161][GridDiscoveryManager] Topology snapshot [ver=7, servers=1, clients=0, CPUs=56, heap=1.0GB]
[20:41:34,771][INFO][Thread-3][GridTcpRestProtocol] Command protocol successfully stopped: TCP binary
[20:41:34,774][INFO][disco-event-worker-#161][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: null]
[20:41:34,774][INFO][Thread-3][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: null]
[20:41:34,774][INFO][disco-event-worker-#161][GridDhtPartitionsExchangeFuture] Coordinator received all messages, try merge [ver=AffinityTopologyVersion [topVer=5, minorTopVer=0]]
[20:41:34,774][INFO][Thread-3][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], resVer=null, err=class org.apache.ignite.internal.IgniteInterruptedCheckedException: Node is stopping: null]
[20:41:34,774][INFO][disco-event-worker-#161][GridCachePartitionExchangeManager] Merge exchange future [curFut=AffinityTopologyVersion [topVer=5, minorTopVer=0], mergedFut=AffinityTopologyVersion [topVer=6, minorTopVer=0], evt=NODE_FAILED, evtNode=379eb246-e111-4510-a3f6-09554667d769, evtNodeClient=false]
[20:41:34,774][INFO][disco-event-worker-#161][GridCachePartitionExchangeManager] Merge exchange future [curFut=AffinityTopologyVersion [topVer=5, minorTopVer=0], mergedFut=AffinityTopologyVersion [topVer=7, minorTopVer=0], evt=NODE_FAILED, evtNode=dd64661b-0679-4a14-9440-d876e5c35bd5, evtNodeClient=false]
[20:41:34,774][INFO][disco-event-worker-#161][GridDhtPartitionsExchangeFuture] finishExchangeOnCoordinator [topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=7, minorTopVer=0]]
[20:41:34,787][INFO][Thread-3][GridCacheProcessor] Stopped cache [cacheName=ignite-sys-cache]
[20:41:34,803][INFO][Thread-3][IgniteKernal] 

>>> +---------------------------------------------------------------------------------+
>>> Ignite ver. 2.3.0#20171028-sha1:8add7fd5b501b40658096cdde48af9e948aa8150 stopped OK
>>> +---------------------------------------------------------------------------------+
>>> Grid uptime: 00:07:08.412


[root@node_2 apache-ignite-fabric-2.3.0-bin]# packet_write_wait: Connection to 192.168.10.162 port 22: Broken pipe

On one of the other nodes, something like this shows up after a while:

[22:45:54,026][SEVERE][grid-nio-worker-tcp-comm-6-#127][TcpCommunicationSpi] Failed to process selector key [ses=GridSelectorNioSessionImpl [worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=6, bytesRcvd=1578, bytesSent=5266, bytesRcvd0=0, bytesSent0=0, select=true, super=GridWorker [name=grid-nio-worker-tcp-comm-6, igniteInstanceName=null, finished=false, hashCode=733187042, interrupted=false, runner=grid-nio-worker-tcp-comm-6-#127]]], writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], inRecovery=GridNioRecoveryDescriptor [acked=4, resendCnt=0, rcvCnt=4, sentCnt=4, reserved=true, lastAck=4, nodeLeft=false, node=TcpDiscoveryNode [id=dd64661b-0679-4a14-9440-d876e5c35bd5, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 192.168.0.4, 192.168.10.3], sockAddrs=[/192.168.0.4:47500, /172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.10.3:47500], discPort=47500, order=8, intOrder=5, lastExchangeTime=1524656494855, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], connected=true, connectCnt=0, queueLimit=4096, reserveCnt=1, pairedConnections=false], outRecovery=GridNioRecoveryDescriptor [acked=4, resendCnt=0, rcvCnt=4, sentCnt=4, reserved=true, lastAck=4, nodeLeft=false, node=TcpDiscoveryNode [id=dd64661b-0679-4a14-9440-d876e5c35bd5, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 192.168.0.4, 192.168.10.3], sockAddrs=[/192.168.0.4:47500, /172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.10.3:47500], discPort=47500, order=8, intOrder=5, lastExchangeTime=1524656494855, loc=false, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], connected=true, connectCnt=0, queueLimit=4096, reserveCnt=1, pairedConnections=false], super=GridNioSessionImpl [locAddr=/192.168.10.161:47100, rmtAddr=/192.168.10.1:47884, createTime=1524656504308, closeTime=0, bytesSent=5266, bytesRcvd=1578, bytesSent0=0, bytesRcvd0=0, sndSchedTime=1524663359458, 
lastSndTime=1524656672249, lastRcvTime=1524663359458, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter [parser=o.a.i.i.util.nio.GridDirectParser@32244b13, directMode=true], GridConnectionBytesVerifyFilter], accepted=true]]]
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1233)
    at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2272)
    at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2048)
    at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1717)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
    at java.lang.Thread.run(Thread.java:748)
[22:45:54,027][WARNING][grid-nio-worker-tcp-comm-6-#127][TcpCommunicationSpi] Closing NIO session because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException, msg=Connection reset by peer]
[22:46:41,002][INFO][grid-timeout-worker-#119][IgniteKernal] 

Any idea where I should start looking for the cause?

Thanks!

1 answer:

Answer 0: (score: 3)

As the warning says:

[20:41:34,761][WARNING][tcp-disco-msg-worker-#2][TcpDiscoverySpi] Node is out of topology (probably, due to short-time network problems).

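The cause is probably a network problem. Ping may work (though I would check its failure rate over a long enough interval, say 10-15 minutes), but also try a long-running TCP connection (with netcat or something similar).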
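If netcat is not available on the nodes, the long-running TCP check can be sketched in plain Java. This is a minimal sketch, not part of Ignite: the class name `LongLivedTcpProbe`, its methods, and the 900-ping, 1-second, 5-second-timeout values are all assumptions chosen for illustration. One side echoes bytes back, the other sends a small ping over a single connection at a fixed interval and stops at the first error or timeout.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

/** Hypothetical helper (not an Ignite class): keeps one TCP connection open and pings over it. */
final class LongLivedTcpProbe {

    /** Accepts a single connection and echoes every byte back until the peer disconnects. */
    static void serveEcho(ServerSocket server) throws IOException {
        try (Socket peer = server.accept()) {
            InputStream in = peer.getInputStream();
            OutputStream out = peer.getOutputStream();
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
                out.flush();
            }
        }
    }

    /**
     * Sends up to {@code rounds} one-byte pings over one connection, {@code intervalMs} apart.
     * Returns how many pings were echoed back correctly before the first error or timeout.
     */
    static int probe(String host, int port, int rounds, long intervalMs, int timeoutMs) {
        int ok = 0;
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs); // a reply slower than this is treated as a failure
            InputStream in = socket.getInputStream();
            OutputStream out = socket.getOutputStream();
            for (int i = 0; i < rounds; i++) {
                int ping = i & 0x7F;
                out.write(ping);
                out.flush();
                if (in.read() != ping) {
                    break; // connection closed or wrong byte echoed
                }
                ok++;
                Thread.sleep(intervalMs);
            }
        } catch (IOException e) {
            System.err.println("Connection broke after " + ok + " pings: " + e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return ok;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2 && args[0].equals("server")) {
            try (ServerSocket server = new ServerSocket(Integer.parseInt(args[1]))) {
                serveEcho(server);
            }
        } else if (args.length == 3 && args[0].equals("client")) {
            // 900 pings, one per second: roughly 15 minutes on a single connection.
            int ok = probe(args[1], Integer.parseInt(args[2]), 900, 1000L, 5000);
            System.out.println(ok + " of 900 pings echoed");
        } else {
            System.err.println("usage: server <port> | client <host> <port>");
        }
    }
}
```

Run the server on one node and the client on another, using a free port of your choice. If the count regularly falls short of the requested number, the connection itself is being dropped or stalled, which points at the switch, cabling, or NIC rather than at Ignite.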

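Another possible cause is high load on a node. For example, if a node goes into a stop-the-world GC pause and cannot respond for a long time, it may also be kicked out of the cluster.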
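One rough way to check for such stalls, without reading GC logs, is to sleep in short steps and record how much longer each sleep took than requested. The sketch below assumes a hypothetical `PauseDetector` class; it is not an Ignite facility, and the overshoot it reports covers any stall (GC, swapping, CPU starvation), so the JVM's GC logs remain the authoritative source for GC pauses specifically.

```java
/** Hypothetical helper (not an Ignite class): estimates process stalls from sleep overshoot. */
final class PauseDetector {

    /** Returns how far a measured sleep overshot the requested duration, never negative. */
    static long overshootMs(long requestedMs, long elapsedNanos) {
        long elapsedMs = elapsedNanos / 1_000_000L;
        return Math.max(0L, elapsedMs - requestedMs);
    }

    /** Sleeps in steps of {@code stepMs} for about {@code durationMs}; returns the worst overshoot. */
    static long worstPauseMs(long stepMs, long durationMs) {
        long worst = 0L;
        long deadline = System.nanoTime() + durationMs * 1_000_000L;
        try {
            while (System.nanoTime() - deadline < 0) {
                long start = System.nanoTime();
                Thread.sleep(stepMs);
                worst = Math.max(worst, overshootMs(stepMs, System.nanoTime() - start));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return worst;
    }

    public static void main(String[] args) {
        // Illustrative values: 50 ms steps, sampled for 10 seconds.
        System.out.println("Worst stall observed: " + worstPauseMs(50L, 10_000L) + " ms");
    }
}
```

To be meaningful, the loop has to run inside the same JVM as the Ignite node, for example on a background thread. A worst stall that approaches the node's failure detection timeout would be consistent with the node being dropped for unresponsiveness.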

To make the cluster more tolerant of short network and responsiveness problems, try increasing the IgniteConfiguration.failureDetectionTimeout setting.
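A minimal sketch of that setting in Java follows. It needs `ignite-core` on the classpath. The class name `TolerantNodeStartup` and the 30-second value are assumptions for illustration only, not recommendations. The default, as far as I know, is 10,000 ms.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

/** Hypothetical startup class: starts a node with a longer failure detection timeout. */
final class TolerantNodeStartup {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Illustrative value (30 s). A larger timeout tolerates longer network or GC
        // stalls, but it also delays detection of nodes that have really failed.
        cfg.setFailureDetectionTimeout(30_000L);

        Ignite ignite = Ignition.start(cfg);
        System.out.println("Nodes in topology: " + ignite.cluster().nodes().size());
    }
}
```

If the nodes are started from a Spring XML file instead, the same value can be set as the `failureDetectionTimeout` property on the `IgniteConfiguration` bean. Use the same value on every node so that all of them apply the same failure criteria.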