我正在使用Ignite.NET 2.7.6。一台服务器和大约40个客户端进行配置。工作8小时后,服务器开始表现异常:客户端无法连接它,某些查询没有结果,等等。
在服务器方面,内存消耗正常,线程数量约为250,并且一切正常。我看不到任何问题,因此我决定解决服务器端所有标记为“严重”的问题。
我遇到的第一个是:
已检测到系统关键线程已阻塞。这可能导致群集范围内的未定义行为[threadName = tcp-comm-worker,blockedFor = 13s]
所以我想了解这种情况发生的原因。 完整的服务器日志可以在这里找到:
https://yadi.sk/d/LF03Vz5vz4tRcw
https://yadi.sk/d/MMe0xrgI3k6lkA
已添加: 这个问题似乎并不是无害的,该消息从各个线程每秒出现一次,“ blockedFor”值从几秒钟增加到几小时。
服务器上的负载很低,但是随着服务器线程被锁定,它将停止响应并注册新客户端。
以下是来自服务器的日志:
https://yadi.sk/d/tc3g2hb9B0jtvg
https://yadi.sk/d/05YrlYXcp4xPqg
这是来自一个客户端的日志:
https://yadi.sk/d/bcbQ7ee4PUzq2w
重新启动服务器时,客户端日志的最后几行位于19:03:52。
答案 0 :(得分:0)
我看到了以下.NET特定的异常,但是它应该由另一个问题触发。无论如何,这个是reported to the community。
class org.apache.ignite.IgniteException: Platform error:System.NullReferenceException: Ññûëêà íà îáúåêò íå óêàçûâàåò íà ýêçåìïëÿð îáúåêòà.
â Apache.Ignite.Core.Impl.Unmanaged.UnmanagedCallbacks.CacheEntryFilterApply(Int64 memPtr)
â Apache.Ignite.Core.Impl.Unmanaged.UnmanagedCallbacks.InLongOutLong(Int32 type, Int64 val)
at org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.loggerLog(PlatformProcessorImpl.java:404)
at org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.processInStreamOutLong(PlatformProcessorImpl.java:460)
at org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.processInStreamOutLong(PlatformProcessorImpl.java:512)
at org.apache.ignite.internal.processors.platform.PlatformTargetProxyImpl.inStreamOutLong(PlatformTargetProxyImpl.java:67)
at org.apache.ignite.internal.processors.platform.callback.PlatformCallbackUtils.inLongOutLong(Native Method)
at org.apache.ignite.internal.processors.platform.callback.PlatformCallbackGateway.cacheEntryFilterApply(PlatformCallbackGateway.java:143)
at org.apache.ignite.internal.processors.platform.cache.PlatformCacheEntryFilterImpl.apply(PlatformCacheEntryFilterImpl.java:70)
at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager$InternalScanFilter.apply(GridCacheQueryManager.java:3139)
第一个例外与网络级别的通信问题有关。见下文:
java.io.IOException: Óäàëåííûé õîñò ïðèíóäèòåëüíî ðàçîðâàë ñóùåñòâóþùåå ïîäêëþ÷åíèå
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(Unknown Source)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
at sun.nio.ch.IOUtil.read(Unknown Source)
at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1282)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2386)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2153)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at java.lang.Thread.run(Unknown Source)
[18:46:12,846][WARNING][grid-nio-worker-tcp-comm-0-#48][TcpCommunicationSpi] Closing NIO session because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException, msg=Óäàëåííûé õîñò ïðèíóäèòåëüíî ðàçîðâàë ñóùåñòâóþùåå ïîäêëþ÷åíèå]
[18:46:13,861][WARNING][tcp-comm-worker-#1][TcpCommunicationSpi] Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=/127.0.0.1:47101, failureDetectionTimeout=10000]
[18:46:14,893][WARNING][tcp-comm-worker-#1][TcpCommunicationSpi] Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=BB-SRV-DELTA/169.254.40.231:47101, failureDetectionTimeout=10000]
服务器或某些客户端似乎在10秒内对心跳或其他网络请求没有反应。还要检查客户端节点的日志。您可能需要扩展集群以添加更多服务器,以实现负载平衡或调整failureDetectionTimeou
。
Blocked system-critical thread has been detected...
错误消息是无害的但令人困惑。我已经重新启动了following conversation。
答案 1 :(得分:0)
如Denis所述,存在很多网络通信问题。
通常,客户端希望执行某些缓存操作,但是条带化池中的服务器线程被长时间阻止。我认为这与.NET部分无关。
您可以看到以下消息:
[18:53:04,385][SEVERE][tcp-disco-msg-worker-#2][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=sys-stripe-7, blockedFor=13s]
如果您查看线程:
hread [name="sys-stripe-7-#8", id=28, state=WAITING, blockCnt=51, waitCnt=3424]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(Unknown Source)
at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
at o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2911)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2713)
at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2672)
at o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:1656)
at o.a.i.i.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:1879)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1904)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1875)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1857)
at o.a.i.i.processors.continuous.GridContinuousProcessor.sendNotification(GridContinuousProcessor.java:1275)
at o.a.i.i.processors.continuous.GridContinuousProcessor.addNotification(GridContinuousProcessor.java:1212)
该线程正在尝试发送连续查询回调,但是未能建立与客户端节点的连接。这将导致线程被阻塞,并且无法为其他需要相同分区的缓存API请求提供服务。
乍一看,您可以尝试减少#clientFailureDetectionTimeout
,默认值为30秒。但这不能完全解决网络问题。