Question

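I am using Ignite.NET 2.7.6 in a configuration with one server and about 40 clients. After about 8 hours of operation the server starts misbehaving: clients cannot connect to it, some queries return no results, and so on.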

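On the server side, memory consumption is normal, the thread count is around 250, and everything looks fine. I cannot see any obvious problem, so I decided to work through everything marked "SEVERE" on the server side.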

The first one I ran into is:

Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=tcp-comm-worker, blockedFor=13s]

So I would like to understand why this happens. The full server logs can be found here:

https://yadi.sk/d/LF03Vz5vz4tRcw

https://yadi.sk/d/MMe0xrgI3k6lkA

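Added: this problem does not seem to be harmless. The message appears once per second from various threads, and the "blockedFor" value grows from a few seconds to several hours.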
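To see which threads are reported and what the most recent reported duration is, the server log can be scanned for these messages. The following is a minimal sketch in plain Java, not part of Ignite: the class name `BlockedThreadScan` and the method `latestBlockedFor` are hypothetical, and the pattern assumes the message layout quoted in this thread (`[threadName=..., blockedFor=...]`). The `blockedFor` value is kept as raw text, since only the `13s` form appears here and longer durations may be formatted differently.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BlockedThreadScan {
    // Assumed layout: "... Blocked system-critical thread has been detected. ...
    // [threadName=sys-stripe-7, blockedFor=13s]"
    private static final Pattern BLOCKED = Pattern.compile(
        "Blocked system-critical thread has been detected.*"
            + "\\[threadName=([^,\\]]+), blockedFor=([^\\]]+)\\]");

    /** Returns, per thread name, the blockedFor text of the last matching line. */
    static Map<String, String> latestBlockedFor(List<String> lines) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = BLOCKED.matcher(line);
            if (m.find()) {
                // Later lines overwrite earlier ones, so the newest value wins.
                latest.put(m.group(1), m.group(2));
            }
        }
        return latest;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java BlockedThreadScan <server-log-file>");
            return;
        }
        // ISO-8859-1 maps every byte to a character, so a log with localized
        // (non-UTF-8) messages does not cause a decoding error.
        List<String> lines =
            Files.readAllLines(Paths.get(args[0]), StandardCharsets.ISO_8859_1);
        latestBlockedFor(lines).forEach(
            (thread, blockedFor) -> System.out.println(thread + " -> " + blockedFor));
    }
}
```

Running it against a server log lists each reported thread once with its latest `blockedFor` value, which makes it easier to tell a single stuck thread from a pool-wide stall.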

The load on the server is low, but as server threads become blocked it stops responding and stops registering new clients.

Here are the logs from the server:

https://yadi.sk/d/tc3g2hb9B0jtvg

https://yadi.sk/d/05YrlYXcp4xPqg

Here is the log from one of the clients:

https://yadi.sk/d/bcbQ7ee4PUzq2w

The last lines of the client log, at 19:03:52, are from when the server was restarted.

Answer 1

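I see the following .NET-specific exception, but it was most likely triggered by another issue. In any case, this one has been reported to the community. (The localized message means "Object reference not set to an instance of an object.")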

    class org.apache.ignite.IgniteException: Platform error:System.NullReferenceException: Ссылка на объект не указывает на экземпляр объекта.
   в Apache.Ignite.Core.Impl.Unmanaged.UnmanagedCallbacks.CacheEntryFilterApply(Int64 memPtr)
   в Apache.Ignite.Core.Impl.Unmanaged.UnmanagedCallbacks.InLongOutLong(Int32 type, Int64 val)
    at org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.loggerLog(PlatformProcessorImpl.java:404)
    at org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.processInStreamOutLong(PlatformProcessorImpl.java:460)
    at org.apache.ignite.internal.processors.platform.PlatformProcessorImpl.processInStreamOutLong(PlatformProcessorImpl.java:512)
    at org.apache.ignite.internal.processors.platform.PlatformTargetProxyImpl.inStreamOutLong(PlatformTargetProxyImpl.java:67)
    at org.apache.ignite.internal.processors.platform.callback.PlatformCallbackUtils.inLongOutLong(Native Method)
    at org.apache.ignite.internal.processors.platform.callback.PlatformCallbackGateway.cacheEntryFilterApply(PlatformCallbackGateway.java:143)
    at org.apache.ignite.internal.processors.platform.cache.PlatformCacheEntryFilterImpl.apply(PlatformCacheEntryFilterImpl.java:70)
    at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager$InternalScanFilter.apply(GridCacheQueryManager.java:3139)

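The first exception is related to communication problems at the network level (the localized message means "An existing connection was forcibly closed by the remote host"). See below: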

java.io.IOException: Удаленный хост принудительно разорвал существующее подключение
    at sun.nio.ch.SocketDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(Unknown Source)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
    at sun.nio.ch.IOUtil.read(Unknown Source)
    at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
    at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1282)
    at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2386)
    at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2153)
    at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at java.lang.Thread.run(Unknown Source)
[18:46:12,846][WARNING][grid-nio-worker-tcp-comm-0-#48][TcpCommunicationSpi] Closing NIO session because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException, msg=Удаленный хост принудительно разорвал существующее подключение]
[18:46:13,861][WARNING][tcp-comm-worker-#1][TcpCommunicationSpi] Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=/127.0.0.1:47101, failureDetectionTimeout=10000]
[18:46:14,893][WARNING][tcp-comm-worker-#1][TcpCommunicationSpi] Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=BB-SRV-DELTA/169.254.40.231:47101, failureDetectionTimeout=10000]

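It looks like either the server or some of the clients did not respond to heartbeats or other network requests within 10 seconds. Check the logs of the client nodes as well. You may need to scale the cluster out by adding more servers to balance the load, or tune failureDetectionTimeout.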

The "Blocked system-critical thread has been detected..." error message is harmless but confusing. I have restarted the following conversation.

Answer 2

As Denis mentioned, there are a lot of network communication issues.

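Typically, a client wants to perform some cache operation, but the server thread in the striped pool that should handle it is blocked for a long time. I do not think this is related to the .NET part.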

You can see the following message:

[18:53:04,385][SEVERE][tcp-disco-msg-worker-#2][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=sys-stripe-7, blockedFor=13s]

If you look at that thread:

Thread [name="sys-stripe-7-#8", id=28, state=WAITING, blockCnt=51, waitCnt=3424]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(Unknown Source)
        at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
        at o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2911)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2713)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2672)
        at o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:1656)
        at o.a.i.i.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:1879)
        at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1904)
        at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1875)
        at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1857)
        at o.a.i.i.processors.continuous.GridContinuousProcessor.sendNotification(GridContinuousProcessor.java:1275)
        at o.a.i.i.processors.continuous.GridContinuousProcessor.addNotification(GridContinuousProcessor.java:1212)

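The thread is trying to send a continuous query callback but fails to establish a connection to the client node. As a result the thread stays blocked and cannot serve other cache API requests that need the same partition.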
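Why one stuck send stalls unrelated requests can be illustrated with a simplified model of a striped pool. This is only a sketch in plain Java, not Ignite's actual implementation: the class `StripeBlockingDemo`, the stripe count of 8, and the mapping of a partition to a stripe by modulo are all assumptions made for illustration. Each stripe is a single thread, so a task that waits for an unreachable client holds up everything queued behind it on that stripe, while other stripes keep working.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class StripeBlockingDemo {
    /**
     * Returns three flags: other stripe finished, same stripe finished while
     * the blocker was still waiting, same stripe finished after release.
     */
    static boolean[] run() throws Exception {
        int stripes = 8; // assumed pool size, for illustration only
        ExecutorService[] pool = new ExecutorService[stripes];
        for (int i = 0; i < stripes; i++) {
            pool[i] = Executors.newSingleThreadExecutor(); // one thread per stripe
        }
        CountDownLatch clientReachable = new CountDownLatch(1);
        try {
            int blockedStripe = 7 % stripes;
            // Stand-in for a notification send waiting on a client connection.
            pool[blockedStripe].submit(() -> {
                clientReachable.await();
                return null;
            });
            // A request for the same partition queues behind the blocked task.
            Future<String> sameStripe = pool[blockedStripe].submit(() -> "partition 7 served");
            // A request for another partition runs on a different stripe.
            Future<String> otherStripe = pool[3 % stripes].submit(() -> "partition 3 served");

            otherStripe.get(5, TimeUnit.SECONDS);
            boolean otherDone = otherStripe.isDone();
            boolean sameDoneWhileBlocked = sameStripe.isDone();

            clientReachable.countDown(); // the "connection" finally succeeds
            sameStripe.get(5, TimeUnit.SECONDS);
            return new boolean[] {otherDone, sameDoneWhileBlocked, sameStripe.isDone()};
        } finally {
            clientReachable.countDown();
            for (ExecutorService stripe : pool) {
                stripe.shutdownNow();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        boolean[] r = run();
        System.out.println("other stripe served: " + r[0]);
        System.out.println("same stripe served while blocked: " + r[1]);
        System.out.println("same stripe served after release: " + r[2]);
    }
}
```

In this model the request on the other stripe completes immediately, while the request sharing the blocked stripe completes only once the waiting task is released, which mirrors the symptom of some cache operations hanging while others still succeed.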

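As a first step, you could try lowering #clientFailureDetectionTimeout, which defaults to 30 seconds, so that unresponsive clients are detected as failed sooner. However, this will not fully solve the network issues.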

Blocked system-critical thread has been detected
