Spring Integration MQTT subscriber (Paho) stops processing messages

Asked: 2014-09-02 09:33:03

Tags: spring-integration mqtt paho

We are having an issue with an MQTT subscriber in Spring Integration (4.0.3.RELEASE running on Tomcat 7, with the Paho MQTT Client 0.4.0).

The issue is with a subscriber on a heavily used topic (lots of messages). The devices sending messages to the topic are field devices connected over GPRS.

Spring Integration and the broker (Mosquitto) run on the same server.

The problem seems to appear after doing a couple of redeploys on Tomcat without restarting the server. When it happens, restarting the Tomcat instance fixes it for a while.
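For reference, each subscriber is wired up roughly like the sketch below (a simplified sketch rather than our exact configuration; the broker URL, channel name, QoS and downstream handler are placeholders):

// Minimal sketch of an MQTT inbound adapter like the vdm-dev-live subscriber.
// The broker URL, channel name, QoS and handler are assumptions.
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.annotation.ServiceActivator;
import org.springframework.integration.channel.DirectChannel;
import org.springframework.integration.config.EnableIntegration;
import org.springframework.integration.mqtt.inbound.MqttPahoMessageDrivenChannelAdapter;
import org.springframework.messaging.Message;
import org.springframework.messaging.MessageChannel;
import org.springframework.messaging.MessageHandler;

@Configuration
@EnableIntegration
public class LiveDataMqttConfig {

    @Bean
    public MessageChannel liveDataChannel() {
        return new DirectChannel();
    }

    @Bean
    public MqttPahoMessageDrivenChannelAdapter liveDataInbound() {
        MqttPahoMessageDrivenChannelAdapter adapter =
                new MqttPahoMessageDrivenChannelAdapter(
                        "tcp://localhost:1873",      // Mosquitto on the same server (port from the log)
                        "vdm-dev-live",              // client id seen in the mosquitto log
                        "vdm/+/+/+/liveData");       // topic filter seen in the mosquitto log
        adapter.setQos(1);
        adapter.setOutputChannel(liveDataChannel());
        return adapter;
    }

    @Bean
    @ServiceActivator(inputChannel = "liveDataChannel")
    public MessageHandler liveDataHandler() {
        return new MessageHandler() {
            @Override
            public void handleMessage(Message<?> message) {
                // hand off to the real processing code
            }
        };
    }
}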

Here is the chain of events (from the mosquitto log; the vdm-dev-live subscriber is the one with the problem):

When Spring Integration starts, we see all subscribers connect to their various topics:

1409645645: New client connected from xxx.xx.xx.xxx as vdm-dev-live (c1, k60).
1409645645: Sending CONNACK to vdm-dev-live (0)
1409645645: Received SUBSCRIBE from vdm-dev-live
1409645645:     vdm/+/+/+/liveData (QoS 1)
1409645645: Sending SUBACK to vdm-dev-live
1409645645: New connection from xxx.xx.xx.xxx on port 1873.
1409645645: New client connected from xxx.xx.xx.xxx as vdm-dev-fmReq (c1, k60).
1409645645: Sending CONNACK to vdm-dev-fmReq (0)
1409645645: Received SUBSCRIBE from vdm-dev-fmReq
1409645645:     vdm/+/+/+/firmware/request (QoS 1)
1409645645: Sending SUBACK to vdm-dev-fmReq
1409645645: New connection from xxx.xx.xx.xxx on port 1873.
1409645645: New client connected from xxx.xx.xx.xxx as vdm-dev-cfgReq (c1, k60).
1409645645: Sending CONNACK to vdm-dev-cfgReq (0)
1409645645: Received SUBSCRIBE from vdm-dev-cfgReq
1409645645:     vdm/+/+/+/config/request (QoS 1)
1409645645: Sending SUBACK to vdm-dev-cfgReq
1409645645: New connection from xxx.xx.xx.xxx on port 1873.
1409645645: New client connected from xxx.xx.xx.xxx as vdm-dev-fmStat (c1, k60).
1409645645: Sending CONNACK to vdm-dev-fmStat (0)
1409645645: Received SUBSCRIBE from vdm-dev-fmStat
1409645645:     vdm/+/+/firmware/status (QoS 1)
1409645645: Sending SUBACK to vdm-dev-fmStat

We see messages coming in and being passed on to the subscribers:

1409645646: Received PUBLISH from 89320292400015932480 (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
1409645646: Sending PUBLISH to vdm-dev-live (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
1409645646: Sending PUBLISH to Yo3zC8ou5y (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
1409645646: Sending PUBLISH to mqttjs_31f1e3f7cd0e0aed (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
1409645648: Received PUBLISH from 89320292400015932480 (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
1409645648: Sending PUBLISH to vdm-dev-live (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
1409645648: Sending PUBLISH to Yo3zC8ou5y (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
1409645648: Sending PUBLISH to mqttjs_31f1e3f7cd0e0aed (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
1409645650: Received PUBLISH from 89320292400015932480 (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
1409645650: Sending PUBLISH to vdm-dev-live (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
1409645650: Sending PUBLISH to Yo3zC8ou5y (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
1409645650: Sending PUBLISH to mqttjs_31f1e3f7cd0e0aed (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))

We see ping requests from the various subscribers:

1409645705: Received PINGREQ from vdm-dev-update
1409645705: Sending PINGRESP to vdm-dev-update
1409645705: Received PINGREQ from vdm-dev-live
1409645705: Sending PINGRESP to vdm-dev-live
1409645705: Received PINGREQ from vdm-dev-fmReq
1409645705: Sending PINGRESP to vdm-dev-fmReq
1409645705: Received PINGREQ from vdm-dev-cfgReq
1409645705: Sending PINGRESP to vdm-dev-cfgReq
1409645705: Received PINGREQ from vdm-dev-fmStat
1409645705: Sending PINGRESP to vdm-dev-fmStat

But suddenly we see this:

1409645776: Socket error on client vdm-dev-live, disconnecting.

At that point the subscriber is dead. We see no more ping requests from it, and no messages from that topic are processed any more. At the broker level everything is fine, because I have debug-log subscribers (written in NodeJS) and I can see that those subscribers are still receiving the messages on that topic (so the problem is at the subscriber level).

In the Tomcat logs we also see:

Sep 02, 2014 10:16:05 AM org.eclipse.paho.client.mqttv3.internal.ClientState checkForActivity
SEVERE: vdm-dev-live: Timed out as no activity, keepAlive=60,000 lastOutboundActivity=1,409,645,705,714 lastInboundActivity=1,409,645,755,075

But Paho does not do any cleanup/restart of this subscriber.

I also see this in the Tomcat logs when shutting it down:

SEVERE: The web application [/vdmapp] appears to have started a thread named [MQTT Snd: vdm-dev-live] but has failed to stop it. This is very likely to create a memory leak.

I also noticed that a lot of threads for that subscriber get stuck at shutdown:

"MQTT Snd: vdm-dev-live" daemon prio=10 tid=0x00007f1b44781800 nid=0x3061 in Object.wait() [0x00007f1aa7bfa000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.Thread.join(Thread.java:1258)
    - locked <0x00000007ab13e218> (a java.lang.Thread)
    at java.lang.Thread.join(Thread.java:1332)
    at org.eclipse.paho.client.mqttv3.internal.CommsReceiver.stop(CommsReceiver.java:77)
    - locked <0x00000007ab552730> (a java.lang.Object)
    at org.eclipse.paho.client.mqttv3.internal.ClientComms.shutdownConnection(ClientComms.java:294)
    at org.eclipse.paho.client.mqttv3.internal.CommsSender.handleRunException(CommsSender.java:154)
    at org.eclipse.paho.client.mqttv3.internal.CommsSender.run(CommsSender.java:131)
    at java.lang.Thread.run(Thread.java:722)

Any idea what causes this and how to prevent it?

2 Answers:

Answer 0 (score: 2):

Following up on my comment on @Artem's answer...

The Paho client appears to be deadlocked. See line 573 in your gist: the Snd thread is waiting for the Rec thread to terminate. At line 586, the Rec thread is blocked because the inbound queue is full (10). For all the cases that look like this, there is no Call thread, so the queue-full condition will never be cleared. Note that at line 227 all three threads are working fine (presumably a reconnect after the redeploy?).

With the dead threads, there is no Call thread.

I believe the problem is in the Paho client: in the CommsCallback.run() method there is a catch on Throwable which shuts down the connection, but because the queue is full, the Rec thread is never notified (and so never gets cleaned up). So it appears that message delivery threw an exception which, with the queue full, causes this deadlock.
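To see why that combination deadlocks, here is a self-contained sketch in plain Java (not Paho code; all names are invented) of the same pattern: a receiver thread waits for space on a bounded queue while the callback thread that should drain it dies on an exception without notifying the waiter:

import java.util.ArrayDeque;
import java.util.Queue;

// Plain-Java sketch of the deadlock pattern described above (names invented).
public class QueueFullDeadlockSketch {

    private static final int CAPACITY = 10;                 // same limit as Paho's inbound queue
    private final Queue<Object> queue = new ArrayDeque<Object>(CAPACITY);
    private final Object spaceAvailable = new Object();
    private volatile boolean running = true;

    // Analogue of the Rec thread: waits here when the queue is full.
    public void receive(Object message) throws InterruptedException {
        synchronized (spaceAvailable) {
            while (running && queue.size() >= CAPACITY) {
                spaceAvailable.wait();                       // stuck forever once the callback dies
            }
            queue.add(message);
        }
    }

    // Analogue of the Call(back) thread: if user code throws, it simply stops, and
    // because it never calls spaceAvailable.notifyAll(), receive() never wakes up.
    public void callbackLoop() {
        try {
            while (running) {
                Object next;
                synchronized (spaceAvailable) {
                    next = queue.poll();
                    spaceAvailable.notifyAll();              // normally frees a blocked receiver
                }
                deliver(next);                               // user code; may throw
            }
        }
        catch (Throwable ex) {
            running = false;                                 // the bug: no notifyAll() on this path
        }
    }

    private void deliver(Object message) {
        // hand the message to user code (null checks and idle waiting omitted for brevity)
    }
}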

The Paho client needs a fix, but in the meantime we can work out what the exception is.

If the exception occurs downstream of the inbound gateway, you should see a log entry like...

logger.error("Unhandled exception for " + message.toString(), e);

Since this log entry is produced from MqttCallback.messageArrived(), if you don't see such an error, the problem is probably in the Paho client itself.
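"Downstream of the inbound gateway" means something like the hypothetical handler below (the channel name and payload handling are invented): a runtime exception thrown there propagates back up the direct channel into messageArrived() on the Paho callback thread, and that is where the log entry above gets written:

import org.springframework.integration.annotation.ServiceActivator;
import org.springframework.messaging.Message;
import org.springframework.stereotype.Component;

// Hypothetical downstream handler (channel name and payload handling invented).
// A RuntimeException thrown here travels back into messageArrived() on the Paho
// callback thread and shows up as the "Unhandled exception for ..." log entry.
@Component
public class LiveDataHandler {

    @ServiceActivator(inputChannel = "liveDataChannel")
    public void handle(Message<?> message) {
        String payload = (String) message.getPayload();   // assuming the default converter (String payloads)
        if (payload.split(",").length < 3) {
            throw new IllegalArgumentException("Unexpected liveData payload: " + payload);
        }
        // ... normal processing ...
    }
}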

The exception handling in CommsCallback looks like this...

} catch (Throwable ex) {
    // Users code could throw an Error or Exception e.g. in the case
    // of class NoClassDefFoundError
    // @TRACE 714=callback threw exception
    log.fine(className, methodName, "714", null, ex);
    running = false;
    clientComms.shutdownConnection(null, new MqttException(ex));
}

(They should call spaceAvailable.notifyAll() there to wake the (dying) Rec thread.)

So, enabling FINE logging for the Paho client should tell you where the exception occurs and what it is.
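One way to do that, assuming Paho's default java.util.logging setup (it names its loggers after its classes), is sketched below; under Tomcat the equivalent levels can also be set in conf/logging.properties:

import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch: raise the Paho loggers to FINE. Configuring the package-level parent
// logger covers the per-class loggers; keeping a static reference prevents the
// logger from being garbage collected and losing the level again.
public final class PahoFineLogging {

    private static final Logger PAHO_LOGGER =
            Logger.getLogger("org.eclipse.paho.client.mqttv3");

    public static void enable() {
        PAHO_LOGGER.setLevel(Level.FINE);
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);                 // the handler must also pass FINE records
        PAHO_LOGGER.addHandler(handler);
    }

    private PahoFineLogging() {
    }
}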

Answer 1 (score: 1):

First of all, please share the versions of Spring Integration and the Paho client you are using.

Regarding "after doing a couple of redeploys", I see this code in CommsReceiver#stop():

if (!Thread.currentThread().equals(recThread)) {
    try {
        // Wait for the thread to finish.
        recThread.join();
    }
    catch (InterruptedException ex) {
    }
}

where the Javadoc for Thread.join() says:

    Waits for this thread to die.

I'm really not sure what that means or how to take it further, but won't the redeploy then become the bottleneck that allows those daemon threads to stay alive, since the main thread doesn't die?
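As a debugging aid for the redeploy angle, something like the sketch below (assuming adapter beans like the one in the configuration sketch near the top of the question) makes the shutdown explicit and observable. Spring should already stop the adapters when the web application context is closed on undeploy, so if the "Stopped ..." line never appears, the shutdown is hanging in CommsReceiver.stop() exactly as in the thread dump above:

import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.ApplicationListener;
import org.springframework.context.event.ContextClosedEvent;
import org.springframework.integration.mqtt.inbound.MqttPahoMessageDrivenChannelAdapter;
import org.springframework.stereotype.Component;

// Sketch: log around each adapter's stop() during context shutdown so a hang
// in the Paho client is visible in the Tomcat logs. Bean wiring is assumed to
// match the earlier configuration sketch.
@Component
public class MqttShutdownLogger implements ApplicationListener<ContextClosedEvent> {

    private static final Log logger = LogFactory.getLog(MqttShutdownLogger.class);

    private final List<MqttPahoMessageDrivenChannelAdapter> adapters;

    @Autowired
    public MqttShutdownLogger(List<MqttPahoMessageDrivenChannelAdapter> adapters) {
        this.adapters = adapters;
    }

    @Override
    public void onApplicationEvent(ContextClosedEvent event) {
        for (MqttPahoMessageDrivenChannelAdapter adapter : adapters) {
            logger.info("Stopping MQTT inbound adapter " + adapter.getComponentName());
            adapter.stop();   // a hang here points at the CommsReceiver.stop() deadlock
            logger.info("Stopped " + adapter.getComponentName());
        }
    }
}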