We are running into a problem with an MQTT subscriber in Spring Integration (4.0.3.RELEASE running on Tomcat 7 with the Paho MQTT Client 0.4.0).
The problem is with a subscriber on a heavily used topic (lots of messages). The devices sending the messages to the topic are field devices connected over GPRS.
Spring Integration and the broker (Mosquitto) are running on the same server.
The problem seems to appear after doing a couple of redeploys on Tomcat without restarting the server. When the problem occurs, restarting the Tomcat instance fixes it for a while.
Here is the chain of events (from the mosquitto log; the vdm-dev-live subscriber is the one having the problem):
When Spring Integration starts, we see all the subscribers connecting to their various topics:
1409645645: New client connected from xxx.xx.xx.xxx as vdm-dev-live (c1, k60).
1409645645: Sending CONNACK to vdm-dev-live (0)
1409645645: Received SUBSCRIBE from vdm-dev-live
1409645645: vdm/+/+/+/liveData (QoS 1)
1409645645: Sending SUBACK to vdm-dev-live
1409645645: New connection from xxx.xx.xx.xxx on port 1873.
1409645645: New client connected from xxx.xx.xx.xxx as vdm-dev-fmReq (c1, k60).
1409645645: Sending CONNACK to vdm-dev-fmReq (0)
1409645645: Received SUBSCRIBE from vdm-dev-fmReq
1409645645: vdm/+/+/+/firmware/request (QoS 1)
1409645645: Sending SUBACK to vdm-dev-fmReq
1409645645: New connection from xxx.xx.xx.xxx on port 1873.
1409645645: New client connected from xxx.xx.xx.xxx as vdm-dev-cfgReq (c1, k60).
1409645645: Sending CONNACK to vdm-dev-cfgReq (0)
1409645645: Received SUBSCRIBE from vdm-dev-cfgReq
1409645645: vdm/+/+/+/config/request (QoS 1)
1409645645: Sending SUBACK to vdm-dev-cfgReq
1409645645: New connection from xxx.xx.xx.xxx on port 1873.
1409645645: New client connected from xxx.xx.xx.xxx as vdm-dev-fmStat (c1, k60).
1409645645: Sending CONNACK to vdm-dev-fmStat (0)
1409645645: Received SUBSCRIBE from vdm-dev-fmStat
1409645645: vdm/+/+/firmware/status (QoS 1)
1409645645: Sending SUBACK to vdm-dev-fmStat
We see messages passing back and forth:
1409645646: Received PUBLISH from 89320292400015932480 (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
1409645646: Sending PUBLISH to vdm-dev-live (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
1409645646: Sending PUBLISH to Yo3zC8ou5y (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
1409645646: Sending PUBLISH to mqttjs_31f1e3f7cd0e0aed (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
1409645648: Received PUBLISH from 89320292400015932480 (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
1409645648: Sending PUBLISH to vdm-dev-live (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
1409645648: Sending PUBLISH to Yo3zC8ou5y (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
1409645648: Sending PUBLISH to mqttjs_31f1e3f7cd0e0aed (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
1409645650: Received PUBLISH from 89320292400015932480 (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
1409645650: Sending PUBLISH to vdm-dev-live (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
1409645650: Sending PUBLISH to Yo3zC8ou5y (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
1409645650: Sending PUBLISH to mqttjs_31f1e3f7cd0e0aed (d0, q0, r0, m0, 'vdm/89320292400015932480/WVWZZZ1KZDP005350/4.2/liveData', ... (36 bytes))
We see ping requests from the various subscribers:
1409645705: Received PINGREQ from vdm-dev-update
1409645705: Sending PINGRESP to vdm-dev-update
1409645705: Received PINGREQ from vdm-dev-live
1409645705: Sending PINGRESP to vdm-dev-live
1409645705: Received PINGREQ from vdm-dev-fmReq
1409645705: Sending PINGRESP to vdm-dev-fmReq
1409645705: Received PINGREQ from vdm-dev-cfgReq
1409645705: Sending PINGRESP to vdm-dev-cfgReq
1409645705: Received PINGREQ from vdm-dev-fmStat
1409645705: Sending PINGRESP to vdm-dev-fmStat
But suddenly we see this:
1409645776: Socket error on client vdm-dev-live, disconnecting.
At that point the subscriber is dead. We no longer see any ping requests, and it no longer processes any messages from that topic. At the broker level everything is fine, because I have debug-log subscribers (written in NodeJS) and I can see that those subscribers are still processing the messages from that topic (so the problem is at the subscriber level).
In the Tomcat logs we also see:
Sep 02, 2014 10:16:05 AM org.eclipse.paho.client.mqttv3.internal.ClientState checkForActivity
SEVERE: vdm-dev-live: Timed out as no activity, keepAlive=60,000 lastOutboundActivity=1,409,645,705,714 lastInboundActivity=1,409,645,755,075
But Paho does not do any cleanup/restart of this subscriber.
I also see this in the Tomcat logs:
SEVERE: The web application [/vdmapp] appears to have started a thread named [MQTT Snd: vdm-dev-live] but has failed to stop it. This is very likely to create a memory leak.
I also noticed a lot of threads for that subscriber that are stuck during shutdown:
"MQTT Snd: vdm-dev-live" daemon prio=10 tid=0x00007f1b44781800 nid=0x3061 in Object.wait() [0x00007f1aa7bfa000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1258)
- locked <0x00000007ab13e218> (a java.lang.Thread)
at java.lang.Thread.join(Thread.java:1332)
at org.eclipse.paho.client.mqttv3.internal.CommsReceiver.stop(CommsReceiver.java:77)
- locked <0x00000007ab552730> (a java.lang.Object)
at org.eclipse.paho.client.mqttv3.internal.ClientComms.shutdownConnection(ClientComms.java:294)
at org.eclipse.paho.client.mqttv3.internal.CommsSender.handleRunException(CommsSender.java:154)
at org.eclipse.paho.client.mqttv3.internal.CommsSender.run(CommsSender.java:131)
at java.lang.Thread.run(Thread.java:722)
Any idea what is causing this and how to prevent it?
Answer 0 (score: 2)
Following up on my response to @Artem's answer...
The Paho client appears to be deadlocked. See line 573 of your gist; the Snd thread is waiting for the Rec thread to terminate. At line 586, the Rec thread is blocked because the inbound queue is full (10). For all the cases that look like this, there is no Call thread, so the queue-full condition will never be cleared. Notice at line 227 that the three threads for the connection are working fine (presumably a reconnect after the redeploy?).
With the dead threads, there is no Call thread.
I believe the problem is in the Paho client: in the CommsCallback.run() method there is a catch on Throwable which shuts down the connection, but because the queue is full, the Rec thread is never notified (and so is never cleaned up). So it appears that delivering a message threw an exception which, with the queue full, causes this deadlock.
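To make the hang concrete, here is a minimal standalone sketch of the failure mode described above (illustrative Java only, not Paho code; the queue bound of 10 is taken from the gist observation, the monitor is named spaceAvailable to match the field referenced later in this answer, and everything else is invented for the demo):

import java.util.LinkedList;
import java.util.Queue;

// A "Rec"-like producer blocks once the bounded queue is full, while the "Call"-like
// consumer dies on an exception without ever calling notifyAll() on the monitor.
public class QueueFullDeadlockDemo {

    public static void main(String[] args) throws Exception {
        final int max = 10;                               // the inbound queue bound mentioned above
        final Queue<String> queue = new LinkedList<String>();
        final Object spaceAvailable = new Object();

        Thread call = new Thread(new Runnable() {         // consumer: dies, never drains the queue
            public void run() {
                try {
                    throw new RuntimeException("message delivery threw");
                }
                catch (Throwable ex) {
                    // gives up silently; nobody calls spaceAvailable.notifyAll()
                }
            }
        }, "Call");

        Thread rec = new Thread(new Runnable() {          // producer: fills the queue, then waits forever
            public void run() {
                try {
                    for (int i = 0; ; i++) {
                        synchronized (spaceAvailable) {
                            while (queue.size() >= max) {
                                spaceAvailable.wait();    // never woken up: the consumer is gone
                            }
                            queue.add("msg-" + i);
                        }
                    }
                }
                catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "Rec");
        rec.setDaemon(true);

        call.start();
        rec.start();
        Thread.sleep(1000);
        synchronized (spaceAvailable) {
            // prints something like "Rec=WAITING, queue=10": a blocked receiver and a full queue
            System.out.println("Rec=" + rec.getState() + ", queue=" + queue.size());
        }
    }
}

Anything that later tries to join() such a stuck thread (as CommsReceiver.stop() does in the stack trace above) will hang as well.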
The Paho client needs a fix, but in the meantime we can figure out what the exception is.
If the exception occurs downstream of the inbound gateway, you should see a log like this...
logger.error("Unhandled exception for " + message.toString(), e);

Since this log is produced in MqttCallback.messageArrived(), if you don't see such an error, the problem is likely in the Paho client itself.
The exception handling in CommsCallback looks like this...
} catch (Throwable ex) {
    // Users code could throw an Error or Exception e.g. in the case
    // of class NoClassDefFoundError
    // @TRACE 714=callback threw exception
    log.fine(className, methodName, "714", null, ex);
    running = false;
    clientComms.shutdownConnection(null, new MqttException(ex));
}
(They should be calling spaceAvailable.notifyAll() to wake up the (dying) Rec thread.)
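What that suggestion amounts to, sketched on a simplified stand-in (this is not the real CommsCallback; the field and method names are assumptions used only to show where the extra notifyAll() would go):

// Simplified stand-in for the callback worker, showing the suggested wake-up of a
// receiver thread that may be blocked on the full inbound queue.
public class CallbackWorkerSketch implements Runnable {

    private final Object spaceAvailable = new Object();  // monitor a blocked receiver waits on
    private volatile boolean running = true;

    public void run() {
        while (running) {
            try {
                deliverNextMessageToUserCallback();       // user code may throw anything
            }
            catch (Throwable ex) {
                running = false;
                shutdownConnection(ex);                   // existing behaviour: tear the connection down
                synchronized (spaceAvailable) {
                    spaceAvailable.notifyAll();           // suggested addition: release the blocked receiver
                }
            }
        }
    }

    private void deliverNextMessageToUserCallback() { /* callback delivery elided */ }

    private void shutdownConnection(Throwable cause) { /* connection teardown elided */ }
}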
So, enabling FINE logging for the Paho client should tell you where/what the exception is.
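Since the 0.4.0 Paho client logs through java.util.logging (the SEVERE entries from org.eclipse.paho.client.mqttv3.internal.ClientState in the Tomcat log above come from it), one minimal way to raise it to FINE programmatically might look like the sketch below; the class name PahoFineLogging is just an illustration, and the same levels can be set in your container's logging configuration instead:

import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

// Raises the Paho client's JUL loggers to FINE so traces such as
// "714=callback threw exception" become visible.
public final class PahoFineLogging {

    // keep a strong reference so the configured logger is not garbage collected
    private static final Logger PAHO_LOGGER =
            Logger.getLogger("org.eclipse.paho.client.mqttv3");

    public static void enable() {
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);                 // default handlers only pass INFO and above
        PAHO_LOGGER.setLevel(Level.FINE);
        PAHO_LOGGER.addHandler(handler);
    }

    private PahoFineLogging() {
    }
}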
Answer 1 (score: 1)
First of all, please share the Spring Integration and Paho Client versions.
Regarding "after doing a couple of redeploys": I see this code in CommsReceiver#stop():
if (!Thread.currentThread().equals(recThread)) {
try {
// Wait for the thread to finish.
recThread.join();
}
catch (InterruptedException ex) {
}
}
Thread.join():

"Waits for this thread to die."
I'm really not sure what that means or how it should proceed from that wait, but won't the redeploy end up being the bottleneck that lets those daemons stay alive, since the main thread doesn't die?