Process stuck in an SQS call

Asked: 2014-07-24 20:10:46

Tags: python amazon-web-services boto amazon-sqs

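I have a Python script that simply checks SQS for messages in a loop and then exits. A cron job runs every few minutes and restarts the script if it finds that it is not already running.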

# start
def main():
    for i from 1 to 100:
        check SQS for a new message [establishes connections to SQS]  # long polling not used; Receive Message Wait Time set to 0
        if a new job is found:
            ProcessIt()
# end
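
For readers who want something runnable, the loop above could be sketched roughly as follows. This is only an illustration: `fetch_message` and `process` are hypothetical callables standing in for the SQS receive call and for ProcessIt(), and the function name is made up for the example.

```python
def poll_queue(fetch_message, process, iterations=100):
    """Check for a message `iterations` times and process each one found.

    fetch_message: hypothetical callable wrapping the SQS receive call;
                   assumed to return None when the queue is empty.
    process:       hypothetical callable standing in for ProcessIt().
    Returns the number of messages that were processed.
    """
    handled = 0
    for _ in range(iterations):
        message = fetch_message()
        if message is not None:
            process(message)
            handled += 1
    return handled
```

If `fetch_message` blocks forever, the loop never finishes, the process never exits, and cron never sees a reason to start a new one, which matches the behavior described below.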

I found that after the script has been running on an EC2 instance for a few days, it goes stale and stops checking SQS for new messages.

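When I run lsof on the process's PID and grep for the SQS connections only, I see that every connection to SQS is in CLOSE_WAIT. My workaround is to kill the script process manually and restart it. So it seems cron cannot restart the script at all, because the process is still running, stuck in a call to SQS: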

ip-10-x-y-z:~ # lsof -p 9018  | grep "72.21"
ld-linux. 9018 root    7u  IPv4 474699439      0t0       TCP ip-10-x-y-z.ec2.internal:58211->72.21.202.145:https (CLOSE_WAIT)
ld-linux. 9018 root   10u  IPv4 474699560      0t0       TCP ip-10-x-y-z.ec2.internal:53428->72.21.194.47:https (CLOSE_WAIT)
ld-linux. 9018 root   12u  IPv4 474701017      0t0       TCP ip-10-x-y-z.ec2.internal:52166->72.21.214.70:https (CLOSE_WAIT)
ld-linux. 9018 root   18u  IPv4 474694555      0t0       TCP ip-10-x-y-z.ec2.internal:57267->72.21.202.145:https (CLOSE_WAIT)
ld-linux. 9018 root   22u  IPv4 474694573      0t0       TCP ip-10-x-y-z.ec2.internal:57271->72.21.202.145:https (CLOSE_WAIT)
ld-linux. 9018 root   39u  IPv4 474701031      0t0       TCP ip-10-x-y-z.ec2.internal:52170->72.21.214.70:https (CLOSE_WAIT)

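I know I should be using long polling, but I would still like to understand why the process gets stuck and never recovers on its own. I am using Boto 2.23.

Any input would be helpful.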

1 Answer:

Answer 0: (score: 1)

Debugging with gdb gave me the following backtrace for the stuck process:

(gdb) pystack
~/mypackage/lib/python2.6/ssl.py (293): do_handshake
~/mypackage/lib/python2.6/ssl.py (120): __init__
~/mypackage/lib/python2.6/ssl.py (350): wrap_socket
~/mypackage/lib/python2.6/site-packages/boto/https_connection.py (118): connect
~/mypackage/lib/python2.6/httplib.py (725): send
~/mypackage/lib/python2.6/httplib.py (764): _send_output
~/mypackage/lib/python2.6/httplib.py (892): endheaders
~/mypackage/lib/python2.6/httplib.py (937): _send_request
~/mypackage/lib/python2.6/httplib.py (899): request
~/mypackage/lib/python2.6/site-packages/boto/connection.py (902): _mexe
~/mypackage/lib/python2.6/site-packages/boto/connection.py (1063): make_request
~/mypackage/lib/python2.6/site-packages/boto/connection.py (1138): get_object
~/mypackage/lib/python2.6/site-packages/boto/sqs/connection.py (355): get_queue
~/mypackage/lib/python2.6/site-packages/sqs/SQSHelper.py (96): __init__
~/mypackage/sqs/SQSWrapper.py (1229): main
~/mypackage/sqs/SQSWrapper.py (1367): <module>

We can see that my script is stuck in the SQS get_queue() API call, blocked inside the SSL handshake (do_handshake).

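The problem appears to be in the SSL handshake function in Python 2.6. It was fixed in Python 2.7, although someone has reported the same problem on Python 2.7 as well (see the links below). I am going to fix the whole thing by moving to Python 2.7 and by putting a timeout of a few minutes around the SQS API calls in my SQS wrapper code.

The following links helped me narrow down the root cause and the fix: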

http://bugs.python.org/issue5103

http://hg.python.org/cpython/rev/ce4916ca06dd/

Web app hangs for several hours in ssl.py at self._sslobj.do_handshake()

Timeout function if it takes too long to finish
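
As a rough illustration of the "timeout around the SQS API calls" idea, here is a minimal sketch based on SIGALRM, in the spirit of the last link. It is not taken from the actual wrapper code: `call_with_timeout` and `CallTimedOut` are names invented for this example, and the approach assumes a Unix system and that the call is made from the main thread.

```python
import signal


class CallTimedOut(Exception):
    """Raised when the wrapped call exceeds its time budget."""


def call_with_timeout(func, seconds, *args, **kwargs):
    """Run func(*args, **kwargs); raise CallTimedOut after `seconds`.

    Assumption: Unix only, main thread only (signal.alarm requirements).
    `seconds` must be a positive integer.
    """
    def _on_alarm(signum, frame):
        raise CallTimedOut("call exceeded %d seconds" % seconds)

    # Install our handler and remember the previous one.
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        return func(*args, **kwargs)
    finally:
        signal.alarm(0)                           # cancel a pending alarm
        signal.signal(signal.SIGALRM, previous)   # restore the old handler
```

In the wrapper this could be used as something like `queue = call_with_timeout(conn.get_queue, 120, "my-queue")`, where `conn` and the queue name are placeholders. If the handshake hangs, the alarm should interrupt the blocking call and surface as `CallTimedOut`, so the script can exit and let cron start a fresh process instead of sitting in CLOSE_WAIT indefinitely.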