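I'm trying to troubleshoot an application running on Tornado 2.4 on Ubuntu 11.04 on EC2. It periodically seems to hit 100% CPU and stall on that request for a few seconds.
Any help is greatly appreciated.
Symptoms: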
00:00:00 GET /some/request ()
00:00:09 GET /next/request (9000ms)
00:00:00 GET /some/request ()
00:00:09 GET /next/request (1ms)
# 9 seconds gap in requests is certainly not possible as clients are constantly polling.
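Tornado is running behind nginx.
Sending SIGINT at the moment it is most likely stalled gives a different stack trace each time. Some of them are below: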
Traceback (most recent call last):
  File "chat/main.py", line 3396, in <module>
    main()
  File "chat/main.py", line 3392, in main
    tornado.ioloop.IOLoop.instance().start()
  File "/home/ubuntu/tornado/tornado/ioloop.py", line 515, in start
    self._run_callback(callback)
  File "/home/ubuntu/tornado/tornado/ioloop.py", line 370, in _run_callback
    callback()
  File "/home/ubuntu/tornado/tornado/stack_context.py", line 216, in wrapped
    callback(*args, **kwargs)
  File "/home/ubuntu/tornado/tornado/iostream.py", line 303, in wrapper
    callback(*args)
  File "/home/ubuntu/tornado/tornado/stack_context.py", line 216, in wrapped
    callback(*args, **kwargs)
  File "/home/ubuntu/tornado/tornado/httpserver.py", line 298, in _on_request_body
    self.request_callback(self._request)
  File "/home/ubuntu/tornado/tornado/web.py", line 1421, in __call__
    handler = spec.handler_class(self, request, **spec.kwargs)
  File "/home/ubuntu/tornado/tornado/web.py", line 126, in __init__
    application.ui_modules.iteritems())
  File "/home/ubuntu/tornado/tornado/web.py", line 125, in <genexpr>
    self.ui["_modules"] = ObjectDict((n, self._ui_module(n, m)) for n, m in
  File "/home/ubuntu/tornado/tornado/web.py", line 1114, in _ui_module
    def _ui_module(self, name, module):
KeyboardInterrupt

Traceback (most recent call last):
  File "chat/main.py", line 3398, in <module>
    main()
  File "chat/main.py", line 3394, in main
    tornado.ioloop.IOLoop.instance().start()
  File "/home/ubuntu/tornado/tornado/ioloop.py", line 515, in start
    self._run_callback(callback)
  File "/home/ubuntu/tornado/tornado/ioloop.py", line 370, in _run_callback
    callback()
  File "/home/ubuntu/tornado/tornado/stack_context.py", line 216, in wrapped
    callback(*args, **kwargs)
  File "/home/ubuntu/tornado/tornado/iostream.py", line 303, in wrapper
    callback(*args)
  File "/home/ubuntu/tornado/tornado/stack_context.py", line 216, in wrapped
    callback(*args, **kwargs)
  File "/home/ubuntu/tornado/tornado/httpserver.py", line 285, in _on_headers
    self.request_callback(self._request)
  File "/home/ubuntu/tornado/tornado/web.py", line 1408, in __call__
    transforms = [t(request) for t in self.transforms]
  File "/home/ubuntu/tornado/tornado/web.py", line 1811, in __init__
    def __init__(self, request):
KeyboardInterrupt

Traceback (most recent call last):
  File "chat/main.py", line 3351, in <module>
    main()
  File "chat/main.py", line 3347, in main
    tornado.ioloop.IOLoop.instance().start()
  File "/home/ubuntu/tornado/tornado/ioloop.py", line 571, in start
    self._handlers[fd](fd, events)
  File "/home/ubuntu/tornado/tornado/stack_context.py", line 216, in wrapped
    callback(*args, **kwargs)
  File "/home/ubuntu/tornado/tornado/netutil.py", line 342, in accept_handler
    callback(connection, address)
  File "/home/ubuntu/tornado/tornado/netutil.py", line 237, in _handle_connection
    self.handle_stream(stream, address)
  File "/home/ubuntu/tornado/tornado/httpserver.py", line 156, in handle_stream
    self.no_keep_alive, self.xheaders, self.protocol)
  File "/home/ubuntu/tornado/tornado/httpserver.py", line 183, in __init__
    self.stream.read_until(b("\r\n\r\n"), self._header_callback)
  File "/home/ubuntu/tornado/tornado/iostream.py", line 139, in read_until
    self._try_inline_read()
  File "/home/ubuntu/tornado/tornado/iostream.py", line 385, in _try_inline_read
    if self._read_to_buffer() == 0:
  File "/home/ubuntu/tornado/tornado/iostream.py", line 401, in _read_to_buffer
    chunk = self.read_from_fd()
  File "/home/ubuntu/tornado/tornado/iostream.py", line 632, in read_from_fd
    chunk = self.socket.recv(self.read_chunk_size)
KeyboardInterrupt
Any hints on how to track this down are greatly appreciated.
Further observations:
strace -p shows empty output for as long as the process hangs.
ltrace -p during the hang shows only free() calls:
free(0x6fa70080) =
free(0x1175f8060) =
free(0x117a5c370) =
Answer 0 (score: 1)
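It sounds like you're suffering from a garbage collection (GC) storm. The behavior you describe is typical of that diagnosis, and the ltrace output further supports the hypothesis.
Lots of objects are being allocated and disposed of in the main/event loop you're using ... and that results in the periodic bursts of calls to free().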
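One possible approach would be to profile your code (or the libraries you depend on) and see whether it can be refactored to use (and reuse) objects from preallocated pools.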
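To make the pooling idea concrete, here is a minimal sketch. The names BufferPool, acquire and release are hypothetical (nothing here comes from Tornado or from your application), and it assumes the objects being churned are fixed-size buffers. Whether pooling helps at all depends on what profiling actually shows.

```python
import collections


class BufferPool(object):
    """Hypothetical fixed-size pool of reusable bytearray buffers."""

    def __init__(self, count, size):
        self._size = size
        # Allocate every buffer up front so the hot path can reuse them.
        self._free = collections.deque(bytearray(size) for _ in range(count))

    def acquire(self):
        # Hand out a pooled buffer if one is free; otherwise allocate a new one.
        if self._free:
            return self._free.popleft()
        return bytearray(self._size)

    def release(self, buf):
        # Reset the buffer to zeros at its original size and put it back.
        buf[:] = b"\x00" * self._size
        self._free.append(buf)
```

A handler would call acquire() at the start of its work and release() when it is done, instead of building a fresh object per request.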
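Another possible mitigation would be to make your own, more frequent calls to trigger garbage collection: more expensive overall, but possibly cheaper per call. (This would be a trade-off in favor of more predictable throughput.)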
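One way to sketch "more frequent, smaller collections" is a counter that forces a collection every n units of work. CollectEvery and tick are hypothetical names, and the right value of n is something you would have to find by measurement.

```python
import gc


class CollectEvery(object):
    """Hypothetical helper: force a full collection once every `n` calls."""

    def __init__(self, n):
        self.n = n
        self.calls = 0
        self.collections = 0

    def tick(self):
        # Call this at a quiet point, e.g. once a request has been finished.
        self.calls += 1
        if self.calls % self.n == 0:
            gc.collect()
            self.collections += 1
```

In a Tornado application you might call tick() from a handler's on_finish() hook, assuming your Tornado version provides it, so that the pause lands between requests rather than in the middle of one.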
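You can use the Python: gc module both to investigate the issue further (using gc.set_debug()) and as a simple attempt at mitigation (calling gc.collect() after every transaction, for example). You might also try running your application with gc.disable() for some reasonable length of time to see whether that further implicates the Python garbage collector. Note that disabling the garbage collector for an extended period will almost certainly cause paging/swapping ... so use it only to validate the hypothesis, and don't expect it to solve the problem in any meaningful way. It may just defer the problem until the whole system is thrashing and needs to be restarted.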
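The gc calls mentioned above can be combined into a rough diagnostic sketch. The helpers timed_collect and run_without_gc are hypothetical names. Only the gc and time functions they call are standard library.

```python
import gc
import time


def timed_collect():
    """Run a full collection; return (unreachable objects found, seconds taken)."""
    start = time.time()
    unreachable = gc.collect()
    return unreachable, time.time() - start


def run_without_gc(func):
    """Call func() with the cyclic collector disabled, then restore its state."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        return func()
    finally:
        if was_enabled:
            gc.enable()


if __name__ == "__main__":
    gc.set_debug(gc.DEBUG_STATS)  # report statistics to stderr on each collection
    print(gc.get_threshold())     # allocation thresholds for the three generations
    print(gc.get_count())         # current collection counts per generation
    print(timed_collect())
    gc.set_debug(0)               # turn the debug output back off
```

If timed_collect() reports pauses on the order of the stalls you see, or the stalls disappear under run_without_gc(), that would point further at the collector.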
Here's an example of gc.collect() being used in another SO thread on Tornado: SO: Tornado memory leak on dropped connections