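I am running a Flask-SocketIO application with eventlet in which every socket event is emitted to a specific room. Each room represents a chat, and the room name is the chat_uuid from a Flask-SQLAlchemy table. When a user opens a chat page in the app, they join that chat's room. I run the app both locally and on an AWS EC2 instance behind NGINX and Gunicorn. The bug is that users occasionally fail to receive messages. For example, I opened 40-50 Chrome tabs and sent one message, and a single tab did not receive it. At first I thought the bug was related to this stack trace that I get in the console: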
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/eventlet/wsgi.py", line 539, in handle_one_response
    result = self.application(self.environ, start_response)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/flask/app.py", line 1997, in __call__
    return self.wsgi_app(environ, start_response)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/flask_socketio/__init__.py", line 43, in __call__
    start_response)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/engineio/middleware.py", line 47, in __call__
    return self.engineio_app.handle_request(environ, start_response)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/socketio/server.py", line 360, in handle_request
    return self.eio.handle_request(environ, start_response)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/engineio/server.py", line 274, in handle_request
    environ, start_response)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/engineio/socket.py", line 91, in handle_get_request
    start_response)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/engineio/socket.py", line 133, in _upgrade_websocket
    return ws(environ, start_response)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/engineio/async_eventlet.py", line 19, in __call__
    return super(WebSocketWSGI, self).__call__(environ, start_response)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/eventlet/websocket.py", line 129, in __call__
    self.handler(ws)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/engineio/socket.py", line 158, in _websocket_handler
    pkt = ws.wait()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/eventlet/websocket.py", line 787, in wait
    for i in self.iterator:
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/eventlet/websocket.py", line 642, in _iter_frames
    message = self._recv_frame(message=fragmented_message)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/eventlet/websocket.py", line 668, in _recv_frame
    header = recv(2)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/eventlet/websocket.py", line 577, in _get_bytes
    d = self.socket.recv(numbytes - len(data))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/eventlet/greenio/base.py", line 363, in recv
    return self._recv_loop(self.fd.recv, b'', bufsize, flags)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/eventlet/greenio/base.py", line 357, in _recv_loop
    self._read_trampoline()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/eventlet/greenio/base.py", line 328, in _read_trampoline
    timeout_exc=socket_timeout('timed out'))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/eventlet/greenio/base.py", line 207, in _trampoline
    mark_as_closed=self._mark_as_closed)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/eventlet/hubs/__init__.py", line 163, in trampoline
    return hub.switch()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 295, in switch
    return self.greenlet.switch()
socket.timeout: timed out
Then I noticed this log entry right after the stack trace:
127.0.0.1 - - [09/Apr/2018 11:46:46] "GET /socket.io/?EIO=3&transport=websocket&sid=201dbaa98d9844c998c5e94366838d24 HTTP/1.1" 500 0 60.089869
2018-04-09 11:46:46,454 - portal.api.sockets - INFO - User USERNAME has disconnected to sockets
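So a socket timed out after about 60 seconds, which matches the default ping_timeout of 60 seconds in the Flask-SocketIO version I have installed (not ping_interval, which defaults to 25 seconds). So in my Flask-SocketIO init I changed the ping settings to: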
socket_io = SocketIO(
    app,
    ...
    ping_interval=2000,
    ping_timeout=120000,
    ...)
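One thing worth double-checking here: as far as the Flask-SocketIO documentation goes, ping_interval and ping_timeout are given in seconds, not milliseconds, so 2000 and 120000 are probably far larger than intended. A minimal sketch of a sanity check follows; `ping_kwargs` and its `max_s` limit are hypothetical and not part of Flask-SocketIO.

```python
def ping_kwargs(interval_s, timeout_s, max_s=600):
    """Build ping keyword arguments for SocketIO(), with values in seconds.

    Hypothetical helper: it rejects values that look like milliseconds.
    """
    for name, value in (("ping_interval", interval_s), ("ping_timeout", timeout_s)):
        if not 0 < value <= max_s:
            raise ValueError("{} = {} does not look like seconds".format(name, value))
    return {"ping_interval": interval_s, "ping_timeout": timeout_s}


# Assumed usage: socket_io = SocketIO(app, **ping_kwargs(25, 120))
print(ping_kwargs(25, 120))
```

Passing the values from the configuration above (2000 and 120000) would raise ValueError under this check.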
In the client-side JavaScript I also changed the Socket.IO timeout:
socket = io.connect(location.origin, {
    'timeout': 120000 // Increasing connection timeout.
})
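This seemed to reduce the errors, but they still happen, and noticeably more often on the server than locally.
After some debugging I concluded that the bug comes from users occasionally being dropped from their rooms without trying to rejoin them. To confirm my suspicion, I wrote a debugging function that joins a room. In a browser tab affected by the bug, I used that debug function to join the room it had "failed" to join, sent a message to it from another window, and the tab received it. So the bug may be related to my join_room function, which is this: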
@socket_io.on('connect-to-chat-room')
def connect_to_chat(chat_uuid):
    # Join chat room and see if user is authorized to join it.
    chat_room = get_chat_by_uuid(chat_uuid)
    if is_authorized_object(chat_permission, chat_room.chat_id):
        logger.debug("Socket connection attempt to chat_uuid{}".format(chat_uuid))
        join_room(chat_uuid)
        logger.info("User {} sockets connected to chat room with uuid:{}".format(current_user.username, chat_uuid))
    else:
        return False
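As far as I understand Socket.IO, a client that reconnects gets a new session id and is no longer in the rooms it had joined, so one possible way to test the suspicion is to remember which rooms each user joined and re-join them on the next connect. A minimal sketch, assuming a single server process; `RoomRegistry` and the handler wiring in the comments are hypothetical and not part of Flask-SocketIO.

```python
from collections import defaultdict


class RoomRegistry:
    """Remember which chat rooms each user has joined (in-memory, one process)."""

    def __init__(self):
        self._rooms = defaultdict(set)

    def remember(self, username, chat_uuid):
        self._rooms[username].add(chat_uuid)

    def forget(self, username, chat_uuid):
        self._rooms[username].discard(chat_uuid)

    def rooms_for(self, username):
        # Return a sorted copy so callers cannot mutate the registry.
        return sorted(self._rooms.get(username, ()))


registry = RoomRegistry()

# Assumed wiring, not tested here:
#   in connect_to_chat, after join_room(chat_uuid):
#       registry.remember(current_user.username, chat_uuid)
#   in a 'connect' handler:
#       for chat_uuid in registry.rooms_for(current_user.username):
#           join_room(chat_uuid)
```

Because the registry is keyed by user, every tab of that user would rejoin every remembered room, and authorization would need to be re-checked before rejoining. Having the client re-emit 'connect-to-chat-room' on every connect is the simpler alternative.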
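After some research I came across a GitHub issue explaining that you have to call socketio.sleep to release the CPU (https://github.com/miguelgrinberg/Flask-SocketIO/issues/670). So I updated my function with a few sleep calls so that it gives up the CPU to other tasks: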
@socket_io.on('connect-to-chat-room')
def connect_to_chat(chat_uuid):
    # Join chat room and see if user is authorized to join it.
    socket_io.sleep(0)
    chat_room = get_chat_by_uuid(chat_uuid)
    if is_authorized_object(chat_permission, chat_room.chat_id):
        socket_io.sleep(0)
        logger.debug("Socket connection attempt to chat_uuid{}".format(chat_uuid))
        join_room(chat_uuid)
        socket_io.sleep(0)
        logger.info("User {} sockets connected to chat room with uuid:{}".format(current_user.username, chat_uuid))
    else:
        socket_io.sleep(0)
        return False
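For context, sleep(0) under eventlet does not lower CPU usage by itself. It hands control to other green threads, which matters mostly inside long-running loops rather than in a short handler like this one. A minimal sketch of that pattern, where `process_in_batches` is a hypothetical helper and the `sleep` argument is assumed to be something like `socket_io.sleep`:

```python
def process_in_batches(items, handle, sleep, yield_every=100):
    """Call handle(item) for each item, yielding control every `yield_every` items."""
    processed = 0
    for item in items:
        handle(item)
        processed += 1
        if processed % yield_every == 0:
            sleep(0)  # let other green threads (e.g. ping/pong handling) run
    return processed


# Stand-in for socket_io.sleep so the sketch runs without eventlet:
# each call just records its argument in a list.
yields = []
count = process_in_batches(range(250), lambda item: None, yields.append, yield_every=100)
print(count, len(yields))
```

With 250 items and a yield every 100, the stand-in is called twice.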
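However, this did not seem to fix the bug. Another interesting thing I found after some browser testing is that it happens far more often in Safari than in Chrome. In Chrome, about 1 in 40 tabs shows the bug, while in Safari it is closer to 6 in 40. I believe this is because Safari falls back from WebSockets to long polling, although that is only a guess and I am not sure why it would.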
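The long-polling guess can be checked against the access log, since each Engine.IO request carries its transport in the query string. A minimal sketch that extracts and counts transports per log line; `transport_of` and `count_transports` are hypothetical helpers, and the log format is assumed to match the access-log line shown earlier.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Matches the quoted request part of an access-log line, e.g. "GET /path HTTP/1.1".
REQUEST_RE = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+"')


def transport_of(log_line):
    """Return the Engine.IO transport named in an access-log line, or None."""
    match = REQUEST_RE.search(log_line)
    if match is None:
        return None
    query = parse_qs(urlsplit(match.group(1)).query)
    values = query.get("transport")
    return values[0] if values else None


def count_transports(lines):
    """Count how many log lines used each transport, skipping unrelated lines."""
    return Counter(t for t in map(transport_of, lines) if t is not None)


sample = ('127.0.0.1 - - [09/Apr/2018 11:46:46] "GET /socket.io/'
          '?EIO=3&transport=websocket&sid=201dbaa98d9844c998c5e94366838d24 HTTP/1.1" 500 0 60.089869')
print(transport_of(sample))
```

Running `count_transports` over the log lines from a Safari session and a Chrome session would show whether the failing tabs are the ones that stay on `transport=polling`.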
Now I am stuck on what to try next to fix this. Can anyone help?