Recently I've run into a new problem with Apache. We have an application written in Python (3.5) with Flask (1.0.2), running on:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.2 LTS
Release: 16.04
Codename: xenial
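We have two servers behind an ELB (AWS Elastic Load Balancer), and suddenly (after running on this configuration for more than three months) they started failing. I noticed it through alerts from the ELB and from an external monitoring tool. We suddenly started getting E408 (timeout) and E503 (service unavailable) errors.
I started digging into what the cause might be. In the Apache logs I found a lot of messages like this one (seemingly right before the failures):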
[Mon Jun 25 22:27:04.613967 2018] [wsgi:error] [pid 1275:tid 139684390848256] (70008)Partial results are valid but processing is incomplete: [client 1.2.3.4:2819] mod_wsgi (pid=1275): Unable to get bucket brigade for request., referer: https://xx.xx.xx/
I also checked syslog and saw this:
Jun 25 22:13:25 my_hostname systemd[1]: Created slice User Slice of ubuntu.
Jun 25 22:13:25 my_hostname systemd[1]: Starting User Manager for UID 1000...
Jun 25 22:13:25 my_hostname systemd[1]: Started Session 1424 of user ubuntu.
Jun 25 22:13:25 my_hostname systemd[6239]: Reached target Sockets.
Jun 25 22:13:25 my_hostname systemd[6239]: Reached target Timers.
Jun 25 22:13:25 my_hostname systemd[6239]: Reached target Paths.
Jun 25 22:13:25 my_hostname systemd[6239]: Reached target Basic System.
Jun 25 22:13:25 my_hostname systemd[6239]: Reached target Default.
Jun 25 22:13:25 my_hostname systemd[6239]: Startup finished in 8ms.
Jun 25 22:13:25 my_hostname systemd[1]: Started User Manager for UID 1000.
Jun 25 22:14:47 my_hostname systemd[1]: Stopping LSB: Apache2 web server...
Jun 25 22:14:47 my_hostname apache2[6624]: * Stopping Apache httpd web server apache2
Jun 25 22:14:59 my_hostname apache2[6624]: *
Jun 25 22:14:59 my_hostname systemd[1]: Stopped LSB: Apache2 web server.
Jun 25 22:14:59 my_hostname systemd[1]: Starting LSB: Apache2 web server...
Jun 25 22:14:59 my_hostname apache2[6660]: * Starting Apache httpd web server apache2
Jun 25 22:14:59 my_hostname apache2[6660]: AH00557: apache2: apr_sockaddr_info_get() failed for my_hostname
Jun 25 22:14:59 my_hostname apache2[6660]: AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.0.1. Set the 'ServerName' directive globally to suppress this message
Jun 25 22:15:00 my_hostname apache2[6660]: *
Jun 25 22:15:00 my_hostname systemd[1]: Started LSB: Apache2 web server.
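Interestingly, both servers (which are almost identical) failed at the same time. They were restarted at roughly the same time because a new version was deployed, and the traffic on both is probably nearly the same since they sit behind one load balancer.
I have tried to find similar issues, but no luck so far.
One more interesting thing: I found several messages like these in the log: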
[Mon Jun 25 22:27:04.657763 2018] [wsgi:error] [pid 1274:tid 139684507617024] [remote 172.31.12.149:720] mod_wsgi (pid=1274): Exception occurred processing WSGI script '/home/ubuntu/my_app/app.wsgi'.
[Mon Jun 25 22:27:04.658503 2018] [wsgi:error] [pid 1274:tid 139684482414336] [remote 172.31.12.149:62417] mod_wsgi (pid=1274): Exception occurred processing WSGI script '/home/ubuntu/my_app/app.wsgi'.
[Mon Jun 25 22:27:04.658528 2018] [wsgi:error] [pid 1274:tid 139684532819712] [remote 172.31.12.149:52139] mod_wsgi (pid=1274): Exception occurred processing WSGI script '/home/ubuntu/my_app/app.wsgi'.
[Mon Jun 25 22:27:04.658584 2018] [wsgi:error] [pid 1274:tid 139684482414336] [remote 172.31.12.149:62417] OSError: failed to write data
[Mon Jun 25 22:27:04.658818 2018] [wsgi:error] [pid 1274:tid 139684516017920] [remote 172.31.12.149:208] OSError: failed to write data
[Mon Jun 25 22:27:04.659999 2018] [wsgi:error] [pid 1274:tid 139684532819712] [remote 172.31.12.149:52139] OSError: failed to write data
[Mon Jun 25 22:27:04.660411 2018] [wsgi:error] [pid 1274:tid 139684507617024] [remote 172.31.12.149:720] OSError: failed to write data
I'm not sure whether this is related, but I do know that we cancel a lot of requests before they complete (for reasons).
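To check whether the two kinds of errors actually line up in time with the failures, a rough sketch like the following could tally them per minute. It assumes the error-log line format quoted in this post; `tally_errors`, `LINE_RE` and the keys in `PATTERNS` are hypothetical names made up for illustration, not part of mod_wsgi or Apache.

```python
import re
from collections import Counter

# Matches the prefix of an Apache error-log line as quoted in this post, e.g.
# [Mon Jun 25 22:27:04.657763 2018] [wsgi:error] ...
LINE_RE = re.compile(
    r"^\[\w{3} (?P<month>\w{3}) (?P<day>\d{1,2}) "
    r"(?P<hhmm>\d{2}:\d{2}):\d{2}\.\d+ (?P<year>\d{4})\] \[wsgi:error\]"
)

# Substrings copied from the log lines above; the keys are illustrative labels.
PATTERNS = {
    "bucket_brigade": "Unable to get bucket brigade for request",
    "write_failed": "OSError: failed to write data",
}


def tally_errors(lines):
    """Count matching wsgi errors per (minute, kind)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a wsgi:error line in the expected format
        minute = "{year} {month} {day} {hhmm}".format(**m.groupdict())
        for kind, needle in PATTERNS.items():
            if needle in line:
                counts[(minute, kind)] += 1
    return counts
```

Passing an open file object for the Apache error log to `tally_errors` and printing the sorted items should show whether both messages spike in the same minutes as the E408/E503 alerts. This only counts occurrences; it does not prove one error causes the other.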
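Also, we have been running on Ubuntu + Flask (and probably much the same setup otherwise) for years and have never had a problem like this.
I'd really appreciate any ideas, thanks!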