我们在AWS节点上运行SonarQube 5.1.2。经过一段时间的使用,通常是一两天后,Sonar Web服务器变得没有响应,并且会刺激服务器的CPU:
top - 01:59:47 up 2 days, 3:43, 1 user, load average: 1.89, 1.76, 1.11
Tasks: 93 total, 1 running, 92 sleeping, 0 stopped, 0 zombie
Cpu(s): 94.5%us, 0.0%sy, 0.0%ni, 5.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 7514056k total, 2828772k used, 4685284k free, 155372k buffers
Swap: 0k total, 0k used, 0k free, 872440k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2328 root 20 0 3260m 1.1g 19m S 188.3 15.5 62:51.79 java
11 root 20 0 0 0 0 S 0.3 0.0 0:07.90 events/0
2284 root 20 0 3426m 407m 19m S 0.3 5.5 9:51.04 java
1 root 20 0 19356 1536 1224 S 0.0 0.0 0:00.23 init
188%的CPU负载来自WebServer进程:
$ ps -eF|grep "root *2328"
root 2328 2262 2 834562 1162384 0 Mar01 ? 01:06:24 /usr/java/jre1.8.0_25/bin/java -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djruby.management.enabled=false -Djruby.compile.invokedynamic=false -Xmx768m -XX:MaxPermSize=160m -XX:+HeapDumpOnOutOfMemoryError -Djava.io.tmpdir=/opt/sonar/temp -cp ./lib/common/*:./lib/server/*:/opt/sonar/lib/jdbc/mysql/mysql-connector-java-5.1.34.jar org.sonar.server.app.WebServer /tmp/sq-process615754070383971531properties
我们最初认为我们运行的节点太小而且最近升级到m3大型实例,但是我们看到了同样的问题(除了它现在加上2个CPU而不是一个)。
日志中唯一有趣的信息是:
2016.03.04 01:52:38 WARN web[o.e.transport] [sonar-1456875684135] Received response for a request that has timed out, sent [39974ms] ago, timed out [25635ms] ago, action [cluster:monitor/nodes/info], node [[#transport#-1][xxxxxxxx-build02-us-west-2b][inet[/127.0.0.1:9001]]], id [43817]
2016.03.04 01:53:19 INFO web[o.e.client.transport] [sonar-1456875684135] failed to get node info for [#transport#-1][xxxxxxxx-build02-us-west-2b][inet[/127.0.0.1:9001]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[/127.0.0.1:9001]][cluster:monitor/nodes/info] request_id [43817] timed out after [14339ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366) ~[elasticsearch-1.4.4.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [na:1.8.0_25]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [na:1.8.0_25]
at java.lang.Thread.run(Unknown Source) [na:1.8.0_25]
有谁知道这里可能会发生什么,或者有一些想法如何进一步诊断这个问题?