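I am dealing with a strange 100% CPU usage problem in a nodejs application. The application is huge and I am not sure what is causing it. The application is managed by pm2 in cluster_mode.
All I know is that, at the time of high CPU usage, strace outputs: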
root@a:/# strace -p 4350 -c
Process 4350 attached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.000031 0 3388 clock_gettime
0.00 0.000000 0 1 read
0.00 0.000000 0 2 write
0.00 0.000000 0 1 rt_sigreturn
------ ----------- ----------- --------- --------- ----------------
100.00 0.000031 3392 total
root@a:~# strace -p 3367 -r -c
Process 3367 attached
^CProcess 3367 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
75.00 0.000939 0 91973 gettimeofday
25.00 0.000313 0 39417 clock_gettime
------ ----------- ----------- --------- --------- ----------------
100.00 0.001252 131390 total
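At this point the whole application is unresponsive. After about 5 minutes, pm2 kicks in and restarts the process because its reported memory is zero: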
2016-12-03-20:29:05 PM2 [PM2][WORKER] Process 1 restarted because it uses 0 memory and has ONLINE status
2016-12-03-20:29:05 PM2 Stopping app:api-v2 id:1
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still alive after 1600ms, sending it SIGKILL now...
2016-12-03-20:29:07 PM2 App name:api-v2 id:1 disconnected
2016-12-03-20:29:07 PM2 App [api-v2] with id [1] and pid [3367], exited with code [0] via signal [SIGKILL]
2016-12-03-20:29:07 PM2 Starting execution sequence in -cluster mode- for app name:api-v2 id:1
2016-12-03-20:29:07 PM2 App name:api-v2 id:1 online
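Apparently this restart is caused by a pm2 bug: https://github.com/Unitech/pm2/issues/2492. However, if they fix that bug, pm2 will no longer restart the process and it will stay hung, so I have no choice but to stick with the old version.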
If I start the process under time and strace, I get:
real 0m45.765s
user 0m3.349s
sys 0m0.340s
www-data@a:~/$ strace -cf node /var/www/api-v2.js
Process 4020 attached
...
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.09 9.247853 4538 2038 26 futex
0.18 0.016793 1 17777 clock_gettime
0.16 0.015384 12 1262 epoll_wait
0.11 0.010522 116 91 poll
0.09 0.008339 2 5237 2437 stat
0.08 0.007856 6 1234 write
0.05 0.004309 3 1571 close
0.03 0.003150 2 1790 read
0.03 0.003150 2 1333 248 open
0.03 0.003046 11 265 mmap
0.02 0.002049 2 1186 lstat
0.02 0.001617 4 378 madvise
0.02 0.001535 2 917 fstat
0.02 0.001518 1 1773 gettimeofday
0.01 0.001096 1 1224 35 epoll_ctl
0.01 0.000983 3 329 37 connect
0.01 0.000792 1 667 329 accept4
0.01 0.000734 10 76 brk
0.01 0.000617 2 338 pread
0.00 0.000315 2 155 socket
0.00 0.000265 9 30 sendmmsg
0.00 0.000184 1 144 munmap
0.00 0.000162 1 113 mprotect
0.00 0.000125 4 35 sendto
0.00 0.000114 7 16 setsockopt
0.00 0.000078 1 60 recvfrom
0.00 0.000071 1 105 recvmsg
0.00 0.000064 2 35 writev
0.00 0.000052 7 8 clone
0.00 0.000049 2 20 20 access
0.00 0.000043 0 192 getsockname
0.00 0.000029 7 4 getdents
0.00 0.000024 1 36 bind
0.00 0.000023 23 1 readlink
0.00 0.000020 1 35 getsockopt
0.00 0.000019 19 1 execve
0.00 0.000018 0 86 9 ioctl
0.00 0.000011 2 5 rt_sigprocmask
0.00 0.000009 5 2 openat
0.00 0.000006 1 11 getcwd
0.00 0.000005 5 1 lseek
0.00 0.000005 0 35 rt_sigaction
0.00 0.000003 3 1 arch_prctl
0.00 0.000000 0 1 listen
0.00 0.000000 0 14 uname
0.00 0.000000 0 2 getrlimit
0.00 0.000000 0 2 getuid
0.00 0.000000 0 1 getgid
0.00 0.000000 0 1 geteuid
0.00 0.000000 0 1 getegid
0.00 0.000000 0 4 prctl
0.00 0.000000 0 1 setrlimit
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 clock_getres
0.00 0.000000 0 9 set_robust_list
0.00 0.000000 0 1 eventfd2
0.00 0.000000 0 1 epoll_create1
0.00 0.000000 0 2 dup3
0.00 0.000000 0 2 pipe2
------ ----------- ----------- --------- --------- ----------------
100.00 9.333037 40661 3141 total
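There are no setTimeout calls in my own code, but I suppose my dependencies have some. I have reviewed the recent changes, and none of them seems to involve recursion or a loop that never ends.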
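I observe no memory leak, i.e. the memory size reported by pm2 does not grow over time. Previously the same program ran for 2 months without a restart, under similar load. The server has more than enough CPU, RAM and swap.
The problem started after some routine Ubuntu maintenance (apt-get upgrade upgraded nodejs and mongodb, and the npm dependencies were upgraded as well). nodejs went from 4.6.1 to 4.6.2, but the problem persisted when I downgraded back to 4.6.1. I also tried 4.4.7 and 6.9.1; no version is free of the problem.
How do I debug this? Where do I start?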
Answer 0 (score: 1)
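The debugging technique in my question was the wrong one. I was debugging that way only because that is what Google returns when you search for 'nodejs 100% cpu utilization'. Those results turned out to be misleading.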
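The correct technique is to have node itself allow debugging, by starting it with node --debug=7000. At the time of high CPU utilization, run the debugging client node debug localhost:7001 and pause execution with pause. Pause and resume a few times, and you will be able to see where execution is stuck.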
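To practice the pause technique before trying it on the real application, a small stand-in script can help. The sketch below is only an illustration: the file name spin.js and the functions spinFor and handleRequest are hypothetical and not part of the original application. It keeps the CPU busy for a number of milliseconds given on the command line, so that pausing in the debugger should stop inside the busy loop.

```javascript
'use strict';
// spin.js: hypothetical stand-in for a worker stuck in CPU-bound code.
// Assumed usage, reusing the flag quoted above:
//   node --debug=7000 spin.js 60000

// Busy-wait for roughly `ms` milliseconds and return the iteration count.
function spinFor(ms) {
  const deadline = Date.now() + ms;
  let iterations = 0;
  while (Date.now() < deadline) {
    iterations++; // a debugger pause should land on or near this line
  }
  return iterations;
}

// Hypothetical caller, so that the paused stack has more than one frame.
function handleRequest(ms) {
  return spinFor(ms);
}

// Spin only when a duration is passed on the command line; default is 0 ms.
const durationMs = Number(process.argv[2]) || 0;
const iterations = handleRequest(durationMs);
console.log('spun for ' + durationMs + ' ms, ' + iterations + ' iterations');
```

While the script is spinning, pausing a few times should keep showing spinFor at the top of the stack with handleRequest below it, which is the same signal to look for in the real application.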
It turned out to be an infinite loop, i.e. for (i=10; i>=0; i++).
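To see why that loop never ends: i starts at 10 and only grows, so i>=0 stays true. JavaScript numbers are doubles, so i never wraps around to a negative value either. The sketch below shows both directions; the helper countIterations and its cap parameter are made up for this example, and the cap exists only so that the demonstration itself terminates. The i-- version is presumably what was intended.

```javascript
'use strict';

// Run `for (i = 10; i >= 0; i += step)` and count the iterations, giving up
// after `cap` iterations. The cap is a safety net for this example only.
function countIterations(step, cap) {
  let count = 0;
  for (let i = 10; i >= 0; i += step) {
    count++;
    if (count >= cap) {
      return { count: count, hitCap: true };
    }
  }
  return { count: count, hitCap: false };
}

// Direction from the answer (i++): i moves away from the exit condition,
// so the loop stops only because of the cap.
console.log(countIterations(+1, 1000000));

// Presumably intended direction (i--): i counts 10, 9, ..., 0 and exits.
console.log(countIterations(-1, 1000000));
```

The first call always runs into the cap, however large it is, while the second finishes on its own after 11 iterations.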
I am leaving the question and this answer here in case someone else runs into similar behaviour while searching for a solution.