Question

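I am dealing with a strange 100% CPU usage problem in a nodejs application. The application is very large and I am not sure what is causing it. The application is managed by pm2 in cluster_mode.

All I know is what strace prints while CPU usage is high: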

root@a:/# strace -p 4350 -c
Process 4350 attached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000031           0      3388           clock_gettime
  0.00    0.000000           0         1           read
  0.00    0.000000           0         2           write
  0.00    0.000000           0         1           rt_sigreturn
------ ----------- ----------- --------- --------- ----------------
100.00    0.000031                  3392           total

root@a:~# strace -p 3367 -r -c
Process 3367 attached
^CProcess 3367 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 75.00    0.000939           0     91973           gettimeofday
 25.00    0.000313           0     39417           clock_gettime
------ ----------- ----------- --------- --------- ----------------
100.00    0.001252                131390           total

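At this point the whole application is unresponsive. After about 5 minutes, pm2 kicks in and restarts the process because its reported memory is zero: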

2016-12-03-20:29:05 PM2 [PM2][WORKER] Process 1 restarted because it uses 0 memory and has ONLINE status
2016-12-03-20:29:05 PM2 Stopping app:api-v2 id:1
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still alive after 1600ms, sending it SIGKILL now...
2016-12-03-20:29:07 PM2 App name:api-v2 id:1 disconnected
2016-12-03-20:29:07 PM2 App [api-v2] with id [1] and pid [3367], exited with code [0] via signal [SIGKILL]
2016-12-03-20:29:07 PM2 Starting execution sequence in -cluster mode- for app name:api-v2 id:1
2016-12-03-20:29:07 PM2 App name:api-v2 id:1 online

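Apparently this restart is caused by a pm2 bug: https://github.com/Unitech/pm2/issues/2492. However, if they fix that bug, pm2 will no longer restart the process and it will simply stay hung, so I have no choice but to stick with the old version.

If I start the process under time and strace, I get: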

real    0m45.765s
user    0m3.349s
sys 0m0.340s
www-data@a:~/$ strace -cf node /var/www/api-v2.js
Process 4020 attached
...
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.09    9.247853        4538      2038        26 futex
  0.18    0.016793           1     17777           clock_gettime
  0.16    0.015384          12      1262           epoll_wait
  0.11    0.010522         116        91           poll
  0.09    0.008339           2      5237      2437 stat
  0.08    0.007856           6      1234           write
  0.05    0.004309           3      1571           close
  0.03    0.003150           2      1790           read
  0.03    0.003150           2      1333       248 open
  0.03    0.003046          11       265           mmap
  0.02    0.002049           2      1186           lstat
  0.02    0.001617           4       378           madvise
  0.02    0.001535           2       917           fstat
  0.02    0.001518           1      1773           gettimeofday
  0.01    0.001096           1      1224        35 epoll_ctl
  0.01    0.000983           3       329        37 connect
  0.01    0.000792           1       667       329 accept4
  0.01    0.000734          10        76           brk
  0.01    0.000617           2       338           pread
  0.00    0.000315           2       155           socket
  0.00    0.000265           9        30           sendmmsg
  0.00    0.000184           1       144           munmap
  0.00    0.000162           1       113           mprotect
  0.00    0.000125           4        35           sendto
  0.00    0.000114           7        16           setsockopt
  0.00    0.000078           1        60           recvfrom
  0.00    0.000071           1       105           recvmsg
  0.00    0.000064           2        35           writev
  0.00    0.000052           7         8           clone
  0.00    0.000049           2        20        20 access
  0.00    0.000043           0       192           getsockname
  0.00    0.000029           7         4           getdents
  0.00    0.000024           1        36           bind
  0.00    0.000023          23         1           readlink
  0.00    0.000020           1        35           getsockopt
  0.00    0.000019          19         1           execve
  0.00    0.000018           0        86         9 ioctl
  0.00    0.000011           2         5           rt_sigprocmask
  0.00    0.000009           5         2           openat
  0.00    0.000006           1        11           getcwd
  0.00    0.000005           5         1           lseek
  0.00    0.000005           0        35           rt_sigaction
  0.00    0.000003           3         1           arch_prctl
  0.00    0.000000           0         1           listen
  0.00    0.000000           0        14           uname
  0.00    0.000000           0         2           getrlimit
  0.00    0.000000           0         2           getuid
  0.00    0.000000           0         1           getgid
  0.00    0.000000           0         1           geteuid
  0.00    0.000000           0         1           getegid
  0.00    0.000000           0         4           prctl
  0.00    0.000000           0         1           setrlimit
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           clock_getres
  0.00    0.000000           0         9           set_robust_list
  0.00    0.000000           0         1           eventfd2
  0.00    0.000000           0         1           epoll_create1
  0.00    0.000000           0         2           dup3
  0.00    0.000000           0         2           pipe2
------ ----------- ----------- --------- --------- ----------------
100.00    9.333037                 40661      3141 total

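I have no setTimeout calls in my own code, but I suppose my dependencies may have some. I have reviewed my recent changes, and they do not appear to involve recursive calls or loops that never end.

I observe no memory leak, i.e. the memory size reported by pm2 does not grow over time. The same program previously ran for 2 months without a restart under similar load. The server has plenty of spare CPU, RAM and swap.

The problem started after some routine maintenance on Ubuntu (an apt-get upgrade that upgraded nodejs and mongodb, plus an upgrade of the npm dependencies). nodejs went from 4.6.1 to 4.6.2, but the problem persisted after I downgraded back to 4.6.1. I also tried 4.4.7 and 6.9.1; none of these versions is free of the problem.

How do I debug this? Where do I start?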

Answer 1

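The debugging technique in my question was wrong. I went about it that way only because that is what Google returns when you search for 'nodejs 100% cpu utilization'. Those results turned out to be misleading.

The right technique is to have node itself allow debugging, by starting it with node --debug=7000. When CPU usage is high, run the debug client with node debug localhost:7001 and pause execution with pause. Pause and resume a few times and you will be able to tell where execution is stuck.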
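Pausing works because node runs JavaScript on a single thread: while a synchronous loop is spinning, timers and incoming requests never get a turn, so a paused debugger lands inside the offending code. Below is a minimal sketch of that blocking behaviour. The `spin` helper and the 50 ms duration are hypothetical stand-ins for the runaway code, not something taken from the application in the question.

```javascript
// A timer scheduled for "as soon as possible".
let timerFired = false;
setTimeout(() => { timerFired = true; }, 0);

// Hypothetical stand-in for runaway synchronous code:
// keeps the CPU busy for roughly `ms` milliseconds.
function spin(ms) {
  const end = Date.now() + ms;
  let iterations = 0;
  while (Date.now() < end) {
    iterations++;
  }
  return iterations;
}

spin(50);

// The event loop had no chance to run the timer while spin() was busy,
// so the callback has not fired yet even though 50 ms have passed.
const firedDuringSpin = timerFired;
console.log('timer fired while spinning:', firedDuringSpin);
```

With a loop that never ends, the timer callback (and every request handler) would never run at all, which would match the unresponsive application described in the question.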

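It turned out to be an infinite loop, of the form for (i=10; i>=0; i++).

I am leaving the question and this answer here in case someone else runs into similar behaviour while searching for a solution.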
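To illustrate why that loop never ends: `i` starts at 10 and only grows, so `i >= 0` stays true. The sketch below compares the buggy step with a decrementing one. `countIterations` and its `cap` parameter are hypothetical, and the cap exists only so that the demonstration itself terminates.

```javascript
// Runs `for (i = 10; i >= 0; i += step)` and reports how it ended.
// `cap` is an artificial safety limit, not part of the original loop.
function countIterations(step, cap) {
  let iterations = 0;
  for (let i = 10; i >= 0; i += step) {
    iterations++;
    if (iterations >= cap) {
      // Gave up: the loop condition was still true.
      return { iterations, terminated: false };
    }
  }
  return { iterations, terminated: true };
}

// i++ : the counter moves away from the exit condition.
const buggy = countIterations(+1, 1000000);
// i-- : the counter runs 10, 9, ..., 0 and then the loop exits.
const fixed = countIterations(-1, 1000000);

console.log('i++ terminated on its own:', buggy.terminated);
console.log('i-- terminated on its own:', fixed.terminated, 'after', fixed.iterations, 'iterations');
```

Without the cap, the `i++` variant keeps a CPU core fully busy, which is consistent with the 100% usage described above.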

Nodejs 100% CPU usage due to clock_gettime / gettimeofday / futex?
