htop from proc

Time: 2018-07-19 14:28:15

Tags: linux ubuntu process operating-system zombie-process

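I have a system with a zombie process (python3, running multiple threads). The problem is that htop gets stuck in uninterruptible disk sleep (D). strace on htop shows that this happens while it scans /proc/ZPID/task/TID/cmdline. I confirmed this manually with cat /proc/ZPID/task/TID/cmdline, which also goes into D.

So this is not specific to htop. The title mentions htop because that is where the problem was first noticed. Details follow.

htop, and any other application that reads this pseudo-file (or possibly other files in the subtree), gets stuck in state D, so stuck processes accumulate as users run htop. Simply running cat on this file in a bash loop is enough to bring the system to a halt, which can then only be resolved by pulling the plug.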
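To probe a suspect pseudo-file without hanging the calling shell or script, the read can be delegated to a child process and abandoned after a timeout. The following is a minimal sketch, not a fix: the helper name `read_with_timeout` and the 2-second default are assumptions made for illustration. A child that is already in state D cannot be killed, so every timed-out probe leaves one more stuck task behind; it should not be run in a loop.

```python
import subprocess


def read_with_timeout(path, timeout=2.0):
    """Read `path` via a child `cat`; return its bytes, or None on timeout.

    Hypothetical helper: the blocking read happens in the child, so the
    caller itself never enters the blocking read.
    """
    proc = subprocess.Popen(
        ["cat", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    try:
        out, _ = proc.communicate(timeout=timeout)
        return out
    except subprocess.TimeoutExpired:
        # SIGKILL stays pending while the child is in state D, so we send
        # it but deliberately do NOT wait() for the child afterwards.
        proc.kill()
        proc.stdout.close()
        return None
```

Note that `subprocess.run(..., timeout=...)` is avoided on purpose: after the timeout it waits for the killed child, which would block the caller if the child never exits.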

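I identified the zombie's parent process and killed the parent (which was running inside tmux) so that the zombie would be re-parented to init, which is systemd in this case. I expected the child to be reaped, but the python3 zombie process is still a zombie even though its PPID is now 1. In addition, all the htop, cat, etc. processes that read the zombie's /proc/ZPID/task/TID/cmdline are still in state D.

So there seems to be no way to get rid of the zombie other than rebooting the system.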

Questions

  1. How can I stop htop instances (or any other process that accesses these pseudo-files) from going into uninterruptible disk sleep forever?
  2. Is there any way to make init / systemd reap the zombie python3 process without rebooting?

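I can accept that nothing can be done at this point, but it would be good if someone could at least help me understand exactly why the system got into this situation.

Below is some information about the system that may help.

Debugging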

OS version

head -n2 /etc/os-release

NAME="Ubuntu"
VERSION="16.04.4 LTS (Xenial Xerus)"

Kernel version

uname -srvm

Linux 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64

I tried sending SIGCHLD to PID 1 (which it should not need). I also tried strace and pstack, but they did not help.

/proc/PID/status dump

Name:   htop
State:  D (disk sleep)
Tgid:   23395
Ngid:   0
Pid:    23395
PPid:   1
TracerPid:      0
Uid:    1005    1005    1005    1005
Gid:    1005    1005    1005    1005
FDSize: 256
Groups: 27 1005
NStgid: 23395
NSpid:  23395
NSpgid: 23395
NSsid:  23370
VmPeak:    27040 kB
VmSize:    27008 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:      4948 kB
VmRSS:      3212 kB
VmData:     1700 kB
VmStk:       132 kB
VmExe:       156 kB
VmLib:      3548 kB
VmPTE:        72 kB
VmPMD:        12 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
Threads:        1
SigQ:   20/515408
SigPnd: 0000000000000000
ShdPnd: 0000000000000003
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000008084402
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
Seccomp:        0
Cpus_allowed:   ffff,ffffffff
Cpus_allowed_list:      0-47
Mems_allowed:   00000000,00000003
Mems_allowed_list:      0-1
voluntary_ctxt_switches:        1
nonvoluntary_ctxt_switches:     1

Note that there is only 1 voluntary and 1 involuntary context switch here. So the process is going nowhere, and it has been like this for weeks.

/proc/PID/sched dump

htop (23395, #threads: 1)
-------------------------------------------------------------------
se.exec_start : 5618689963.080376
se.vruntime : 329703377.411193
se.sum_exec_runtime : 42.996258
se.statistics.sum_sleep_runtime : 0.000000
se.statistics.wait_start : 0.000000
se.statistics.sleep_start : 0.000000
se.statistics.block_start : 5618689963.080376
se.statistics.sleep_max : 0.000000
se.statistics.block_max : 0.000000
se.statistics.exec_max : 3.999987
se.statistics.slice_max : 0.000000
se.statistics.wait_max : 0.002956
se.statistics.wait_sum : 0.002956
se.statistics.wait_count : 3
se.statistics.iowait_sum : 0.000000
se.statistics.iowait_count : 0
se.nr_migrations : 2
se.statistics.nr_migrations_cold : 0
se.statistics.nr_failed_migrations_affine : 0
se.statistics.nr_failed_migrations_running : 0
se.statistics.nr_failed_migrations_hot : 0
se.statistics.nr_forced_migrations : 0
se.statistics.nr_wakeups : 0
se.statistics.nr_wakeups_sync : 0
se.statistics.nr_wakeups_migrate : 0
se.statistics.nr_wakeups_local : 0
se.statistics.nr_wakeups_remote : 0
se.statistics.nr_wakeups_affine : 0
se.statistics.nr_wakeups_affine_attempts : 0
se.statistics.nr_wakeups_passive : 0
se.statistics.nr_wakeups_idle : 0
avg_atom : 21.498129
avg_per_cpu : 21.498129
nr_switches : 2
nr_voluntary_switches : 1
nr_involuntary_switches : 1
se.load.weight : 1024
se.avg.load_sum : 47861560
se.avg.util_sum : 47860718
se.avg.load_avg : 1002
se.avg.util_avg : 1002
se.avg.last_update_time : 5618689963080376
policy : 0
prio : 120
clock-delta : 18
mm->numa_scan_seq : 0
numa_pages_migrated : 1
numa_preferred_nid : -1
total_numa_faults : 22
current_node=0, numa_group_id=0
numa_faults node=0 task_private=0 task_shared=0 group_private=0 group_shared=0
numa_faults node=1 task_private=0 task_shared=0 group_private=0 group_shared=0
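The State and context-switch fields of a status dump can also be collected programmatically, which helps when counting how many readers have piled up in state D. The sketch below reads only the status files, never cmdline. The function names and the `proc_root` parameter are assumptions introduced for illustration (the parameter exists so the code can be exercised against a fake directory tree).

```python
import os


def read_status(pid, proc_root="/proc"):
    """Parse <proc_root>/<pid>/status into a dict of field name -> value."""
    fields = {}
    with open(os.path.join(proc_root, str(pid), "status")) as fh:
        for line in fh:
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
    return fields


def tasks_in_state(letter, proc_root="/proc"):
    """Return (pid, name, voluntary, nonvoluntary) for processes whose
    State field starts with `letter`, e.g. "D" or "Z"."""
    found = []
    pids = sorted(int(e) for e in os.listdir(proc_root) if e.isdigit())
    for pid in pids:
        try:
            st = read_status(pid, proc_root)
        except OSError:
            continue  # process exited while we were scanning
        if st.get("State", "").startswith(letter):
            found.append((
                pid,
                st.get("Name"),
                int(st.get("voluntary_ctxt_switches", "0")),
                int(st.get("nonvoluntary_ctxt_switches", "0")),
            ))
    return found
```

Calling `tasks_in_state("D")` on the affected machine would be expected to list the stuck htop and cat processes, assuming their status files remain readable as they did in the dump above.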

/proc/PID/syscall

0 0x5 0x7ffd47db2970 0x1000 0x5 0x0 0x1e 0x7ffd47db28a8 0x7f836a517260

Syscall 0 translates to __NR_read when I search with grep -e "0$" /usr/include/x86_64-linux-gnu/asm/unistd_64.h

/proc/PID/stack dump

[<ffffffff8140e3c4>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff81284f6e>] proc_pid_cmdline_read+0xae/0x530
[<ffffffff8121380b>] __vfs_read+0x1b/0x40
[<ffffffff81213de6>] vfs_read+0x86/0x130
[<ffffffff81214b35>] SyS_read+0x55/0xc0
[<ffffffff8184efc8>] entry_SYSCALL_64_fastpath+0x1c/0xbb
[<ffffffffffffffff>] 0xffffffffffffffff

So execution is currently in call_rwsem_down_read_failed. From the code, it appears to be blocked on a semaphore(?) (Ref: https://github.com/torvalds/linux/blob/v4.16/arch/x86/lib/rwsem.S#L89 and https://lkml.org/lkml/2010/11/23/272).

There is one semaphore in use on the system (according to ipcs), and it belongs to /sbin/iscsid.
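The /proc/PID/syscall line shown earlier can be decoded more precisely than with a grep for a trailing 0, which also matches syscall numbers such as 10 or 20. The sketch below assumes the layout described in proc(5): the syscall number, six argument registers, then the stack pointer and program counter, or just `-1 sp pc` when the task is blocked outside a syscall. The function names are made up for this example.

```python
import re


def parse_unistd(header_text):
    """Build {number: name} from '#define __NR_name N' lines."""
    table = {}
    pattern = r"^#define\s+(__NR_\w+)\s+(\d+)\s*$"
    for m in re.finditer(pattern, header_text, re.MULTILINE):
        table[int(m.group(2))] = m.group(1)
    return table


def parse_syscall_line(line):
    """Decode one /proc/PID/syscall line; returns None for 'running'."""
    parts = line.split()
    if parts == ["running"]:
        return None
    number = int(parts[0])
    values = [int(p, 16) for p in parts[1:]]
    if number == -1:
        # Blocked, but not inside a syscall: only sp and pc are reported.
        return {"number": -1, "args": [], "sp": values[0], "pc": values[1]}
    return {"number": number, "args": values[:6],
            "sp": values[6], "pc": values[7]}
```

Applied to the line above, the number is 0 (`__NR_read`) and the first three arguments read as file descriptor 5, a user buffer address, and a count of 0x1000 bytes, which is consistent with a `read` on the cmdline file.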

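Update

I tried listing the threads of the zombie process with ps -Teo pid,ppid,state,comm,args $ZOMBIEPID, and it does not block. strace shows that ps never accesses /proc/ZPID/task/TID/cmdline. I think ps only gets the TIDs from this directory and reads the cmdline contents from /proc/TID/cmdline. Also, the other pseudo-files in /proc/ZPID/task/TID/ are readable.

I used a script to read the syscall file of each TID of the zombie process. I think this may explain more.

Below, 44529 is the zombie process and the rest are its tasks. All tasks from 45112 onward are stuck in syscall 202, which is __NR_futex, so they are waiting on a userspace mutex. Note that the python3 instance was running multi-threaded code that uses Python's multiprocessing library. The main thread's syscall value is -1 (as is that of 45111).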

tid: 44529, syscall: -1 0x7ffca68c6e68 0x7f74ca51914b
tid: 45111, syscall: -1 0x7f7468ffc938 0x7f74c9309d30
tid: 45112, syscall: 202 0x7f74c886133c 0x80 0x5 0x0 0x7f74c8861300 0x2 0x7f746b7fdee0 0x7f74ca791360
tid: 45113, syscall: 202 0x7f74c88613bc 0x80 0x5 0x0 0x7f74c8861300 0x2 0x7f74adffeee0 0x7f74ca791360
tid: 45114, syscall: 202 0x7f74c886143c 0x80 0x5 0x0 0x7f74c8861400 0x2 0x7f74b07ffee0 0x7f74ca791360
tid: 45115, syscall: 202 0x7f74c88614bc 0x80 0x5 0x0 0x7f74c8861400 0x2 0x7f74c5d9dee0 0x7f74ca791360
tid: 45116, syscall: 202 0x7f74c886153c 0x80 0x5 0x0 0x7f74c8861500 0x2 0x7f74c559cee0 0x7f74ca791360
tid: 45117, syscall: 202 0x7f74c88615bc 0x80 0x5 0x0 0x7f74c8861500 0x2 0x7f74c4b58ee0 0x7f74ca791360
tid: 45118, syscall: 202 0x7f74c886163c 0x80 0x5 0x0 0x7f74c8861600 0x2 0x7f74c001eee0 0x7f74ca791360
tid: 45119, syscall: 202 0x7f74c88616bc 0x80 0x5 0x0 0x7f74c8861600 0x2 0x7f74bf81dee0 0x7f74ca791360
tid: 45120, syscall: 202 0x7f74c886173c 0x80 0x5 0x0 0x7f74c8861700 0x2 0x7f74bf01cee0 0x7f74ca791360
tid: 45121, syscall: 202 0x7f74c88617bc 0x80 0x5 0x0 0x7f74c8861700 0x2 0x7f74be81bee0 0x7f74ca791360
tid: 45122, syscall: 202 0x7f74c886183c 0x80 0x5 0x0 0x7f74c8861800 0x2 0x7f74be01aee0 0x7f74ca791360
tid: 45123, syscall: 202 0x7f74c88618bc 0x80 0x5 0x0 0x7f74c8861800 0x2 0x7f74bd15cee0 0x7f74ca791360
tid: 45124, syscall: 202 0x7f74c886193c 0x80 0x5 0x0 0x7f74c8861900 0x2 0x7f74bc0dcee0 0x7f74ca791360
tid: 45125, syscall: 202 0x7f74c88619bc 0x80 0x5 0x0 0x7f74c8861900 0x2 0x7f74bb15cee0 0x7f74ca791360
tid: 45126, syscall: 202 0x7f74c8861a3c 0x80 0x5 0x0 0x7f74c8861a00 0x2 0x7f74ba95bee0 0x7f74ca791360
tid: 45127, syscall: 202 0x7f74c8861abc 0x80 0x5 0x0 0x7f74c8861a00 0x2 0x7f74ba15aee0 0x7f74ca791360
tid: 45128, syscall: 202 0x7f74c8861b3c 0x80 0x5 0x0 0x7f74c8861b00 0x2 0x7f74b9959ee0 0x7f74ca791360
tid: 45129, syscall: 202 0x7f74c8861bbc 0x80 0x5 0x0 0x7f74c8861b00 0x2 0x7f74b9158ee0 0x7f74ca791360
tid: 45130, syscall: 202 0x7f74c8861c3c 0x80 0x5 0x0 0x7f74c8861c00 0x2 0x7f74b8957ee0 0x7f74ca791360
tid: 45131, syscall: 202 0x7f74c8861cbc 0x80 0x5 0x0 0x7f74c8861c00 0x2 0x7f74b8156ee0 0x7f74ca791360
tid: 45132, syscall: 202 0x7f74c8861d3c 0x80 0x5 0x0 0x7f74c8861d00 0x2 0x7f74b7955ee0 0x7f74ca791360
tid: 45133, syscall: 202 0x7f74c8861dbc 0x80 0x5 0x0 0x7f74c8861d00 0x2 0x7f74b7154ee0 0x7f74ca791360
tid: 45134, syscall: 202 0x7f74c8861e3c 0x80 0x5 0x0 0x7f74c8861e00 0x2 0x7f74b6953ee0 0x7f74ca791360
tid: 45135, syscall: 202 0x7f74c8861ebc 0x80 0x5 0x0 0x7f74c8861e00 0x2 0x7f74b6152ee0 0x7f74ca791360
tid: 45136, syscall: 202 0x7f74c8861f3c 0x80 0x5 0x0 0x7f74c8861f00 0x2 0x7f74b5951ee0 0x7f74ca791360
tid: 45137, syscall: 202 0x7f74c8861fbc 0x80 0x5 0x0 0x7f74c8861f00 0x2 0x7f74b5150ee0 0x7f74ca791360
tid: 45138, syscall: 202 0x7f74c886203c 0x80 0x5 0x0 0x7f74c8862000 0x2 0x7f74b494fee0 0x7f74ca791360
tid: 45139, syscall: 202 0x7f74c88620bc 0x80 0x5 0x0 0x7f74c8862000 0x2 0x7f74b414eee0 0x7f74ca791360
tid: 45140, syscall: 202 0x7f74c886213c 0x80 0x5 0x0 0x7f74c8862100 0x2 0x7f74b394dee0 0x7f74ca791360
tid: 45141, syscall: 202 0x7f74c88621bc 0x80 0x5 0x0 0x7f74c8862100 0x2 0x7f74b314cee0 0x7f74ca791360
tid: 45142, syscall: 202 0x7f74c886223c 0x80 0x5 0x0 0x7f74c8862200 0x2 0x7f74b294bee0 0x7f74ca791360
tid: 45143, syscall: 202 0x7f74c88622bc 0x80 0x5 0x0 0x7f74c8862200 0x2 0x7f74b214aee0 0x7f74ca791360
tid: 45144, syscall: 202 0x7f74c886233c 0x80 0x5 0x0 0x7f74c8862300 0x2 0x7f74b1949ee0 0x7f74ca791360
tid: 45145, syscall: 202 0x7f74c88623bc 0x80 0x5 0x0 0x7f74c8862300 0x2 0x7f74b1148ee0 0x7f74ca791360
tid: 45146, syscall: 202 0x7f74c886243c 0x80 0x5 0x0 0x7f74c8862400 0x2 0x7f7454d22ee0 0x7f74ca791360
tid: 45147, syscall: 202 0x7f74c88624bc 0x80 0x5 0x0 0x7f74c8862400 0x2 0x7f7454521ee0 0x7f74ca791360
tid: 45148, syscall: 202 0x7f74c886253c 0x80 0x5 0x0 0x7f74c8862500 0x2 0x7f7453d20ee0 0x7f74ca791360
tid: 45149, syscall: 202 0x7f74c88625bc 0x80 0x5 0x0 0x7f74c8862500 0x2 0x7f745351fee0 0x7f74ca791360
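A listing like the one above could be produced by a script along the following lines. This is only a sketch of one possible approach, not the script that was actually used; the function names and the `proc_root` parameter are assumptions for illustration. It reads each task's syscall file and never touches cmdline.

```python
import os
from collections import Counter


def task_syscalls(pid, proc_root="/proc"):
    """Return [(tid, contents of task/<tid>/syscall)] for every task of pid."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    rows = []
    for tid in sorted(os.listdir(task_dir), key=int):
        try:
            with open(os.path.join(task_dir, tid, "syscall")) as fh:
                rows.append((int(tid), fh.read().strip()))
        except OSError as exc:
            # Typically a permission error or a task that has just exited.
            rows.append((int(tid), "unreadable: %s" % exc.strerror))
    return rows


def summarize(rows):
    """Count tasks by the first field of the line (the syscall number)."""
    return Counter(text.split()[0] for _, text in rows if text)
```

Printing each pair as `"tid: %d, syscall: %s" % (tid, text)` gives the format used above, and `summarize` makes it easy to see how many tasks sit in syscall 202 versus -1.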

0 answers:

No answers