We have a Hadoop cluster with DataNode machines.
We noticed that the CPU load average on the DataNode machines is high:
uptime
17:27:46 up 263 days, 3:39, 3 users, load average: 7.94, 6.66, 7.38
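A load average only means something relative to the number of CPU cores, which the output above does not show. A minimal sketch for normalizing it, assuming Linux (`/proc/loadavg`) and GNU coreutils (`nproc`); the function name `load_per_core` and the 16-core figure are hypothetical:

```shell
# Print the 1-minute load average divided by the number of CPU cores.
# Both arguments are optional so the arithmetic can be checked with fixed values.
load_per_core() {
    load=${1:-$(cut -d' ' -f1 /proc/loadavg)}   # Linux-specific source
    cores=${2:-$(nproc)}                         # nproc comes from GNU coreutils
    awk -v l="$load" -v c="$cores" 'BEGIN { printf "%.2f\n", l / c }'
}

# Example with the 1-minute value from above and a hypothetical 16-core node
load_per_core 7.94 16
```

A result well below 1.00 would suggest the cores are not saturated, so the absolute number by itself may not indicate a problem.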
After a quick check, we noticed that many deleted files are still held open (according to lsof).
For example:
[root@DATANODE02 ~]# lsof +L1
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
avahi-dae 1938 avahi 5r REG 253,2 10406312 0 402658715 /var/lib/sss/mc/initgroups (deleted)
avahi-dae 1949 avahi 5r REG 253,2 10406312 0 402658715 /var/lib/sss/mc/initgroups (deleted)
sssd 1990 root 17r REG 253,2 10406312 0 402658715 /var/lib/sss/mc/initgroups (deleted)
sssd_be 1996 root 20r REG 253,2 10406312 0 402658715 /var/lib/sss/mc/initgroups (deleted)
cupsd 2269 root 10r REG 253,0 3024 0 139474724 /etc/passwd+ (deleted)
smcd 12588 root 15u REG 253,0 41590 0 13826415 /tmp/tmpfHHZRQO (deleted)
bluetooth 138025 root 9r FIFO 253,0 0t0 0 844091 /tmp/hogsuspend (deleted)
gnome-she 138037 root 20r REG 253,0 56 0 68959031 /root/.local/share/gvfs-metadata/home.55Q9UZ (deleted)
gnome-she 138037 root 24r REG 253,0 32768 0 70246314 /root/.local/share/gvfs-metadata/home-a9398246.log (deleted)
java 193699 yarn 1082r REG 8,16 293715 0 93588652 /grid/sdb/hadoop/hdfs/data/current/BP-428352611-43.21.3.46-1502127526112/current/finalized/subdir15/subdir35/blk_1186014185 (deleted)
java 193699 yarn 1191r REG 8,80 292993 0 88474445 /grid/sdf/hadoop/hdfs/data/current/BP-428352611-43.21.3.46-1502127526112/current/finalized/subdir15/subdir35/blk_1186014091 (deleted)
java 193699 yarn 1205r REG 8,16 2303 0 93588671 /grid/sdb/hadoop/hdfs/data/current/BP-428352611-43.21.3.46-1502127526112/current/finalized/subdir15/subdir35/blk_1186014185_112276263.meta (deleted)
java 193699 yarn 1265r REG 8,32 23931 0 25962378 /grid/sdc/hadoop/hdfs/data/current/BP-428352611-43.21.3.46-1502127526112/current/finalized/subdir15/subdir36/blk_1186014275 (deleted)
java 193699 yarn 1273r REG 8,32 195 0 25962397 /grid/sdc/hadoop/hdfs/data/current/BP-428352611-43.21.3.46-1502127526112/current/finalized/subdir15/subdir36/blk_1186014275_112276353.meta (deleted)
java 193699 yarn 1307r REG 8,48 66713 0 61461179 /grid/sdd/hadoop/hdfs/data/current/BP-428352611-43.21.3.46-1502127526112/current/finalized/subdir15/subdir36/blk_1186014410 (deleted)
java 193699 yarn 1385r REG 8,48 531 0 61461193 /grid/sdd/hadoop/hdfs/data/current/BP-428352611-43.21.3.46-1502127526112/current/finalized/subdir15/subdir36/blk_1186014410_112276488.meta (deleted)
java 193699 yarn 1477r REG 8,80 2299 0 88474446 /grid/sdf/hadoop/hdfs/data/current/BP-428352611-43.21.3.46-1502127526112/current/finalized/subdir15/subdir35/blk_1186014091_112276169.meta (deleted)
java 193699 yarn 1754r REG 8,16 91051 0 93696129 /grid/sdb/hadoop/hdfs/data/current/BP-428352611-43.21.3.46-1502127526112/current/finalized/subdir15/subdir37/blk_1186014689 (deleted)
java 193699 yarn 1760r REG 8,16 719 0 93696130 /grid/sdb/hadoop/hdfs/data/current/BP-428352611-43.21.3.46-1502127526112/current/finalized/subdir15/subdir37/blk_1186014689_112276769.meta (deleted)
java 193699 yarn 1972r REG 8,48 37960 0 61447490 /grid/sdd/hadoop/hdfs/data/current/BP-428352611-43.21.3.46-1502127526112/current/finalized/subdir15/subdir39/blk_1186015148 (deleted)
java 193699 yarn 1976r REG 8,48 307 0 61447491 /grid/sdd/hadoop/hdfs/data/current/BP-428352611-43.21.3.46-1502127526112/current/finalized/subdir15/subdir39/blk_1186015148_112277228.meta (deleted)
To print only the PIDs that hold deleted files:
lsof +L1 | awk 'NR > 1 {print $2}' | sort -u
12588
138025
138037
138151
138185
1938
1949
1990
1996
2269
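A bare PID list does not show which process holds the most deleted files or how much space they occupy. A rough sketch that groups `lsof +L1` output per PID, assuming the default column layout shown earlier (PID in column 2, SIZE/OFF in column 7); `summarize_deleted` is a hypothetical helper name:

```shell
# Summarise `lsof +L1` output read from stdin: one line per PID with the
# command name, the number of deleted-but-open files, and their total size.
summarize_deleted() {
    awk 'NR > 1 {
            count[$2]++          # $2 is the PID column
            bytes[$2] += $7      # $7 is SIZE/OFF; offsets such as "0t0" add 0
            cmd[$2] = $1
         }
         END {
            for (pid in count)
                printf "%s %s %d %.0f\n", pid, cmd[pid], count[pid], bytes[pid]
         }' | sort -k3,3nr       # PIDs holding the most files first
}

# Usage (needs lsof; run as root to see every process):
# lsof +L1 | summarize_deleted
```

Each output line reads `PID COMMAND FILES BYTES`.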
None of the files listed above exist on disk any more,
for example:
/grid/sdd/hadoop/hdfs/data/current/BP-428352611-43.21.3.46-1502127526112/current/finalized/subdir15/subdir39/blk_1186015148_112277228.meta
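On Linux, an unlinked file stays allocated until the last descriptor that refers to it is closed, and `/proc` marks such descriptors with a ` (deleted)` suffix. A small sketch for cross-checking one process without lsof, assuming a Linux `/proc` filesystem; `deleted_fds` is a hypothetical helper name:

```shell
# List the file descriptors of a process whose target has been unlinked.
# Relies on Linux /proc: readlink on /proc/PID/fd/N ends in " (deleted)".
deleted_fds() {
    pid=$1
    for fd in /proc/"$pid"/fd/*; do
        # Descriptors can close while we iterate, so skip unreadable links
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *' (deleted)') printf '%s %s\n' "${fd##*/}" "$target" ;;
        esac
    done
}

# Usage (root is needed to inspect other users' processes):
# deleted_fds 193699
```

Each output line is the descriptor number followed by the original path.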
so we killed all of those PIDs,
for example:
kill 12588
kill 138025
and so on.
After we killed all of the PIDs, the CPU load average dropped:
uptime
17:27:46 up 263 days, 3:39, 3 users, load average: 2.24, 4.61, 5.75
My questions are:
What causes a process (PID) to keep a file open even though the file has been deleted?
Is it OK to kill those PIDs with kill PID?