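I've noticed that all of my Hadoop datanodes usually have around 10-15 Java processes owned by the "mapred" user that hang around for up to several days at a time.
Only two of these processes are actively doing work. The rest appear to have completed their MapReduce tasks successfully long ago, but they never exited.
This is worrying because, although these processes have finished their work, they still hold on to valuable memory allocations as well as map/reduce task slots. (On some servers they add up to 58 GB of VSZ and 30+ GB of RSS.)
Here is an example of one such process (slightly shortened). It is currently evening (MST), and you can see that the process started long before that, at 12:01: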
$ ps aux | grep mapred
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
mapred 2915 0.2 0.1 1749504 365004 ? Ssl 12:01 0:09 /usr/java/jdk1.7.0_00/jre/bin/java -Djava.library.path=/usr/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64:/disk6/mapreduce/tmp-map-data/taskTracker/mapred/jobcache/job_201902042217_28716/attempt_201902042217_28716_m_000000_0/work -Xmx1024m -Djava.io.tmpdir=/disk6/mapreduce/tmp-map-data/taskTracker/mapred/jobcache/job_201902042217_28716/attempt_201902042217_28716_m_000000_0/work/tmp
..
TLA -Dhadoop.tasklog.taskid=attempt_201902042217_28716_m_000000_0 -Dhadoop.tasklog.iscleanup=false -Dhadoop.tasklog.totalLogFileSize=0 org.apache.hadoop.mapred.Child 127.0.0.1 47071 attempt_201902042217_28716_m_000000_0 /var/log/hadoop-0.20-mapreduce/userlogs/job_201902042217_28716/attempt_201902042217_28716_m_000000_0 -1455485471
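To put a number on how much memory the lingering processes hold, the ps output can be totalled per user. This is a minimal sketch, assuming the standard `ps aux` column order (VSZ in column 5 and RSS in column 6, both in KiB). The function name `sum_mem_kib` is made up for illustration.

```shell
# Sum VSZ and RSS (in KiB) for one user's processes from `ps aux`-style input on stdin.
# sum_mem_kib is a hypothetical helper name, not part of Hadoop or procps.
sum_mem_kib() {
    awk -v user="$1" '
        $1 == user { vsz += $5; rss += $6; n++ }
        END { printf "%d procs, VSZ %d KiB, RSS %d KiB\n", n, vsz, rss }
    '
}

# Usage:
#   ps aux | sum_mem_kib mapred
```

Matching on the USER column rather than piping through `grep mapred` avoids counting the grep process itself or unrelated processes that merely mention "mapred" on their command line.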
Here is the summary from the JobTracker, which clearly shows that the above job and all of its tasks completed:
Hadoop Job 28716 on History Viewer
User: mapred
JobName: oozie:launcher:T=java:W=<censored>:A=checkStatus:ID=0007877-190202224243941-oozie-oozi-W
JobConf: hdfs://PRODcluster/tmp/hadoop-mapred/mapred/staging/mapred/.staging/job_201902042217_28716/job.xml
Job-ACLs: All users are allowed
Submitted At: 7-Feb-2019 12:01:04
Launched At: 7-Feb-2019 12:01:04 (0sec)
Finished At: 7-Feb-2019 12:01:16 (12sec)
Status: SUCCESS
Analyse This Job
Kind Total Tasks(successful+failed+killed) Successful tasks Failed tasks Killed tasks Start Time Finish Time
Setup 1 1 0 0 7-Feb-2019 12:01:04 7-Feb-2019 12:01:06 (1sec)
Map 1 1 0 0 7-Feb-2019 12:01:08 7-Feb-2019 12:01:14 (5sec)
Reduce 0 0 0 0
Cleanup 1 1 0 0 7-Feb-2019 12:01:14 7-Feb-2019 12:01:16 (1sec)
Here is the tail of the task log, showing a clean completion:
$ sudo ls -lh /var/log/hadoop-0.20-mapreduce/userlogs/job_201902042217_28716/attempt_201902042217_28716_m_000000_0/syslog
-rw-r--r-- 1 mapred mapred 22K Feb 7 12:01 /var/log/hadoop-0.20-mapreduce/userlogs/job_201902042217_28716/attempt_201902042217_28716_m_000000_0/syslog
Tail of the log:
2019-02-07 12:01:12,983 INFO org.apache.hadoop.mapred.Task: Task:attempt_201902042217_28716_m_000000_0 is done. And is in the process of commiting
2019-02-07 12:01:14,105 INFO org.apache.hadoop.mapred.Task: Task attempt_201902042217_28716_m_000000_0 is allowed to commit now
2019-02-07 12:01:14,128 INFO org.apache.hadoop.mapred.FileOutputCommitter: Saved output of task 'attempt_201902042217_28716_m_000000_0' to hdfs://PRODcluster:8020/user/mapred/oozie-oozi/0007877-190202224243941-oozie-oozi-W/checkStatus--java/output
2019-02-07 12:01:14,131 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201902042217_28716_m_000000_0' done.
2019-02-07 12:01:14,133 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
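One way to confirm which live processes belong to attempts that have already finished is to look for the `done.` line shown above in each attempt's syslog. This is a rough sketch, assuming the log layout from the listing above (`<userlogs>/<job id>/<attempt id>/syslog`) and the message format from this log. `attempt_finished` is a hypothetical helper name.

```shell
# Return 0 if the given task syslog contains the final "Task '<attempt id>' done." line.
# attempt_finished is a hypothetical helper; the pattern is based on the log excerpt above.
attempt_finished() {
    grep -Eq "org\.apache\.hadoop\.mapred\.Task: Task '?attempt_[0-9]+_[0-9]+_[mr]_[0-9]+_[0-9]+'? done\." "$1"
}

# Usage (may need to run as root or mapred, depending on directory permissions):
#   log=/var/log/hadoop-0.20-mapreduce/userlogs/job_201902042217_28716/attempt_201902042217_28716_m_000000_0/syslog
#   attempt_finished "$log" && echo "attempt reported done"
```

The pattern deliberately does not match the earlier "is done. And is in the process of commiting" line, since that is logged before the commit has completed.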
If I strace the process, I see FUTEX_WAIT timeouts:
$ sudo strace -p 301 -f
[pid 339] futex(0x7f5cac0d7c54, FUTEX_WAIT_BITSET_PRIVATE, 1, {11581718, 780361675}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 339] futex(0x7f5cac0d7c28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 339] futex(0x7f5cac0d7c54, FUTEX_WAIT_BITSET_PRIVATE, 1, {11581718, 830757941}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 339] futex(0x7f5cac0d7c28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 339] futex(0x7f5cac0d7c54, FUTEX_WAIT_BITSET_PRIVATE, 1, {11581718, 881088118}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 339] futex(0x7f5cac0d7c28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 339] futex(0x7f5cac0d7c54, FUTEX_WAIT_BITSET_PRIVATE, 1, {11581718, 931488956}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 339] futex(0x7f5cac0d7c28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 339] futex(0x7f5cac0d7c54, FUTEX_WAIT_BITSET_PRIVATE, 1, {11581718, 981908072}, ffffffff <unfinished ...>
[pid 329] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 329] futex(0x7f5cac09cc28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 329] futex(0x7f5cac09cc54, FUTEX_WAIT_BITSET_PRIVATE, 1, {11581719, 976658613}, ffffffff <unfinished ...>
[pid 339] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 339] futex(0x7f5cac0d7c28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 339] futex(0x7f5cac0d7c54, FUTEX_WAIT_BITSET_PRIVATE, 1, {11581719, 32166788}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 339] futex(0x7f5cac0d7c28, FUTEX_WAKE_PRIVATE, 1) = 0
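The strace output only shows threads repeatedly waking from timed futex waits; it does not show which thread is keeping the JVM alive. A thread dump may help narrow that down: a JVM stays up as long as any non-daemon thread is alive, and `jstack` prints each thread's native id as a hex `nid`, which on Linux corresponds to the thread ids strace prints in decimal. This is a minimal sketch, assuming `jstack` from the same JDK is available and is run as the process owner. The helper names are hypothetical.

```shell
# Convert a decimal thread id, as printed by strace, to the nid=0x... form used in jstack output.
tid_to_nid() {
    printf 'nid=0x%x\n' "$1"
}

# Keep only the header lines of threads not marked as daemon, from a jstack dump on stdin.
# Assumes HotSpot's header format: "name" [daemon] prio=... tid=... nid=...
non_daemon_threads() {
    grep '^"' | grep -v '" daemon '
}

# Usage (<pid> is the hung Child process):
#   sudo -u mapred jstack <pid> > dump.txt
#   non_daemon_threads < dump.txt
#   grep "$(tid_to_nid 339)" dump.txt   # 339 is a thread id from the strace output above
```

JVM-internal threads such as "VM Thread" and the GC threads also appear without the daemon marker and can be ignored; the interesting entries are application threads that are still alive after the task reported done.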
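These jobs come in through coordinated Oozie workflows. I am running Hadoop 2.0.0-cdh4.3.0 with 2 namenodes and 3 Zookeeper nodes. One of the Zookeeper nodes has been down for a few months, but I'm not sure whether that is related at all. Please let me know if there is anything else I should include here.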