Why don't the mapred Java processes exit after successfully completing their tasks (Hadoop)?

Asked: 2019-02-07 19:59:30

Tags: java hadoop mapreduce

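I've noticed that all of my Hadoop data nodes routinely have about 10-15 Java processes running as the "mapred" user, and that these processes hang around for up to several days at a time.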

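Only a couple of these processes are actively doing work. The rest appear to have completed their respective MapReduce tasks successfully long ago, but they never exited.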

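This is worrying because, even though these processes have finished their work, they still hold on to valuable memory allocations and map/reduce task slots. (On some servers they add up to 58 GB of VSZ and 30+ GB of RSS.)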
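Totals like the ones above can be gathered by summing the VSZ and RSS columns of `ps aux` for the "mapred" user. The following is a minimal sketch of that bookkeeping, not part of my actual tooling: the names `summarize_memory` and `MemoryTotals` are made up for illustration, and it assumes the usual procps `ps aux` column layout, where VSZ and RSS are reported in KiB.

```python
from typing import NamedTuple


class MemoryTotals(NamedTuple):
    """Aggregate memory figures for one user (hypothetical helper type)."""
    processes: int
    vsz_kib: int
    rss_kib: int


def summarize_memory(ps_aux_output: str, user: str = "mapred") -> MemoryTotals:
    """Sum the VSZ and RSS columns of `ps aux` output for one user."""
    count = vsz = rss = 0
    for line in ps_aux_output.splitlines():
        # ps aux columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or fields[0] != user:
            continue  # skips the header, blank lines, and other users
        count += 1
        vsz += int(fields[4])  # virtual size, KiB
        rss += int(fields[5])  # resident set size, KiB
    return MemoryTotals(count, vsz, rss)
```

The input can be captured with something like `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout`; dividing the KiB totals by 1024 ** 2 gives GiB.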

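Here is an example of one such process (slightly shortened). It is currently later in the day (pm, MST) on 7-Feb-2019, and you can see that the process started long before that, at 12:01: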

$ ps aux | grep mapred

USER       PID %CPU %MEM    VSZ   RSS    TTY      STAT START   TIME COMMAND

mapred    2915  0.2  0.1 1749504 365004 ?      Ssl  12:01   0:09 /usr/java/jdk1.7.0_00/jre/bin/java -Djava.library.path=/usr/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64:/disk6/mapreduce/tmp-map-data/taskTracker/mapred/jobcache/job_201902042217_28716/attempt_201902042217_28716_m_000000_0/work -Xmx1024m -Djava.io.tmpdir=/disk6/mapreduce/tmp-map-data/taskTracker/mapred/jobcache/job_201902042217_28716/attempt_201902042217_28716_m_000000_0/work/tmp
..
TLA -Dhadoop.tasklog.taskid=attempt_201902042217_28716_m_000000_0 -Dhadoop.tasklog.iscleanup=false -Dhadoop.tasklog.totalLogFileSize=0 org.apache.hadoop.mapred.Child 127.0.0.1 47071 attempt_201902042217_28716_m_000000_0 /var/log/hadoop-0.20-mapreduce/userlogs/job_201902042217_28716/attempt_201902042217_28716_m_000000_0 -1455485471
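To match a lingering process against the JobTracker history, the task attempt ID embedded in its command line can be split into its parts. The sketch below assumes the MR1 naming scheme visible in the output above (`attempt_<JobTracker start>_<job number>_<m|r>_<task number>_<attempt number>`); `find_task_attempt` and `TaskAttempt` are hypothetical names chosen for illustration.

```python
import re
from typing import NamedTuple, Optional

# attempt_<JobTracker start time>_<job number>_<m or r>_<task number>_<attempt number>
ATTEMPT_RE = re.compile(r"attempt_(\d+)_(\d+)_([mr])_(\d+)_(\d+)")


class TaskAttempt(NamedTuple):
    """Parsed pieces of a task attempt ID (hypothetical helper type)."""
    attempt_id: str
    job_id: str
    is_map: bool
    task_number: int
    attempt_number: int


def find_task_attempt(command_line: str) -> Optional[TaskAttempt]:
    """Return the first task attempt ID found in a process command line."""
    match = ATTEMPT_RE.search(command_line)
    if match is None:
        return None  # not a task child JVM, or the ID was cut off
    jt_start, job_num, kind, task_num, attempt_num = match.groups()
    return TaskAttempt(
        attempt_id=match.group(0),
        job_id=f"job_{jt_start}_{job_num}",
        is_map=(kind == "m"),
        task_number=int(task_num),
        attempt_number=int(attempt_num),
    )
```

For the process shown above this yields job ID `job_201902042217_28716`, which is the ID to look up in the JobTracker history.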

Here is the summary from the JobTracker, which clearly shows that the job above and all of its tasks completed:

Hadoop Job 28716 on History Viewer
User: mapred
JobName: oozie:launcher:T=java:W=<censored>:A=checkStatus:ID=0007877-190202224243941-oozie-oozi-W
JobConf: hdfs://PRODcluster/tmp/hadoop-mapred/mapred/staging/mapred/.staging/job_201902042217_28716/job.xml
Job-ACLs: All users are allowed
Submitted At: 7-Feb-2019 12:01:04
Launched At: 7-Feb-2019 12:01:04 (0sec)
Finished At: 7-Feb-2019 12:01:16 (12sec)
Status: SUCCESS
Analyse This Job
Kind     Total Tasks (successful+failed+killed)  Successful tasks  Failed tasks  Killed tasks  Start Time            Finish Time
Setup    1                                       1                 0             0             7-Feb-2019 12:01:04   7-Feb-2019 12:01:06 (1sec)
Map      1                                       1                 0             0             7-Feb-2019 12:01:08   7-Feb-2019 12:01:14 (5sec)
Reduce   0                                       0                 0             0
Cleanup  1                                       1                 0             0             7-Feb-2019 12:01:14   7-Feb-2019 12:01:16 (1sec)

Here is the tail of the task log, which shows a clean finish:

$ sudo ls -lh /var/log/hadoop-0.20-mapreduce/userlogs/job_201902042217_28716/attempt_201902042217_28716_m_000000_0/syslog

-rw-r--r-- 1 mapred mapred 22K Feb 7 12:01 /var/log/hadoop-0.20-mapreduce/userlogs/job_201902042217_28716/attempt_201902042217_28716_m_000000_0/syslog

Tail:

2019-02-07 12:01:12,983 INFO org.apache.hadoop.mapred.Task: Task:attempt_201902042217_28716_m_000000_0 is done. And is in the process of commiting
2019-02-07 12:01:14,105 INFO org.apache.hadoop.mapred.Task: Task attempt_201902042217_28716_m_000000_0 is allowed to commit now
2019-02-07 12:01:14,128 INFO org.apache.hadoop.mapred.FileOutputCommitter: Saved output of task 'attempt_201902042217_28716_m_000000_0' to hdfs://PRODcluster:8020/user/mapred/oozie-oozi/0007877-190202224243941-oozie-oozi-W/checkStatus--java/output
2019-02-07 12:01:14,131 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201902042217_28716_m_000000_0' done.
2019-02-07 12:01:14,133 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1

If I strace the process, I see repeated FUTEX_WAIT timeouts:

$ sudo strace -p 301 -f


[pid   339] futex(0x7f5cac0d7c54, FUTEX_WAIT_BITSET_PRIVATE, 1, {11581718, 780361675}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid   339] futex(0x7f5cac0d7c28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid   339] futex(0x7f5cac0d7c54, FUTEX_WAIT_BITSET_PRIVATE, 1, {11581718, 830757941}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid   339] futex(0x7f5cac0d7c28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid   339] futex(0x7f5cac0d7c54, FUTEX_WAIT_BITSET_PRIVATE, 1, {11581718, 881088118}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid   339] futex(0x7f5cac0d7c28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid   339] futex(0x7f5cac0d7c54, FUTEX_WAIT_BITSET_PRIVATE, 1, {11581718, 931488956}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid   339] futex(0x7f5cac0d7c28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid   339] futex(0x7f5cac0d7c54, FUTEX_WAIT_BITSET_PRIVATE, 1, {11581718, 981908072}, ffffffff <unfinished ...>
[pid   329] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid   329] futex(0x7f5cac09cc28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid   329] futex(0x7f5cac09cc54, FUTEX_WAIT_BITSET_PRIVATE, 1, {11581719, 976658613}, ffffffff <unfinished ...>
[pid   339] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid   339] futex(0x7f5cac0d7c28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid   339] futex(0x7f5cac0d7c54, FUTEX_WAIT_BITSET_PRIVATE, 1, {11581719, 32166788}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid   339] futex(0x7f5cac0d7c28, FUTEX_WAKE_PRIVATE, 1) = 0
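To see which threads are responsible for the timeouts, the trace can be tallied per thread ID. This is only a rough sketch that assumes the `strace -f` line format shown above, where each line starts with `[pid N]`; the function name `count_futex_timeouts` is made up for illustration.

```python
import re
from collections import Counter

# With `strace -f`, every line is prefixed with the traced thread's ID.
PID_RE = re.compile(r"^\[pid\s+(\d+)\]")


def count_futex_timeouts(strace_output: str) -> Counter:
    """Count futex calls that returned ETIMEDOUT, keyed by thread ID."""
    timeouts = Counter()
    for raw_line in strace_output.splitlines():
        line = raw_line.strip()
        match = PID_RE.match(line)
        if match is None:
            continue  # blank line or a line without a [pid N] prefix
        # A timed-out call shows up either as one complete futex(...) line or
        # as the "<... futex resumed>" half of a call that strace split in two.
        # The "<unfinished ...>" half carries no result and is not counted.
        if "futex" in line and "ETIMEDOUT" in line:
            timeouts[int(match.group(1))] += 1
    return timeouts
```

Calling `count_futex_timeouts(trace).most_common()` lists the thread IDs ordered by how often their waits timed out.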

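These jobs come in through coordinated Oozie workflows. I'm running Hadoop 2.0.0-cdh4.3.0 with 2 namenodes and 3 Zookeeper nodes. One of the Zookeeper nodes has been down for a few months, but I'm not sure whether that is relevant at all. Please let me know if there is anything else I should include here.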

0 answers