在诊断为什么Jobtracker中的特定作业长期运行以及解决该问题的变通方法时,我需要帮助。
下面是有问题的作业的摘录(请原谅格式化):
Hadoop job_201901281553_38848
User: mapred
Job-ACLs: All users are allowed
Job Setup: Successful
Status: Running
Started at: Fri Feb 01 12:39:05 CST 2019
Running for: 3hrs, 23mins, 58sec
Job Cleanup: Pending
Kind % Complete Num Tasks Pending Running Complete Killed Failed/Killed
Task Attempts
map 100.00% 1177 0 0 1177 0 0 / 0
reduce 95.20% 12 0 2 10 0 0 / 0
Counter Map Reduce Total
File System Counters FILE: Number of bytes read 1,144,088,621 1,642,723,691 2,786,812,312
FILE: Number of bytes written 3,156,884,366 1,669,567,665 4,826,452,031
FILE: Number of read operations 0 0 0
FILE: Number of large read operations 0 0 0
FILE: Number of write operations 0 0 0
HDFS: Number of bytes read 11,418,749,621 0 11,418,749,621
HDFS: Number of bytes written 0 8,259,932,078 8,259,932,078
HDFS: Number of read operations 2,365 5 2,370
HDFS: Number of large read operations 0 0 0
HDFS: Number of write operations 0 12 12
Job Counters Launched map tasks 0 0 1,177
Launched reduce tasks 0 0 12
Data-local map tasks 0 0 1,020
Rack-local map tasks 0 0 157
Total time spent by all maps in occupied slots (ms) 0 0 4,379,522
Total time spent by all reduces in occupied slots (ms) 0 0 81,115,664
Map-Reduce Framework Map input records 77,266,616 0 77,266,616
Map output records 77,266,616 0 77,266,616
Map output bytes 11,442,228,060 0 11,442,228,060
Input split bytes 177,727 0 177,727
Combine input records 0 0 0
Combine output records 0 0 0
Reduce input groups 0 37,799,412 37,799,412
Reduce shuffle bytes 0 1,853,727,946 1,853,727,946
Reduce input records 0 76,428,913 76,428,913
Reduce output records 0 48,958,874 48,958,874
Spilled Records 112,586,947 62,608,254 175,195,201
CPU time spent (ms) 2,461,980 14,831,230 17,293,210
Physical memory (bytes) snapshot 366,933,626,880 9,982,947,328 376,916,574,208
Virtual memory (bytes) snapshot 2,219,448,848,384 23,215,755,264 2,242,664,603,648
Total committed heap usage (bytes) 1,211,341,733,888 8,609,333,248 1,219,951,067,136
AcsReducer ColumnDeletesOnTable- 0 3,284,862 3,284,862
ColumnDeletesOnTable- 0 3,285,695 3,285,695
ColumnDeletesOnTable- 0 3,284,862 3,284,862
ColumnDeletesOnTable- 0 129,653 129,653
ColumnDeletesOnTable- 0 129,653 129,653
ColumnDeletesOnTable- 0 129,653 129,653
ColumnDeletesOnTable- 0 129,653 129,653
ColumnDeletesOnTable- 0 517,641 517,641
ColumnDeletesOnTable- 0 23,786 23,786
ColumnDeletesOnTable- 0 594,872 594,872
ColumnDeletesOnTable- 0 597,739 597,739
ColumnDeletesOnTable- 0 595,665 595,665
ColumnDeletesOnTable- 0 36,101,345 36,101,345
ColumnDeletesOnTable- 0 11,791 11,791
ColumnDeletesOnTable- 0 11,898 11,898
ColumnDeletesOnTable-0 176 176
RowDeletesOnTable- 0 224,044 224,044
RowDeletesOnTable- 0 224,045 224,045
RowDeletesOnTable- 0 224,044 224,044
RowDeletesOnTable- 0 17,425 17,425
RowDeletesOnTable- 0 17,425 17,425
RowDeletesOnTable- 0 17,425 17,425
RowDeletesOnTable- 0 17,425 17,425
RowDeletesOnTable- 0 459,890 459,890
RowDeletesOnTable- 0 23,786 23,786
RowDeletesOnTable- 0 105,910 105,910
RowDeletesOnTable- 0 107,829 107,829
RowDeletesOnTable- 0 105,909 105,909
RowDeletesOnTable- 0 36,101,345 36,101,345
RowDeletesOnTable- 0 11,353 11,353
RowDeletesOnTable- 0 11,459 11,459
RowDeletesOnTable- 0 168 168
WholeRowDeletesOnTable- 0 129,930 129,930
deleteRowsCount 0 37,799,410 37,799,410
deleteRowsMicros 0 104,579,855,042 104,579,855,042
emitCount 0 48,958,874 48,958,874
emitMicros 0 201,996,180 201,996,180
rollupValuesCount 0 37,799,412 37,799,412
rollupValuesMicros 0 234,085,342 234,085,342
正如你可以看到它现在已经运行近35个小时。有1177个Map任务,它们在一段时间之前完成。还原阶段不完整,达到95%。 因此,我深入研究了“ reduce”链接,并将其带到任务列表。如果我深入研究第一个未完成的任务,则为:
Job job_201901281553_38848
All Task Attempts
Task Attempts Machine Status Progress Start Time Shuffle Finished Sort Finished Finish Time Errors Task Logs Counters Actions
attempt_201901281553_38848_r_000000_0 RUNNING 70.81% 2/1/2019 12:39 1-Feb-2019 12:39:59 (18sec) 1-Feb-2019 12:40:01 (2sec) Last 4KB 60
Last 8KB
All
从那里我可以看到机器/ DataNode会运行任务,所以我ssh到它,我看看日志(过滤上只是有问题的任务) 从datanode $ / var / log / hadoop-0.20-mapreduce / hadoop-mapred-tasktracker-.log
2019-02-01 12:39:40,836 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201901281553_38848_r_000000_0 task's state:UNASSIGNED
2019-02-01 12:39:40,838 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201901281553_38848_r_000000_0 which needs 1 slots
2019-02-01 12:39:40,838 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 21 and trying to launch attempt_201901281553_38848_r_000000_0 which needs 1 slots
2019-02-01 12:39:40,925 INFO org.apache.hadoop.mapred.TaskController: Writing commands to /disk12/mapreduce/tmp-map-data/ttprivate/taskTracker/mapred/jobcache/job_201901281553_38848/attempt_201901281553_38848_r_000000_0/taskjvm.sh
2019-02-01 12:39:41,904 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201901281553_38848_r_-819481850 given task: attempt_201901281553_38848_r_000000_0
2019-02-01 12:39:49,011 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.09402435% reduce > copy (332 of 1177 at 23.66 MB/s) >
2019-02-01 12:39:56,250 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.25233644% reduce > copy (891 of 1177 at 12.31 MB/s) >
2019-02-01 12:39:59,206 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.25233644% reduce > copy (891 of 1177 at 12.31 MB/s) >
2019-02-01 12:39:59,350 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.33333334% reduce > sort
2019-02-01 12:40:01,599 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.33333334% reduce > sort
2019-02-01 12:40:02,469 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.6667039% reduce > reduce
2019-02-01 12:40:05,565 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.6667039% reduce > reduce
2019-02-01 12:40:11,666 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.6668788% reduce > reduce
2019-02-01 12:40:14,755 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.66691136% reduce > reduce
2019-02-01 12:40:17,838 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.6670001% reduce > reduce
2019-02-01 12:40:20,930 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.6671631% reduce > reduce
2019-02-01 12:40:24,016 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.6672566% reduce > reduce
..并且这些行以这种方式重复数小时.. ^所以会出现洗牌/排序阶段进行得非常快,但在此之后,它只是减少相爬行,这一比例正在缓慢增加,但在任务完成前,需要时间。
1)看起来像是瓶颈,我正确地确定了我长期工作的原因是因为此任务(以及许多类似的任务)在任务的缩减阶段花费了很长时间?
2)如果是,我有什么选择来加快速度?
在分配了该任务的数据节点及其iowait上,负载似乎相当低:
top - 15:20:03 up 124 days, 1:04, 1 user, load average: 3.85, 5.64, 5.96
Tasks: 1095 total, 2 running, 1092 sleeping, 0 stopped, 1 zombie
Cpu(s): 3.8%us, 1.5%sy, 0.9%ni, 93.6%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 503.498G total, 495.180G used, 8517.543M free, 5397.789M buffers
Swap: 2046.996M total, 0.000k used, 2046.996M free, 432.468G cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
82236 hbase 20 0 16.9g 16g 17m S 136.9 3.3 26049:16 java
30143 root 39 19 743m 621m 13m R 82.3 0.1 1782:06 clamscan
62024 mapred 20 0 2240m 1.0g 24m S 75.1 0.2 1:21.28 java
36367 mapred 20 0 1913m 848m 24m S 11.2 0.2 22:56.98 java
36567 mapred 20 0 1898m 825m 24m S 9.5 0.2 22:23.32 java
36333 mapred 20 0 1879m 880m 24m S 8.2 0.2 22:44.28 java
36374 mapred 20 0 1890m 831m 24m S 6.9 0.2 23:15.65 java
和iostat -xm 4的摘要:
avg-cpu: %user %nice %system %iowait %steal %idle
2.15 0.92 0.30 0.17 0.00 96.46
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 350.25 0.00 30.00 0.00 1.49 101.67 0.02 0.71 0.00 0.71 0.04 0.12
sdb 0.00 2.75 0.00 6.00 0.00 0.03 11.67 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 9.75 0.00 1.25 0.00 0.04 70.40 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 6.50 0.00 0.75 0.00 0.03 77.33 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 5.75 0.00 0.50 0.00 0.02 100.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 8.00 0.00 0.75 0.00 0.03 93.33 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 6.25 0.00 0.50 0.00 0.03 108.00 0.00 0.00 0.00 0.00 0.00 0.00
sdi 0.00 3.75 93.25 0.50 9.03 0.02 197.57 0.32 3.18 3.20 0.00 1.95 18.30
sdj 0.00 3.50 0.00 0.50 0.00 0.02 64.00 0.00 0.00 0.00 0.00 0.00 0.00
sdk 0.00 7.00 0.00 0.75 0.00 0.03 82.67 0.00 0.33 0.00 0.33 0.33 0.03
sdl 0.00 6.75 0.00 0.75 0.00 0.03 80.00 0.00 0.00 0.00 0.00 0.00 0.00
sdm 0.00 7.75 0.00 5.75 0.00 0.05 18.78 0.00 0.04 0.00 0.04 0.04 0.03
@<machine>:~$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 40G 5.9G 32G 16% /
tmpfs 252G 0 252G 0% /dev/shm
/dev/sda1 488M 113M 350M 25% /boot
/dev/sda8 57G 460M 54G 1% /tmp
/dev/sda7 9.8G 1.1G 8.2G 12% /var
/dev/sda5 40G 17G 21G 45% /var/log
/dev/sda6 30G 4.4G 24G 16% /var/log/audit.d
/dev/sdb1 7.2T 3.3T 3.6T 48% /disk1
/dev/sdc1 7.2T 3.3T 3.6T 49% /disk2
/dev/sdd1 7.2T 3.3T 3.6T 48% /disk3
/dev/sde1 7.2T 3.3T 3.6T 48% /disk4
/dev/sdf1 7.2T 3.3T 3.6T 48% /disk5
/dev/sdi1 7.2T 3.3T 3.6T 48% /disk6
/dev/sdg1 7.2T 3.3T 3.6T 48% /disk7
/dev/sdh1 7.2T 3.3T 3.6T 48% /disk8
/dev/sdj1 7.2T 3.3T 3.6T 48% /disk9
/dev/sdk1 7.2T 3.3T 3.6T 48% /disk10
/dev/sdm1 7.2T 3.3T 3.6T 48% /disk11
/dev/sdl1 7.2T 3.3T 3.6T 48% /disk12
这是Hadoop 2.0.0-cdh4.3.0版本。它具有3个zookeeper节点,2个namenode和35个datanode的高可用性。未安装YARN。使用hbase,oozie。工作主要通过Hive和HUE来。
每个数据节点具有2个物理CPU,每个具有22芯。启用超线程。
如果您需要更多信息,请告诉我。我的猜测是,也许我需要更多的减速,也有mapred-site.xml中设置,需要调整,或许可以从地图相输入数据过大,或书面更好的蜂巢的查询需求。林相当新的管理员到Hadoop的,任何详细的建议是巨大的。
谢谢!