Hadoop: Diagnosing a long-running job

Time: 2019-02-01 22:49:53

Tags: hadoop hive mapreduce

I need help diagnosing why a particular job in the JobTracker has been running for so long, and with workarounds to resolve it.

Below is an excerpt for the job in question (please excuse the formatting):

Hadoop job_201901281553_38848                               

User: mapred                                

Job-ACLs: All users are allowed                             
Job Setup: Successful                               
Status: Running                             
Started at: Fri Feb 01 12:39:05 CST 2019                                
Running for: 3hrs, 23mins, 58sec                                
Job Cleanup: Pending                                


Kind    % Complete  Num Tasks   Pending Running Complete    Killed  Failed/Killed Task Attempts
map     100.00%     1177    0   0   1177    0   0 / 0

reduce  95.20%      12  0   2   10  0   0 / 0


Counter Map Reduce  Total               
File System Counters    FILE: Number of bytes read  1,144,088,621       1,642,723,691   2,786,812,312               
FILE: Number of bytes written   3,156,884,366   1,669,567,665   4,826,452,031               
FILE: Number of read operations 0   0   0               
FILE: Number of large read operations   0   0   0               
FILE: Number of write operations    0   0   0               
HDFS: Number of bytes read  11,418,749,621  0   11,418,749,621              
HDFS: Number of bytes written   0   8,259,932,078   8,259,932,078               
HDFS: Number of read operations 2,365   5   2,370               
HDFS: Number of large read operations   0   0   0               
HDFS: Number of write operations    0   12  12              
Job Counters    Launched map tasks  0   0   1,177               
Launched reduce tasks   0   0   12              
Data-local map tasks    0   0   1,020               
Rack-local map tasks    0   0   157             
Total time spent by all maps in occupied slots (ms) 0   0   4,379,522               
Total time spent by all reduces in occupied slots (ms)  0   0   81,115,664              
Map-Reduce Framework    Map input records   77,266,616  0   77,266,616              
Map output records  77,266,616  0   77,266,616              
Map output bytes    11,442,228,060  0   11,442,228,060              
Input split bytes   177,727 0   177,727             
Combine input records   0   0   0               
Combine output records  0   0   0               
Reduce input groups 0   37,799,412  37,799,412              
Reduce shuffle bytes    0   1,853,727,946   1,853,727,946               
Reduce input records    0   76,428,913  76,428,913              
Reduce output records   0   48,958,874  48,958,874              
Spilled Records        112,586,947  62,608,254  175,195,201             
CPU time spent (ms)    2,461,980    14,831,230  17,293,210              
Physical memory (bytes) snapshot    366,933,626,880 9,982,947,328   376,916,574,208             
Virtual memory (bytes) snapshot 2,219,448,848,384   23,215,755,264  2,242,664,603,648               
Total committed heap usage (bytes)  1,211,341,733,888   8,609,333,248   1,219,951,067,136               
AcsReducer  ColumnDeletesOnTable-   0   3,284,862   3,284,862               
ColumnDeletesOnTable-   0   3,285,695   3,285,695               
ColumnDeletesOnTable-   0   3,284,862   3,284,862               
ColumnDeletesOnTable-   0   129,653 129,653             
ColumnDeletesOnTable-   0   129,653 129,653             
ColumnDeletesOnTable-   0   129,653 129,653             
ColumnDeletesOnTable-   0   129,653 129,653             
ColumnDeletesOnTable-   0   517,641 517,641             
ColumnDeletesOnTable-   0   23,786  23,786              
ColumnDeletesOnTable-   0   594,872 594,872             
ColumnDeletesOnTable-   0   597,739 597,739             
ColumnDeletesOnTable-   0   595,665 595,665             
ColumnDeletesOnTable-   0   36,101,345  36,101,345              
ColumnDeletesOnTable-   0   11,791  11,791              
ColumnDeletesOnTable-   0   11,898  11,898              
ColumnDeletesOnTable-   0   176 176
RowDeletesOnTable-  0   224,044 224,044             
RowDeletesOnTable-  0   224,045 224,045             
RowDeletesOnTable-  0   224,044 224,044             
RowDeletesOnTable-  0   17,425  17,425              
RowDeletesOnTable-  0   17,425  17,425              
RowDeletesOnTable-  0   17,425  17,425              
RowDeletesOnTable-  0   17,425  17,425              
RowDeletesOnTable-  0   459,890 459,890             
RowDeletesOnTable-  0   23,786  23,786              
RowDeletesOnTable-  0   105,910 105,910             
RowDeletesOnTable-  0   107,829 107,829             
RowDeletesOnTable-  0   105,909 105,909             
RowDeletesOnTable-  0   36,101,345  36,101,345              
RowDeletesOnTable-  0   11,353  11,353              
RowDeletesOnTable-  0   11,459  11,459              
RowDeletesOnTable-  0   168 168             
WholeRowDeletesOnTable- 0   129,930 129,930             
deleteRowsCount 0   37,799,410  37,799,410              
deleteRowsMicros    0   104,579,855,042 104,579,855,042             
emitCount   0   48,958,874  48,958,874              
emitMicros  0   201,996,180 201,996,180             
rollupValuesCount   0   37,799,412  37,799,412              
rollupValuesMicros  0   234,085,342 234,085,342             

As you can see, it has now been running for nearly 3.5 hours. There are 1177 map tasks, and they completed some time ago. The reduce phase is incomplete, sitting at 95%. So I drilled into the "reduce" link, which takes me to the task list. If I drill into the first incomplete task, it is:

Job job_201901281553_38848

All Task Attempts

Task Attempts	Machine	Status	Progress	Start Time	Shuffle Finished	Sort Finished	Finish Time	Errors	Task Logs	Counters	Actions
attempt_201901281553_38848_r_000000_0		RUNNING	70.81%	2/1/2019 12:39	1-Feb-2019 12:39:59 (18sec)	1-Feb-2019 12:40:01 (2sec)			Last 4KB / Last 8KB / All	60	

From there I can see which machine/datanode is running the task, so I ssh into it and look at the logs (filtered down to just the task in question) on the datanode, in /var/log/hadoop-0.20-mapreduce/hadoop-mapred-tasktracker-.log:

2019-02-01 12:39:40,836 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201901281553_38848_r_000000_0 task's state:UNASSIGNED
2019-02-01 12:39:40,838 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201901281553_38848_r_000000_0 which needs 1 slots
2019-02-01 12:39:40,838 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 21 and trying to launch attempt_201901281553_38848_r_000000_0 which needs 1 slots
2019-02-01 12:39:40,925 INFO org.apache.hadoop.mapred.TaskController: Writing commands to /disk12/mapreduce/tmp-map-data/ttprivate/taskTracker/mapred/jobcache/job_201901281553_38848/attempt_201901281553_38848_r_000000_0/taskjvm.sh
2019-02-01 12:39:41,904 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201901281553_38848_r_-819481850 given task: attempt_201901281553_38848_r_000000_0
2019-02-01 12:39:49,011 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.09402435% reduce > copy (332 of 1177 at 23.66 MB/s) >
2019-02-01 12:39:56,250 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.25233644% reduce > copy (891 of 1177 at 12.31 MB/s) >
2019-02-01 12:39:59,206 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.25233644% reduce > copy (891 of 1177 at 12.31 MB/s) >
2019-02-01 12:39:59,350 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.33333334% reduce > sort
2019-02-01 12:40:01,599 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.33333334% reduce > sort
2019-02-01 12:40:02,469 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.6667039% reduce > reduce
2019-02-01 12:40:05,565 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.6667039% reduce > reduce
2019-02-01 12:40:11,666 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.6668788% reduce > reduce
2019-02-01 12:40:14,755 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.66691136% reduce > reduce
2019-02-01 12:40:17,838 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.6670001% reduce > reduce
2019-02-01 12:40:20,930 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.6671631% reduce > reduce
2019-02-01 12:40:24,016 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201901281553_38848_r_000000_0 0.6672566% reduce > reduce

..and these lines repeat in this fashion for hours.. ^So it appears the shuffle/sort phases go very quickly, but after that the reduce phase just crawls along; the percentage slowly increases, but it takes hours before the task completes.

1) Does this look like the bottleneck? Am I correct in identifying that the reason for my long-running job is that this task (and many similar ones) spends a very long time in the reduce phase?

2) If so, what options do I have to speed it up?
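To make question 2 concrete: the knob I suspect matters is the number of reduce tasks, since the job above ran with only 12 reducers for roughly 11 GB of map output. The real job is submitted through Hive, so I don't have a driver to show; the following is only a minimal, hypothetical identity MapReduce job illustrating where that setting would live (the class name and the value 48 are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReducerCountSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reducer-count-sketch");
        job.setJarByClass(ReducerCountSketch.class);

        // Identity mapper/reducer stand in for the real AcsReducer job,
        // whose source I don't have.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // The stuck job ran with 12 reduce tasks; more reducers would spread
        // the reduce-side work across more slots (illustrative value only).
        job.setNumReduceTasks(48);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Since the job actually comes from Hive I can't simply edit a driver like this, but it shows the setting I believe controls the 12.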

The load on the datanode that this task was assigned to, and its iowait, appear to be fairly low:

top - 15:20:03 up 124 days,  1:04,  1 user,  load average: 3.85, 5.64, 5.96
Tasks: 1095 total,   2 running, 1092 sleeping,   0 stopped,   1 zombie
Cpu(s):  3.8%us,  1.5%sy,  0.9%ni, 93.6%id,  0.2%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   503.498G total,  495.180G used, 8517.543M free, 5397.789M buffers
Swap: 2046.996M total,    0.000k used, 2046.996M free,  432.468G cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
82236 hbase     20   0 16.9g  16g  17m S 136.9  3.3  26049:16 java
30143 root      39  19  743m 621m  13m R 82.3  0.1   1782:06 clamscan
62024 mapred    20   0 2240m 1.0g  24m S 75.1  0.2   1:21.28 java
36367 mapred    20   0 1913m 848m  24m S 11.2  0.2  22:56.98 java
36567 mapred    20   0 1898m 825m  24m S  9.5  0.2  22:23.32 java
36333 mapred    20   0 1879m 880m  24m S  8.2  0.2  22:44.28 java
36374 mapred    20   0 1890m 831m  24m S  6.9  0.2  23:15.65 java

and a summary of iostat -xm 4:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.15    0.92    0.30    0.17    0.00   96.46

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00   350.25    0.00   30.00     0.00     1.49   101.67     0.02    0.71    0.00    0.71   0.04   0.12
sdb               0.00     2.75    0.00    6.00     0.00     0.03    11.67     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00     9.75    0.00    1.25     0.00     0.04    70.40     0.00    0.00    0.00    0.00   0.00   0.00
sdf               0.00     6.50    0.00    0.75     0.00     0.03    77.33     0.00    0.00    0.00    0.00   0.00   0.00
sdg               0.00     5.75    0.00    0.50     0.00     0.02   100.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     8.00    0.00    0.75     0.00     0.03    93.33     0.00    0.00    0.00    0.00   0.00   0.00
sdh               0.00     6.25    0.00    0.50     0.00     0.03   108.00     0.00    0.00    0.00    0.00   0.00   0.00
sdi               0.00     3.75   93.25    0.50     9.03     0.02   197.57     0.32    3.18    3.20    0.00   1.95  18.30
sdj               0.00     3.50    0.00    0.50     0.00     0.02    64.00     0.00    0.00    0.00    0.00   0.00   0.00
sdk               0.00     7.00    0.00    0.75     0.00     0.03    82.67     0.00    0.33    0.00    0.33   0.33   0.03
sdl               0.00     6.75    0.00    0.75     0.00     0.03    80.00     0.00    0.00    0.00    0.00   0.00   0.00
sdm               0.00     7.75    0.00    5.75     0.00     0.05    18.78     0.00    0.04    0.00    0.04   0.04   0.03

@<machine>:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        40G  5.9G   32G  16% /
tmpfs           252G     0  252G   0% /dev/shm
/dev/sda1       488M  113M  350M  25% /boot
/dev/sda8        57G  460M   54G   1% /tmp
/dev/sda7       9.8G  1.1G  8.2G  12% /var
/dev/sda5        40G   17G   21G  45% /var/log
/dev/sda6        30G  4.4G   24G  16% /var/log/audit.d
/dev/sdb1       7.2T  3.3T  3.6T  48% /disk1
/dev/sdc1       7.2T  3.3T  3.6T  49% /disk2
/dev/sdd1       7.2T  3.3T  3.6T  48% /disk3
/dev/sde1       7.2T  3.3T  3.6T  48% /disk4
/dev/sdf1       7.2T  3.3T  3.6T  48% /disk5
/dev/sdi1       7.2T  3.3T  3.6T  48% /disk6
/dev/sdg1       7.2T  3.3T  3.6T  48% /disk7
/dev/sdh1       7.2T  3.3T  3.6T  48% /disk8
/dev/sdj1       7.2T  3.3T  3.6T  48% /disk9
/dev/sdk1       7.2T  3.3T  3.6T  48% /disk10
/dev/sdm1       7.2T  3.3T  3.6T  48% /disk11
/dev/sdl1       7.2T  3.3T  3.6T  48% /disk12

This is Hadoop version 2.0.0-cdh4.3.0. It is a high-availability setup with 3 zookeeper nodes, 2 namenodes and 35 datanodes. YARN is not installed. HBase and Oozie are in use. Jobs come in mostly via Hive and HUE.

Each datanode has 2 physical CPUs with 22 cores each. Hyperthreading is enabled.

Let me know if you need more information. My guess is that maybe I need more reducers, or that there are settings in mapred-site.xml that need tuning, or perhaps the input data coming out of the map phase is too large, or the Hive queries need to be written better. I'm fairly new to administering Hadoop, so any detailed advice would be great.
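In case it helps, here is a small sketch (class name made up; the property names are the old MRv1 ones, since this cluster has no YARN) that I could run on a client node to see which reducer-related defaults the cluster configuration is actually handing to submitted jobs:

import org.apache.hadoop.mapred.JobConf;

public class ShowReduceDefaults {
    public static void main(String[] args) {
        // JobConf picks up mapred-default.xml and mapred-site.xml from the
        // classpath, so this prints what a newly submitted job would inherit.
        JobConf conf = new JobConf();

        System.out.println("mapred.reduce.tasks = "
                + conf.get("mapred.reduce.tasks", "(unset, framework default is 1)"));
        System.out.println("mapred.tasktracker.reduce.tasks.maximum = "
                + conf.get("mapred.tasktracker.reduce.tasks.maximum", "(unset)"));
        System.out.println("mapred.child.java.opts = "
                + conf.get("mapred.child.java.opts", "(unset)"));
    }
}

My understanding is that Hive also honours a per-session override (set mapred.reduce.tasks=...;), which might be an easier way to experiment than changing mapred-site.xml cluster-wide, but I'd appreciate confirmation.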

Thanks!

0 Answers:

No answers yet.