Hadoop: retrying failed jobs

Time: 2014-11-27 11:42:29

Tags: hadoop mapreduce apache-pig jobs

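I'm a Hadoop noob, and I'm taking over some data work from a colleague who is currently away, so please include the obvious in your answers if you can.

I'm running MapReduce tasks, and a task usually fails because one or more of its jobs fail. After I retry the task a number of times, it eventually succeeds with some luck. The error log is long, so shout if you need more of it.

The error log contains a lot of repetition and a lot of INFO lines, most of which I have left out. snape, minerva, hagrid and dumbledore are the names of local servers.

Is there a way to tell Hadoop to retry only the failed jobs, instead of spending a long time on the whole task only to delete the output at the end because it had errors?
At the moment I have to retry these tasks many times (I have run one task 10 times and it still has not succeeded). That seems silly. Any ideas?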

...
...
2014-11-27 12:30:04,453 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1417008757700_3632,job_1417008757700_3633,job_1417008757700_3636,job_1417008757700_3637,job_1417008757700_3638]
2014-11-27 12:30:17,373 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 77% complete
2014-11-27 12:30:17,373 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1417008757700_3633,job_1417008757700_3636,job_1417008757700_3637,job_1417008757700_3638]
2014-11-27 12:30:21,390 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 82% complete
2014-11-27 12:30:21,390 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1417008757700_3633,job_1417008757700_3636,job_1417008757700_3637,job_1417008757700_3638]
2014-11-27 12:30:34,486 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 87% complete
2014-11-27 12:30:34,487 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1417008757700_3633,job_1417008757700_3637,job_1417008757700_3638]
2014-11-27 12:30:57,726 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1417008757700_3626 has failed! Stop running all dependent jobs
2014-11-27 12:30:57,726 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1417008757700_3633 has failed! Stop running all dependent jobs
2014-11-27 12:30:57,726 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-11-27 12:30:58,728 [main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: snape/192.168.0.23:55798. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2014-11-27 12:30:59,729 [main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: snape/192.168.0.23:55798. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2014-11-27 12:31:00,730 [main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: snape/192.168.0.23:55798. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2014-11-27 12:31:00,834 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=FAILED. Redirecting to job history server
2014-11-27 12:31:00,874 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: Error: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /ccg_map_3/2012_08/gp_r_tops_by_total_c.csv/_temporary/1/_temporary/attempt_1417008757700_3626_r_000000_3/part-r-00000 (inode 202543): File does not exist. Holder DFSClient_attempt_1417008757700_3626_r_000000_3_838272402_1 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3170)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3140)
...
...
...
...
2014-11-27 12:31:04,150 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 2 map reduce job(s) failed!
2014-11-27 12:31:04,154 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics: 

HadoopVersion   PigVersion  UserId  StartedAt   FinishedAt  Features
2.5.1   0.13.0  hduser  2014-11-27 12:27:08 2014-11-27 12:31:04 HASH_JOIN

Some jobs have failed! Stop running all dependent jobs

Job Stats (time in seconds):
JobId   Maps    Reduces MaxMapTime  MinMapTIme  AvgMapTime  MedianMapTime   MaxReduceTime   MinReduceTime   AvgReduceTime   MedianReducetime    Alias   Feature Outputs
job_1417008757700_3627  2   1   11  4   8   8   12  12  12  12  r_cleaned,r_joined,r_metrics,r_top_cats HASH_JOIN   /ccg_map_3/2012_08/gp_r_tops_by_total_c.csv,
job_1417008757700_3628  2   1   10  4   7   7   26  26  26  26  r_cleaned,r_joined,r_metrics,r_top_cats HASH_JOIN   /ccg_map_3/2012_08/gp_r_tops_by_total_c.csv,
job_1417008757700_3629  2   1   11  5   8   8   38  38  38  38  r_cleaned,r_joined,r_metrics,r_top_cats HASH_JOIN   /ccg_map_3/2012_08/gp_r_tops_by_total_c.csv,
job_1417008757700_3630  2   1   9   4   7   7   40  40  40  40  r_cleaned,r_joined,r_metrics,r_top_cats HASH_JOIN   /ccg_map_3/2012_08/gp_r_tops_by_total_c.csv,
job_1417008757700_3631  2   1   9   3   6   6   10  10  10  10  r_cleaned,r_joined,r_metrics,r_top_cats HASH_JOIN   /ccg_map_3/2012_08/gp_r_tops_by_total_c.csv,
job_1417008757700_3632  2   1   10  3   7   7   49  49  49  49  r_cleaned,r_joined,r_metrics,r_top_cats HASH_JOIN   /ccg_map_3/2012_08/gp_r_tops_by_total_c.csv,
job_1417008757700_3634  2   1   10  4   7   7   10  10  10  10  r_cleaned,r_joined,r_metrics,r_top_cats HASH_JOIN   /ccg_map_3/2012_08/gp_r_tops_by_total_c.csv,
job_1417008757700_3635  2   1   10  4   7   7   10  10  10  10  r_cleaned,r_joined,r_metrics,r_top_cats HASH_JOIN   /ccg_map_3/2012_08/gp_r_tops_by_total_c.csv,
job_1417008757700_3636  2   1   9   3   6   6   14  14  14  14  r_cleaned,r_joined,r_metrics,r_top_cats HASH_JOIN   /ccg_map_3/2012_08/gp_r_tops_by_total_c.csv,
job_1417008757700_3637  2   1   12  5   8   8   17  17  17  17  r_cleaned,r_joined,r_metrics,r_top_cats HASH_JOIN   /ccg_map_3/2012_08/gp_r_tops_by_total_c.csv,
job_1417008757700_3638  2   1   9   3   6   6   30  30  30  30  r_cleaned,r_joined,r_metrics,r_top_cats HASH_JOIN   /ccg_map_3/2012_08/gp_r_tops_by_total_c.csv,

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_1417008757700_3626  r_cleaned,r_joined,r_metrics,r_top_cats HASH_JOIN   Message: Job failed!    /ccg_map_3/2012_08/gp_r_tops_by_total_c.csv,
job_1417008757700_3633  r_cleaned,r_joined,r_metrics,r_top_cats HASH_JOIN   Message: Job failed!    /ccg_map_3/2012_08/gp_r_tops_by_total_c.csv,

Input(s):
Failed to read data from "/ccg_map_3/_GLOBAL/england_r_tops_by_total_c.csv"
Failed to read data from "/ccg_map_3/2012_08/r_gp_prescription_metrics_final.csv"
Successfully read 10 records from: "/ccg_map_3/_GLOBAL/england_r_tops_by_total_c.csv"
Successfully read 150124 records from: "/ccg_map_3/2012_08/r_gp_prescription_metrics_final.csv"
Successfully read 10 records from: "/ccg_map_3/_GLOBAL/england_r_tops_by_total_c.csv"
Successfully read 150124 records from: "/ccg_map_3/2012_08/r_gp_prescription_metrics_final.csv"
Successfully read 10 records from: "/ccg_map_3/_GLOBAL/england_r_tops_by_total_c.csv"
Successfully read 150124 records from: "/ccg_map_3/2012_08/r_gp_prescription_metrics_final.csv"
Successfully read 10 records from: "/ccg_map_3/_GLOBAL/england_r_tops_by_total_c.csv"
Successfully read 150124 records from: "/ccg_map_3/2012_08/r_gp_prescription_metrics_final.csv"
Successfully read 150124 records from: "/ccg_map_3/2012_08/r_gp_prescription_metrics_final.csv"
Successfully read 10 records from: "/ccg_map_3/_GLOBAL/england_r_tops_by_total_c.csv"
Successfully read 150124 records from: "/ccg_map_3/2012_08/r_gp_prescription_metrics_final.csv"
Successfully read 10 records from: "/ccg_map_3/_GLOBAL/england_r_tops_by_total_c.csv"
Failed to read data from "/ccg_map_3/2012_08/r_gp_prescription_metrics_final.csv"
Failed to read data from "/ccg_map_3/_GLOBAL/england_r_tops_by_total_c.csv"
Successfully read 150124 records from: "/ccg_map_3/2012_08/r_gp_prescription_metrics_final.csv"
Successfully read 10 records from: "/ccg_map_3/_GLOBAL/england_r_tops_by_total_c.csv"
Successfully read 150124 records from: "/ccg_map_3/2012_08/r_gp_prescription_metrics_final.csv"
Successfully read 10 records from: "/ccg_map_3/_GLOBAL/england_r_tops_by_total_c.csv"
Successfully read 10 records from: "/ccg_map_3/_GLOBAL/england_r_tops_by_total_c.csv"
Successfully read 150124 records from: "/ccg_map_3/2012_08/r_gp_prescription_metrics_final.csv"
Successfully read 10 records from: "/ccg_map_3/_GLOBAL/england_r_tops_by_total_c.csv"
Successfully read 150124 records from: "/ccg_map_3/2012_08/r_gp_prescription_metrics_final.csv"
Successfully read 150124 records from: "/ccg_map_3/2012_08/r_gp_prescription_metrics_final.csv"
Successfully read 10 records from: "/ccg_map_3/_GLOBAL/england_r_tops_by_total_c.csv"

Output(s):
Failed to produce result in "/ccg_map_3/2012_08/gp_r_tops_by_total_c.csv"
Successfully stored 76033 records (67971674 bytes) in: "/ccg_map_3/2012_08/gp_r_tops_by_total_c.csv"
Successfully stored 76033 records (67971674 bytes) in: "/ccg_map_3/2012_08/gp_r_tops_by_total_c.csv"
Successfully stored 76033 records (67971674 bytes) in: "/ccg_map_3/2012_08/gp_r_tops_by_total_c.csv"
Successfully stored 76033 records (67971674 bytes) in: "/ccg_map_3/2012_08/gp_r_tops_by_total_c.csv"
Successfully stored 76033 records (67971674 bytes) in: "/ccg_map_3/2012_08/gp_r_tops_by_total_c.csv"
Successfully stored 76033 records (67971674 bytes) in: "/ccg_map_3/2012_08/gp_r_tops_by_total_c.csv"
Failed to produce result in "/ccg_map_3/2012_08/gp_r_tops_by_total_c.csv"
Successfully stored 76033 records (67971674 bytes) in: "/ccg_map_3/2012_08/gp_r_tops_by_total_c.csv"
Successfully stored 76033 records (67971674 bytes) in: "/ccg_map_3/2012_08/gp_r_tops_by_total_c.csv"
Successfully stored 76033 records (67971674 bytes) in: "/ccg_map_3/2012_08/gp_r_tops_by_total_c.csv"
Successfully stored 76033 records (67971674 bytes) in: "/ccg_map_3/2012_08/gp_r_tops_by_total_c.csv"
Successfully stored 76033 records (67971674 bytes) in: "/ccg_map_3/2012_08/gp_r_tops_by_total_c.csv"

Counters:
Total records written : 836363
Total bytes written : 747688414
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1417008757700_3626
job_1417008757700_3627
job_1417008757700_3628
job_1417008757700_3629
job_1417008757700_3630
job_1417008757700_3631
job_1417008757700_3632
job_1417008757700_3633
job_1417008757700_3634
job_1417008757700_3635
job_1417008757700_3636
job_1417008757700_3637
job_1417008757700_3638


2014-11-27 12:31:05,155 [main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: hagrid/192.168.0.24:47581. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2014-11-27 12:31:06,155 [main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: hagrid/192.168.0.24:47581. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2014-11-27 12:31:07,156 [main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: hagrid/192.168.0.24:47581. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2014-11-27 12:31:07,258 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2014-11-27 12:31:08,288 [main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: minerva/192.168.0.22:44668. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2014-11-27 12:31:09,288 [main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: minerva/192.168.0.22:44668. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2014-11-27 12:31:10,289 [main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: minerva/192.168.0.22:44668. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2014-11-27 12:31:10,390 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2014-11-27 12:31:11,423 [main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: dumbledore/192.168.0.21:55195. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2014-11-27 12:31:12,424 [main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: dumbledore/192.168.0.21:55195. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2014-11-27 12:31:13,425 [main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: dumbledore/192.168.0.21:55195. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2014-11-27 12:31:13,527 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
...
...
...
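
For anyone reading a log like the one above, the failed job IDs can be picked out mechanically instead of by eye. This is a minimal sketch, assuming the log has been saved as plain text and that the launcher keeps printing lines of the form `job <id> has failed!` as in this excerpt. The function name `failed_job_ids` is hypothetical and not part of Pig or Hadoop, and the script only reports failures; it does not retry anything.

```python
import re

# Matches Pig launcher lines such as:
#   "... MapReduceLauncher - job job_1417008757700_3626 has failed! Stop running all dependent jobs"
# The pattern is an assumption based on the log excerpt in this question.
FAILED_JOB = re.compile(r"job (job_\d+_\d+) has failed!")


def failed_job_ids(log_text):
    """Return failed job IDs in order of first appearance, without duplicates."""
    seen = []
    for match in FAILED_JOB.finditer(log_text):
        job_id = match.group(1)
        if job_id not in seen:
            seen.append(job_id)
    return seen


if __name__ == "__main__":
    # Two lines copied from the log above, plus one line that must not match.
    sample = (
        "2014-11-27 12:30:57,726 [main] INFO  MapReduceLauncher - "
        "job job_1417008757700_3626 has failed! Stop running all dependent jobs\n"
        "2014-11-27 12:30:57,726 [main] INFO  MapReduceLauncher - "
        "job job_1417008757700_3633 has failed! Stop running all dependent jobs\n"
        "Some jobs have failed! Stop running all dependent jobs\n"
    )
    for job_id in failed_job_ids(sample):
        print(job_id)
```

The IDs it prints are the same ones listed under "Failed Jobs:" in the Pig statistics, which makes it easier to look up the matching attempts on the job history server.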

0 answers:

No answers