Question

I am running EMR with 10 worker nodes, each with 8 vCPUs and 15 GB of RAM. The input file is about 7 GB.

Here is the map/reducer configuration:

SET job.name 'correlations';
SET pig.exec.reducers.bytes.per.reducer 2147483648; 
SET pig.exec.reducers.max 60;
SET pig.splitCombination true; 
SET mapred.min.split.size 268435456;
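
For reference, the byte values above decode to 2 GiB per reducer and a 256 MiB minimum split size. The sketch below shows how an input-size-based reducer estimate would work out for these settings. It assumes the simple heuristic used in MapReduce mode (input bytes divided by bytes per reducer, capped at the maximum); the Tez engine that appears in the logs estimates parallelism differently, so treat the result as a rough approximation. The function name `estimate_reducers` and the 7 GiB input figure are illustrative only.

```python
import math

GIB = 1024 ** 3
MIB = 1024 ** 2

# Values copied from the SET statements above.
BYTES_PER_REDUCER = 2147483648   # pig.exec.reducers.bytes.per.reducer
MAX_REDUCERS = 60                # pig.exec.reducers.max
MIN_SPLIT_SIZE = 268435456       # mapred.min.split.size


def estimate_reducers(input_bytes,
                      bytes_per_reducer=BYTES_PER_REDUCER,
                      max_reducers=MAX_REDUCERS):
    """Hypothetical helper: input-size heuristic, at least 1, at most max_reducers."""
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


if __name__ == "__main__":
    print(BYTES_PER_REDUCER / GIB)      # size per reducer in GiB
    print(MIN_SPLIT_SIZE / MIB)         # minimum split size in MiB
    print(estimate_reducers(7 * GIB))   # assumed ~7 GiB input
```

Under this heuristic the limit of 60 reducers is never reached for an input of this size; the estimate is driven almost entirely by the 2 GiB-per-reducer setting.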

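The job ran for almost 10 hours and then failed. I am looking for help optimizing it and fixing this exception; I am relatively new to EMR. The same script works with the same configuration on HDP in a Rackspace environment (where it used to take 3-4 days to complete).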

Here is the full Pig script:

-- run from gateway node

-- SET DEFAULT_PARALLEL 100;
SET job.name 'correlations';
SET pig.exec.reducers.bytes.per.reducer 2147483648; 
SET pig.exec.reducers.max 60;
SET pig.splitCombination true; 
SET mapred.min.split.size 268435456;

-- load votes from HDFS and do self join
votes1 = LOAD '<input file path>/filtered_votes.txt' USING PigStorage(',') AS (uid1: long, lnid1: long, t1: int);
votes2 = LOAD '<input file path>/list_item_correlation/filtered_votes.txt' USING PigStorage(',') AS (uid2: long, lnid2: long, t2: int);
pairs = JOIN votes1 BY uid1, votes2 BY uid2;

-- eliminate self and symmetric correlations
required_pairs = FILTER pairs BY (lnid1 < lnid2);

flags = FOREACH required_pairs GENERATE lnid1, lnid2,
                                        ( (t1 == 1 AND t2 == 1) ? 1 : 0 ) AS uu,
                                        ( (t1 == 1 AND t2 == 0) ? 1 : 0 ) AS ud,
                                        ( (t1 == 0 AND t2 == 1) ? 1 : 0 ) AS du,
                                        ( (t1 == 0 AND t2 == 0) ? 1 : 0 ) AS dd;

grouped_flags = GROUP flags BY (lnid1, lnid2);

counted = FOREACH grouped_flags GENERATE group AS ids, 
                                         SUM(flags.uu) AS suu, 
                                         SUM(flags.ud) AS sud, 
                                         SUM(flags.du) AS sdu, 
                                         SUM(flags.dd) AS sdd;

-- restrict to items with at least 30 common voters
-- (use 0 when testing)
fltrd = FILTER counted BY (suu + sud + sdu + sdd >= 30);

-- avoid divide by 0 errors when computing odds ratios below
corr1 = FOREACH fltrd GENERATE ids.lnid1 AS lnid1, ids.lnid2 AS lnid2, 
                                MAX(TOBAG(suu, 1L)) AS uu, 
                                MAX(TOBAG(sud, 1L)) AS ud, 
                                MAX(TOBAG(sdu, 1L)) AS du, 
                                MAX(TOBAG(sdd, 1L)) AS dd;
-- symmetric pair
corr2 = FOREACH fltrd GENERATE ids.lnid2 AS lnid1, ids.lnid1 AS lnid2, 
                                MAX(TOBAG(suu, 1L)) AS uu, 
                                MAX(TOBAG(sdu, 1L)) AS ud, 
                                MAX(TOBAG(sud, 1L)) AS du, 
                                MAX(TOBAG(sdd, 1L)) AS dd;
-- union
correlations = UNION corr1, corr2;

-- generate vote counts
vote_flags = FOREACH votes1 GENERATE lnid1, (t1 == 1 ? 1 : 0) AS up, (t1 == 0 ? 1 : 0) AS dn;
grpd = GROUP vote_flags BY lnid1;
vote_counts = FOREACH grpd GENERATE group AS lnid, SUM(vote_flags.up) AS up, SUM(vote_flags.dn) AS dn;

-- JOIN vote counts and correlations
jnd = JOIN vote_counts BY lnid, correlations BY lnid2;

-- avoid divide by 0 errors
jnd2 = FOREACH jnd GENERATE lnid1, lnid2, uu, ud, du, dd, up, dn, 
                            MAX(TOBAG(dn-ud, 1L)) AS dnud, 
                            MAX(TOBAG(up-uu, 1L)) AS upuu, 
                            MAX(TOBAG(dn-dd, 1L)) AS dndd, 
                            MAX(TOBAG(up-du, 1L)) AS updu;

-- calculate all the odds ratios
odds = FOREACH jnd2 GENERATE lnid1, lnid2, uu, ud, du, dd,
         (1.0 * uu * dd) / (ud * du) AS odds,
         EXP ( LOG ( (1.0 * uu * dd) / (ud * du) ) + (1.96 * SQRT ( (1.0 / uu) + (1.0 / ud) + (1.0 / du) + (1.0 / dd) )) ) AS high,
         EXP ( LOG ( (1.0 * uu * dd) / (ud * du) ) - (1.96 * SQRT ( (1.0 / uu) + (1.0 / ud) + (1.0 / du) + (1.0 / dd) )) ) AS low,

         (1.0 * uu * dnud) / (ud * upuu) AS odds_p,
         EXP ( LOG ( (1.0 * uu * dnud) / (ud * upuu) ) + (1.96 * SQRT ( (1.0 / uu) + (1.0 / ud) + (1.0 / upuu) + (1.0 / dnud) )) ) AS high_p,
         EXP ( LOG ( (1.0 * uu * dnud) / (ud * upuu) ) - (1.96 * SQRT ( (1.0 / uu) + (1.0 / ud) + (1.0 / upuu) + (1.0 / dnud) )) ) AS low_p,

         (1.0 * du * dndd) / (dd * updu) AS odds_n,
         EXP ( LOG ( (1.0 * du * dndd) / (dd * updu) ) + (1.96 * SQRT ( (1.0 / du) + (1.0 / dd) + (1.0 / updu) + (1.0 / dndd) )) ) AS high_n,
         EXP ( LOG ( (1.0 * du * dndd) / (dd * updu) ) - (1.96 * SQRT ( (1.0 / du) + (1.0 / dd) + (1.0 / updu) + (1.0 / dndd) )) ) AS low_n;

STORE odds INTO '<output location>';
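
To make the intent of the script easier to follow, here is a small in-memory Python sketch of its first half: the self join on the user id, the `lnid1 < lnid2` filter, the uu/ud/du/dd counts per item pair, the clamp to 1 that `MAX(TOBAG(x, 1L))` performs, and the odds ratio with its 95% interval. It is not a replacement for the Pig job; the function names and the sample votes are made up for illustration, votes are assumed to be 0 or 1, and the "at least 30 common voters" filter is left out.

```python
import math
from collections import defaultdict
from itertools import product


def pair_counts(votes):
    """votes: iterable of (uid, lnid, t) with t in {0, 1}.

    Returns {(lnid1, lnid2): [uu, ud, du, dd]} for lnid1 < lnid2.
    """
    by_user = defaultdict(list)
    for uid, lnid, t in votes:
        by_user[uid].append((lnid, t))

    counts = defaultdict(lambda: [0, 0, 0, 0])
    for items in by_user.values():
        # Self join on uid: every ordered pair of one user's votes.
        for (l1, t1), (l2, t2) in product(items, repeat=2):
            if l1 < l2:  # drop self pairs and symmetric duplicates
                # (1,1)->0 uu, (1,0)->1 ud, (0,1)->2 du, (0,0)->3 dd
                counts[(l1, l2)][(1 - t1) * 2 + (1 - t2)] += 1
    return dict(counts)


def odds_ratio(uu, ud, du, dd, z=1.96):
    """Return (odds, low, high), clamping each count to at least 1."""
    uu, ud, du, dd = (max(c, 1) for c in (uu, ud, du, dd))
    odds = (uu * dd) / (ud * du)
    se = math.sqrt(1 / uu + 1 / ud + 1 / du + 1 / dd)
    return (odds,
            math.exp(math.log(odds) - z * se),
            math.exp(math.log(odds) + z * se))


# Made-up sample: four users voting on items 10 and 20.
SAMPLE_VOTES = [
    (1, 10, 1), (1, 20, 1),
    (2, 10, 1), (2, 20, 0),
    (3, 10, 0), (3, 20, 0),
    (4, 10, 1), (4, 20, 1),
]

if __name__ == "__main__":
    for pair, c in pair_counts(SAMPLE_VOTES).items():
        print(pair, c, odds_ratio(*c))
```

The inner loop also shows where the cost comes from: a user with n votes contributes on the order of n squared joined rows before the filter, so a few very active users can dominate the shuffle volume.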

EMR syslog error:

............
............

2017-07-06 11:26:11,346 INFO org.apache.tez.common.counters.Limits (PigTezLauncher-0): Counter limits initialized with parameters:  GROUP_NAME_MAX=256, MAX_GROUPS=500, COUNTER_NAME_MAX=64, MAX_COUNTERS=120
2017-07-06 11:26:11,351 INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob (PigTezLauncher-0): DAG Status: status=FAILED, progress=TotalTasks: 196 Succeeded: 61 Running: 0 Failed: 1 Killed: 134 FailedTaskAttempts: 73 KilledTaskAttempts: 143, diagnostics=Vertex re-running, vertexName=scope-615, vertexId=vertex_1499292782644_0001_1_00
Vertex re-running, vertexName=scope-622, vertexId=vertex_1499292782644_0001_1_02
Vertex re-running, vertexName=scope-615, vertexId=vertex_1499292782644_0001_1_00
Vertex re-running, vertexName=scope-622, vertexId=vertex_1499292782644_0001_1_02
Vertex re-running, vertexName=scope-619, vertexId=vertex_1499292782644_0001_1_01
Vertex re-running, vertexName=scope-622, vertexId=vertex_1499292782644_0001_1_02
Vertex re-running, vertexName=scope-615, vertexId=vertex_1499292782644_0001_1_00
Vertex re-running, vertexName=scope-622, vertexId=vertex_1499292782644_0001_1_02
Vertex re-running, vertexName=scope-615, vertexId=vertex_1499292782644_0001_1_00
Vertex failed, vertexName=scope-623, vertexId=vertex_1499292782644_0001_1_03, diagnostics=[Task failed, taskId=task_1499292782644_0001_1_03_000034, diagnostics=[TaskAttempt 0 killed, TaskAttempt 1 failed, info=[Container container_1499292782644_0001_01_000255 finished with diagnostics set to [Container failed, exitCode=-100. Container released on a *lost* node]], TaskAttempt 2 killed, TaskAttempt 3 failed, info=[Error: Error while running task ( failure ) : org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in Fetcher {scope_615} #7
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:301)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:285)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: scope_615: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=12, pendingInputs=12, fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:977)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:718)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:376)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:260)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:178)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:191)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:54)
    ... 5 more
, errorMessage=Shuffle Runner Failed:org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in Fetcher {scope_615} #7
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:301)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:285)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: scope_615: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=12, pendingInputs=12, fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:977)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:718)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:376)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:260)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:178)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:191)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:54)
    ... 5 more
], TaskAttempt 4 killed, TaskAttempt 5 failed, info=[Error: Error while running task ( failure ) : org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in Fetcher {scope_615} #2
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:301)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:285)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed 15 times trying to download from scope-615_000020_00. threshold=15
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isAbortLimitExceeedFor(ShuffleScheduler.java:740)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:930)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:718)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupLocalDiskFetch(FetcherOrderedGrouped.java:696)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:175)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:191)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:54)
    ... 5 more
, errorMessage=Shuffle Runner Failed:org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in Fetcher {scope_615} #2
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:301)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:285)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed 15 times trying to download from scope-615_000020_00. threshold=15
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isAbortLimitExceeedFor(ShuffleScheduler.java:740)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:930)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:718)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupLocalDiskFetch(FetcherOrderedGrouped.java:696)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:175)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:191)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:54)
    ... 5 more
], TaskAttempt 6 failed, info=[Error: Error while running task ( failure ) : org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in Fetcher {scope_615} #3
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:301)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:285)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed 15 times trying to download from scope-615_000003_00. threshold=15
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isAbortLimitExceeedFor(ShuffleScheduler.java:740)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:930)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:718)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupLocalDiskFetch(FetcherOrderedGrouped.java:696)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:175)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:191)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:54)
    ... 5 more
, errorMessage=Shuffle Runner Failed:org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in Fetcher {scope_615} #3
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:301)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:285)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed 15 times trying to download from scope-615_000003_00. threshold=15
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isAbortLimitExceeedFor(ShuffleScheduler.java:740)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:930)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:718)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupLocalDiskFetch(FetcherOrderedGrouped.java:696)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:175)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:191)
    at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:54)
    ... 5 more
]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:51, Vertex vertex_1499292782644_0001_1_03 [scope-623] killed/failed due to:OWN_TASK_FAILURE]
Vertex killed, vertexName=scope-624, vertexId=vertex_1499292782644_0001_1_04, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:40, Vertex vertex_1499292782644_0001_1_04 [scope-624] killed/failed due to:OTHER_VERTEX_FAILURE]
Vertex killed, vertexName=scope-634, vertexId=vertex_1499292782644_0001_1_05, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:33, Vertex vertex_1499292782644_0001_1_05 [scope-634] killed/failed due to:OTHER_VERTEX_FAILURE]
Vertex killed, vertexName=scope-622, vertexId=vertex_1499292782644_0001_1_02, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:1, Vertex vertex_1499292782644_0001_1_02 [scope-622] killed/failed due to:OTHER_VERTEX_FAILURE]
Vertex killed, vertexName=scope-615, vertexId=vertex_1499292782644_0001_1_00, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:9, Vertex vertex_1499292782644_0001_1_00 [scope-615] killed/failed due to:OTHER_VERTEX_FAILURE]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:4, counters=Counters: 60
    org.apache.tez.common.counters.DAGCounter
        NUM_FAILED_TASKS=73
        NUM_KILLED_TASKS=213
        NUM_SUCCEEDED_TASKS=117
        TOTAL_LAUNCHED_TASKS=315
        RACK_LOCAL_TASKS=42
        AM_CPU_MILLISECONDS=2655350
        AM_GC_TIME_MILLIS=14774
    File System Counters
        FILE_BYTES_READ=2494679851
        FILE_BYTES_WRITTEN=4847503086
        FILE_READ_OPS=0
        FILE_LARGE_READ_OPS=0
        FILE_WRITE_OPS=0
        S3_BYTES_READ=11014106890
        S3_BYTES_WRITTEN=0
        S3_READ_OPS=0
        S3_LARGE_READ_OPS=0
        S3_WRITE_OPS=0
    org.apache.tez.common.counters.TaskCounter
        NUM_SPECULATIONS=71
        REDUCE_INPUT_GROUPS=320893
        REDUCE_INPUT_RECORDS=320958
        COMBINE_INPUT_RECORDS=0
        SPILLED_RECORDS=721418120
        NUM_SHUFFLED_INPUTS=494
        NUM_SKIPPED_INPUTS=0
        NUM_FAILED_SHUFFLE_INPUTS=0
        MERGED_MAP_OUTPUTS=494
        GC_TIME_MILLIS=113975
        CPU_MILLISECONDS=4706890
        PHYSICAL_MEMORY_BYTES=41962962944
        VIRTUAL_MEMORY_BYTES=212529692672
        COMMITTED_HEAP_BYTES=41962962944
        INPUT_RECORDS_PROCESSED=364674343
        INPUT_SPLIT_LENGTH_BYTES=11013650258
        OUTPUT_RECORDS=512016052
        OUTPUT_BYTES=10892611492
        OUTPUT_BYTES_WITH_OVERHEAD=5467887018
        OUTPUT_BYTES_PHYSICAL=2515253426
        ADDITIONAL_SPILLS_BYTES_WRITTEN=899759152
        ADDITIONAL_SPILLS_BYTES_READ=2332686174
        ADDITIONAL_SPILL_COUNT=56
        SHUFFLE_CHUNK_COUNT=78
        SHUFFLE_BYTES=4850739
        SHUFFLE_BYTES_DECOMPRESSED=7035446
        SHUFFLE_BYTES_TO_MEM=4223057
        SHUFFLE_BYTES_TO_DISK=0
        SHUFFLE_BYTES_DISK_DIRECT=627682
        NUM_MEM_TO_DISK_MERGES=0
        NUM_DISK_TO_DISK_MERGES=0
        SHUFFLE_PHASE_TIME=161720
        MERGE_PHASE_TIME=167514
        FIRST_EVENT_RECEIVED=3448
        LAST_EVENT_RECEIVED=154880
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    org.apache.hadoop.mapreduce.TaskCounter
        COMBINE_INPUT_RECORDS=191969
        COMBINE_OUTPUT_RECORDS=147020816
2017-07-06 11:26:11,379 INFO org.apache.hadoop.conf.Configuration.deprecation (PigTezLauncher-0): fs.default.name is deprecated. Instead, use fs.defaultFS
2017-07-06 11:26:11,430 INFO org.apache.pig.tools.pigstats.JobStats (PigTezLauncher-0): using output size reader: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.FileBasedOutputSizeReader
2017-07-06 11:26:11,553 WARN org.apache.pig.tools.pigstats.JobStats (PigTezLauncher-0): unable to find the output file
java.io.FileNotFoundException: File  does not exist.
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:972)
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:914)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.listStatus(EmrFileSystem.java:337)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.FileBasedOutputSizeReader.getOutputSize(FileBasedOutputSizeReader.java:81)
    at org.apache.pig.tools.pigstats.JobStats.getOutputSize(JobStats.java:351)
    at org.apache.pig.tools.pigstats.tez.TezVertexStats.addOutputStatistics(TezVertexStats.java:324)
    at org.apache.pig.tools.pigstats.tez.TezVertexStats.accumulateStats(TezVertexStats.java:207)
    at org.apache.pig.tools.pigstats.tez.TezDAGStats.accumulateStats(TezDAGStats.java:238)
    at org.apache.pig.tools.pigstats.tez.TezPigScriptStats.accumulateStats(TezPigScriptStats.java:187)
    at org.apache.pig.backend.hadoop.executionengine.tez.TezJob.run(TezJob.java:243)
    at org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher$1.run(TezLauncher.java:210)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
2017-07-06 11:26:12,493 WARN org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher (main): Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2017-07-06 11:26:12,505 INFO org.apache.pig.tools.pigstats.tez.TezPigScriptStats (main): Script Statistics:

       HadoopVersion: 2.7.3-amzn-2                                                                                        
          PigVersion: 0.16.0-amzn-0                                                                                       
          TezVersion: 0.8.4                                                                                               
              UserId: hadoop                                                                                              
            FileName: <path>/list_item_correlations.pig                   
           StartedAt: 2017-07-06 02:04:37                                                                                 
          FinishedAt: 2017-07-06 11:26:12                                                                                 
            Features: HASH_JOIN,GROUP_BY,FILTER,UNION                                                                     

Failed!


DAG 0:
                                    Name: PigLatin:correlations-0_scope-0                                                                     
                           ApplicationId: job_1499292782644_0001                                                                              
                      TotalLaunchedTasks: 315                                                                                                 
                           FileBytesRead: 2494679851                                                                                          
                        FileBytesWritten: 4847503086                                                                                          
                           HdfsBytesRead: 0                                                                                                   
                        HdfsBytesWritten: 0                                                                                                   
      SpillableMemoryManager spill count: 0                                                                                                   
                Bags proactively spilled: 0                                                                                                   
             Records proactively spilled: 0                                                                                                   

DAG Plan:
Tez vertex scope-615    ->  Tez vertex scope-619,Tez vertex scope-623,
Tez vertex scope-619    ->  Tez vertex scope-634,
Tez vertex scope-622    ->  Tez vertex scope-623,
Tez vertex scope-623    ->  Tez vertex scope-624,
Tez vertex scope-624    ->  Tez vertex scope-634,
Tez vertex scope-634

Vertex Stats:
VertexId Parallelism TotalTasks   InputRecords   ReduceInputRecords  OutputRecords  FileBytesRead FileBytesWritten  HdfsBytesRead HdfsBytesWritten Alias    Feature Outputs
scope-619         19         19              0               320958         320893        7701880          9427573              0                0 jnd,vote_counts  GROUP_BY    

Failed vertices:
VertexId  State Parallelism TotalTasks   InputRecords   ReduceInputRecords  OutputRecords  FileBytesRead FileBytesWritten  HdfsBytesRead HdfsBytesWritten Alias Feature Outputs
scope-615  KILLED       26         26      147020816                    0      294041632     1003592335       1936995311              0                0 grpd,pairs,vote_counts,vote_flags,votes1   MULTI_QUERY 
scope-622  KILLED       26         26      217653527                    0      217653527     1483385636       2901080202              0                0 pairs,votes2       
scope-623  FAILED       52         52              0                    0              0              0                0              0                0 counted,flags,grouped_flags,pairs,required_pairs   HASH_JOIN   
scope-624  KILLED       -1         40              0                    0              0              0                0              0                0 corr1,corr2,counted,fltrd,jnd  GROUP_BY,MULTI_QUERY    
scope-634  KILLED       -1         33              0                    0              0              0                0              0                0

EMR job failed: Shuffle failed with too many fetch failures and insufficient progress

0 answers: