How to debug why a map job fails after multiple retries

Date: 2017-06-16 00:59:10

Tags: java mapreduce hbase yarn mapper

I wrote a MapReduce job that scans an HBase table over a specific time range to count certain elements we need for analysis.

The mappers in the MR job keep failing, but I don't know why. A different number of mappers seems to fail each time I run the job. The YARN logs from Cloudera Manager (see below) haven't helped me pinpoint the problem, although someone suggested I might be running out of memory.

It seems to retry multiple times, but fails every time. What do I need to do to stop it from failing, or how can I log things to help me better determine what is happening?

Below is the log of one failed mapper in YARN:

    Error: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:
    Thu Jun 15 16:26:57 PDT 2017, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=60301: row '152_p3401.db161139.sjc102.dbi_1496271480' on table 'dbi_based_data' at region=dbi_based_data,151_p3413.db162024.iad4.dbi_1476974340,1486675565213.d83250d0682e648d165872afe5abd60e., hostname=hslave35118.ams9.mysecretdomain.com,60020,1483570489305, seqNum=19308931
        at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:276)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:207)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
        at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
        at org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:403)
        at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:364)
        at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:236)
        at org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:147)
        at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase$1.nextKeyValue(TableInputFormatBase.java:216)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:556)
        at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
    Caused by: java.net.SocketTimeoutException: callTimeout=60000, callDuration=60301: row '152_p3401.db161139.sjc102.dbi_1496271480' on table 'dbi_based_data' at region=dbi_based_data,151_p3413.db162024.iad4.dbi_1476974340,1486675565213.d83250d0682e648d165872afe5abd60e., hostname=hslave35118.ams9.mysecretdomain.com,60020,1483570489305, seqNum=19308931
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
        at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.io.IOException: Call to hslave35118.ams9.mysecretdomain.com/10.216.35.118:60020 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=12, waitTime=60001, operationTimeout=60000 expired.
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient.wrapException(AbstractRpcClient.java:291)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1272)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:226)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:331)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:34094)
        at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:219)
        at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:64)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:360)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:334)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
        ... 4 more
    Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=12, waitTime=60001, operationTimeout=60000 expired.
        at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:73)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1246)
        ... 13 more
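To get more detail than what Cloudera Manager shows, the full container logs (mapper stdout/stderr/syslog) can be pulled with the `yarn` CLI once the application finishes. A sketch of the commands, assuming log aggregation is enabled on the cluster; the application ID below is a placeholder for the one shown in Cloudera Manager or the ResourceManager UI:

    # Find the failed application's ID
    yarn application -list -appStates FAILED,FINISHED

    # Dump all container logs for that application to a file for inspection
    yarn logs -applicationId application_1483570489305_0001 > job_logs.txt

Searching the dumped logs for the first `Caused by:` of each failed attempt usually identifies whether every mapper dies the same way (here: a scanner RPC timeout) or whether failures are mixed, e.g. some OOM kills.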

1 Answer:

Answer 0 (score: 0)

So it looks like my situation needed extended timeout settings. In my Java program I had to add the following lines to make the exception go away:

    conf.set("hbase.rpc.timeout","90000");
    conf.set("hbase.client.scanner.timeout.period","90000");
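The same two properties can also be set cluster-wide (or per-client) in `hbase-site.xml` instead of in code; a sketch of the equivalent configuration fragment, assuming the same 90-second values as above:

    <property>
      <name>hbase.rpc.timeout</name>
      <value>90000</value>
    </property>
    <property>
      <name>hbase.client.scanner.timeout.period</name>
      <value>90000</value>
    </property>

Note that raising timeouts only masks slow scans; the `conf.set` values must be applied before the HBase connection/job is created for them to take effect.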

Found the answer on this link on Cloudera's site.