MapReduce job: ipc.Client retrying to connect

Date: 2016-06-18 19:58:17

Tags: hadoop docker mapreduce cloudera

I am testing my Hadoop cluster, which consists of four Docker containers:

  • Datanode
  • Secondary Namenode
  • Namenode
  • Resource Manager

When I submit a MapReduce job, I notice connection issues once both map and reduce reach 100%. The client then retries up to the maximum number of attempts before erroring out and printing a stack trace. The strange thing is that the job still finishes and returns an answer, yet the NodeManager web interface shows a failed job. None of the questions/answers I have found so far fix my particular issue.

All my machines expose the port range 50100-50200 to comply with the 'yarn.app.mapreduce.am.job.client.port-range' property.
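
For context, the containers are started roughly like this (a sketch only: the image name and container name below are placeholders, not my exact command); the relevant part is that the AM client port range is published:

    # Illustrative only: publish the AM client port range on a worker container.
    # "my-hadoop-worker" is a placeholder image name, not the actual image used.
    docker run -d \
      --name datanode \
      --hostname datanode \
      -p 50100-50200:50100-50200 \
      my-hadoop-worker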

The job I submit is:

    sudo -u hdfs hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.1.jar pi 1 1

This is the output:

    Number of Maps  = 1
    Samples per Map = 1
    Wrote input for Map #0
    Starting Job
    16/06/18 19:14:07 INFO client.RMProxy: Connecting to ResourceManager at resource-manager/172.19.0.2:8032
    16/06/18 19:14:08 INFO input.FileInputFormat: Total input paths to process : 1
    16/06/18 19:14:08 INFO mapreduce.JobSubmitter: number of splits:1
    16/06/18 19:14:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1466277178029_0001
    16/06/18 19:14:08 INFO impl.YarnClientImpl: Submitted application application_1466277178029_0001
    16/06/18 19:14:08 INFO mapreduce.Job: The url to track the job: http://resource-manager:8088/proxy/application_1466277178029_0001/
    16/06/18 19:14:08 INFO mapreduce.Job: Running job: job_1466277178029_0001
    16/06/18 19:14:15 INFO mapreduce.Job: Job job_1466277178029_0001 running in uber mode : false
    16/06/18 19:14:15 INFO mapreduce.Job:  map 0% reduce 0%
    16/06/18 19:14:19 INFO mapreduce.Job:  map 100% reduce 0%
    16/06/18 19:14:26 INFO mapreduce.Job:  map 100% reduce 100%
    16/06/18 19:14:32 INFO ipc.Client: Retrying connect to server: 01d3c03f829a/172.19.0.4:50100. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
    16/06/18 19:14:33 INFO ipc.Client: Retrying connect to server: 01d3c03f829a/172.19.0.4:50100. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
    16/06/18 19:14:34 INFO ipc.Client: Retrying connect to server: 01d3c03f829a/172.19.0.4:50100. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
    16/06/18 19:14:36 INFO mapreduce.Job:  map 0% reduce 0%
    16/06/18 19:14:36 INFO mapreduce.Job: Job job_1466277178029_0001 failed with state FAILED due to: Application application_1466277178029_0001 failed 2 times due to AM Container for appattempt_1466277178029_0001_000002 exited with  exitCode: 1
    For more detailed output, check application tracking page:http://resource-manager:8088/proxy/application_1466277178029_0001/Then, click on links to logs of each attempt.
    Diagnostics: Exception from container-launch.
    Container id: container_1466277178029_0001_02_000001
    Exit code: 1
    Stack trace: ExitCodeException exitCode=1: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
        at org.apache.hadoop.util.Shell.run(Shell.java:478)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)


    Container exited with a non-zero exit code 1
    Failing this attempt. Failing the application.
    16/06/18 19:14:36 INFO mapreduce.Job: Counters: 0
    Job Finished in 28.862 seconds
    Estimated value of Pi is 4.00000000000000000000

The container log contains the following:

    2016-06-18 19:14:32,273 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Created MRAppMaster for application appattempt_1466277178029_0001_000002
    2016-06-18 19:14:32,443 WARN [main] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    2016-06-18 19:14:32,475 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Executing with tokens:
    2016-06-18 19:14:32,477 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Kind: YARN_AM_RM_TOKEN, Service: , Ident: (org.apache.hadoop.yarn.security.AMRMTokenIdentifier@3514a4c0)
    2016-06-18 19:14:32,515 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Using mapred newApiCommitter.
    2016-06-18 19:14:33,060 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Attempt num: 2 is last retry: true because a commit was started.
    2016-06-18 19:14:33,061 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.job.event.JobEventType for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$NoopEventHandler
    2016-06-18 19:14:33,067 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.jobhistory.EventType for class org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler
    2016-06-18 19:14:33,068 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.rm.ContainerAllocator$EventType for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter
    2016-06-18 19:14:33,118 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system is set solely by core-default.xml therefore -  ignoring
    2016-06-18 19:14:33,141 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system is set solely by core-default.xml therefore -  ignoring
    2016-06-18 19:14:33,162 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system is set solely by core-default.xml therefore -  ignoring
    2016-06-18 19:14:33,183 INFO [main] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Emitting job history data to the timeline server is not enabled
    2016-06-18 19:14:33,185 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Will not try to recover. recoveryEnabled: true recoverySupportedByCommitter: false numReduceTasks: 1 shuffleKeyValidForRecovery: true ApplicationAttemptID: 2
    2016-06-18 19:14:33,210 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system is set solely by core-default.xml therefore -  ignoring
    2016-06-18 19:14:33,212 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Previous history file is at hdfs://namenode:9000/user/hdfs/.staging/job_1466277178029_0001/job_1466277178029_0001_1.jhist
    2016-06-18 19:14:33,621 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.job.event.JobFinishEvent$Type for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler
    2016-06-18 19:14:33,640 WARN [main] org.apache.hadoop.metrics2.impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-mrappmaster.properties,hadoop-metrics2.properties
    2016-06-18 19:14:33,689 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
    2016-06-18 19:14:33,689 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MRAppMaster metrics system started
    2016-06-18 19:14:33,708 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: nodeBlacklistingEnabled:true
    2016-06-18 19:14:33,708 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: maxTaskFailuresPerNode is 3
    2016-06-18 19:14:33,708 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: blacklistDisablePercent is 33
    2016-06-18 19:14:33,739 INFO [main] org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at resource-manager/172.19.0.2:8030
    2016-06-18 19:14:33,814 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: maxContainerCapability: <memory:4096, vCores:4>
    2016-06-18 19:14:33,814 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: queue: root.hdfs
    2016-06-18 19:14:33,837 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system is set solely by core-default.xml therefore -  ignoring
    2016-06-18 19:14:33,840 INFO [main] org.apache.hadoop.mapreduce.jobhistory.JobHistoryCopyService: History file is at hdfs://namenode:9000/user/hdfs/.staging/job_1466277178029_0001/job_1466277178029_0001_1.jhist
    2016-06-18 19:14:33,894 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Event Writer setup for JobId: job_1466277178029_0001, File: hdfs://namenode:9000/user/hdfs/.staging/job_1466277178029_0001/job_1466277178029_0001_2.jhist
    2016-06-18 19:14:33,959 WARN [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException: Was asked to shut down.
    2016-06-18 19:14:33,959 FATAL [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
    java.io.IOException: Was asked to shut down.
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$4.run(MRAppMaster.java:1546)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1540)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1473)
    2016-06-18 19:14:33,962 INFO [main] org.apache.hadoop.util.ExitUtil: Exiting with status 1
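
(In case it helps anyone reproducing this: if log aggregation is working, the same AM container log should also be retrievable from the command line once the application finishes. This is an illustrative invocation, not the exact one I used:)

    # Illustrative only: fetch the aggregated container logs for the failed application.
    # Assumes aggregation has completed and the job was submitted as the hdfs user.
    sudo -u hdfs yarn logs -applicationId application_1466277178029_0001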

A few times it says 'Cannot locate configuration' or 'Default file system is set solely by core-default.xml'. Is this significant? In case it changes anything, I am using the Cloudera repo to install the various Hadoop services rather than unpacking a .tar.gz.

My config files are:

core-site.xml

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode:9000</value>
      </property>
      <property>
        <name>hadoop.proxyuser.mapred.groups</name>
        <value>*</value>
      </property>
      <property>
        <name>hadoop.proxyuser.mapred.hosts</name>
        <value>*</value>
      </property>
    </configuration>

yarn-site.xml

    <configuration>
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>resource-manager</value>
      </property>
      <property>
        <name>yarn.resourcemanager.address</name>
        <value>resource-manager:8032</value>
      </property>
      <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>resource-manager:8030</value>
      </property>
      <property>
        <description>Classpath for typical applications.</description>
        <name>yarn.application.classpath</name>
        <value>
          $HADOOP_CONF_DIR,
          $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
          $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
          $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
          $HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
        </value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.nodemanager.local-dirs</name>
        <value>file:///data/1/yarn/local,file:///data/2/yarn/local,file:///data/3/yarn/local</value>
      </property>
      <property>
        <name>yarn.nodemanager.log-dirs</name>
        <value>file:///data/1/yarn/logs,file:///data/2/yarn/logs,file:///data/3/yarn/logs</value>
      </property>
      <property>
        <name>yarn.log.aggregation-enable</name>
        <value>true</value>
      </property>
      <property>
        <description>Where to aggregate logs</description>
        <name>yarn.nodemanager.remote-app-log-dir</name>
        <value>hdfs://namenode:8020/var/log/hadoop-yarn/apps</value>
      </property>
      <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>resource-manager:8088</value>
      </property>
      <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>resource-manager:8031</value>
      </property>
      <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>resource-manager:8033</value>
      </property>
      <property>
        <name>yarn.nodemanager.delete.debug-delay-sec</name>
        <value>600</value>
      </property>
      <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>4096</value>
        <description>Amount of physical memory, in MB, that can be allocated for containers.</description>
      </property>
      <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>1000</value>
      </property>
    </configuration>

mapred-site.xml

    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>
        <value>namenode:8021</value>
      </property>
      <property>
        <name>yarn.app.mapreduce.am.staging-dir</name>
        <value>/user</value>
      </property>
      <property>
        <name>mapreduce.jobhistory.address</name>
        <value>history-server:10020</value>
        <description>Enter your JobHistoryServer hostname.</description>
      </property>
      <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>history-server:19888</value>
        <description>Enter your JobHistoryServer hostname.</description>
      </property>
      <property>
        <name>yarn.app.mapreduce.am.job.client.port-range</name>
        <value>50100-50200</value>
      </property>
    </configuration>

hdfs-site.xml

    <configuration>
      <property>
        <name>dfs.permissions.superusergroup</name>
        <value>hadoop</value>
      </property>
      <property>
        <name>dfs.name.dir or dfs.namenode.name.dir</name>
        <value>file:///data/1/dfs/nn,file:///nfsmount/dfs/nn</value>
      </property>
      <property>
        <name>dfs.data.dir or dfs.datanode.data.dir</name>
        <value>file:///data/1/dfs/dn,file:///data/2/dfs/dn,file:///data/3/dfs/dn,file:///data/4/dfs/dn</value>
      </property>
      <property>
        <name>dfs.namenode.http-address</name>
        <value>namenode:50070</value>
        <description>
        The address and the base port on which the dfs NameNode Web UI will listen.
        </description>
      </property>
      <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
      </property>
    </configuration>

Thanks for reading.

1 Answer:

Answer 0: (score: 0)

For anyone who runs into the same problem, the solution was to add the following to hdfs-site.xml:

    <property>
      <name>dfs.safemode.threshold.pct</name>
      <value>0</value>
    </property>