I am testing my hadoop cluster which consists of 4 docker containers:
When I submit a map reduce job I notice connection issues once both map and reduce are at 100%. This then reaches the maximum number of re-tries before erroring and providing a stack trace. The weird thing is that the job finishes and provides an answer. However the node manager web interface shows a failed job. None of the question/answers I have found so far fix my particular issue.
All my machines have exposed the port range 50100:50200 to comply with the 'yarn.app.mapreduce.am.job.client.port-range' property.
The job I submit is
sudo -u hdfs hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.1.jar pi 1 1
This is the output:
Number of Maps = 1
Samples per Map = 1
Wrote input for Map #0
Starting Job
16/06/18 19:14:07 INFO client.RMProxy: Connecting to ResourceManager at resource-manager/172.19.0.2:8032
16/06/18 19:14:08 INFO input.FileInputFormat: Total input paths to process : 1
16/06/18 19:14:08 INFO mapreduce.JobSubmitter: number of splits:1
16/06/18 19:14:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1466277178029_0001
16/06/18 19:14:08 INFO impl.YarnClientImpl: Submitted application application_1466277178029_0001
16/06/18 19:14:08 INFO mapreduce.Job: The url to track the job: http://resource-manager:8088/proxy/application_1466277178029_0001/
16/06/18 19:14:08 INFO mapreduce.Job: Running job: job_1466277178029_0001
16/06/18 19:14:15 INFO mapreduce.Job: Job job_1466277178029_0001 running in uber mode : false
16/06/18 19:14:15 INFO mapreduce.Job: map 0% reduce 0%
16/06/18 19:14:19 INFO mapreduce.Job: map 100% reduce 0%
16/06/18 19:14:26 INFO mapreduce.Job: map 100% reduce 100%
16/06/18 19:14:32 INFO ipc.Client: Retrying connect to server: 01d3c03f829a/172.19.0.4:50100. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
16/06/18 19:14:33 INFO ipc.Client: Retrying connect to server: 01d3c03f829a/172.19.0.4:50100. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
16/06/18 19:14:34 INFO ipc.Client: Retrying connect to server: 01d3c03f829a/172.19.0.4:50100. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
16/06/18 19:14:36 INFO mapreduce.Job: map 0% reduce 0%
16/06/18 19:14:36 INFO mapreduce.Job: Job job_1466277178029_0001 failed with state FAILED due to: Application application_1466277178029_0001 failed 2 times due to AM Container for appattempt_1466277178029_0001_000002 exited with exitCode: 1
For more detailed output, check application tracking page:http://resource-manager:8088/proxy/application_1466277178029_0001/AThen, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1466277178029_0001_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
at org.apache.hadoop.util.Shell.run(Shell.java:478)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
16/06/18 19:14:36 INFO mapreduce.Job: Counters: 0
Job Finished in 28.862 seconds
Estimated value of Pi is 4.00000000000000000000
the container log has the following:
2016-06-18 19:14:32,273 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Created MRAppMaster for application appattempt_1466277178029_0001_000002
2016-06-18 19:14:32,443 WARN [main] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-06-18 19:14:32,475 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Executing with tokens:
2016-06-18 19:14:32,477 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Kind: YARN_AM_RM_TOKEN, Service: , Ident: (org.apache.hadoop.yarn.security.AMRMTokenIdentifier@3514a4c0)
2016-06-18 19:14:32,515 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Using mapred newApiCommitter.
2016-06-18 19:14:33,060 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Attempt num: 2 is last retry: true because a commit was started.
2016-06-18 19:14:33,061 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.job.event.JobEventType for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$NoopEventHandler
2016-06-18 19:14:33,067 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.jobhistory.EventType for class org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler
2016-06-18 19:14:33,068 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.rm.ContainerAllocator$EventType for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter
2016-06-18 19:14:33,118 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system is set solely by core-default.xml therefore - ignoring
2016-06-18 19:14:33,141 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system is set solely by core-default.xml therefore - ignoring
2016-06-18 19:14:33,162 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system is set solely by core-default.xml therefore - ignoring
2016-06-18 19:14:33,183 INFO [main] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Emitting job history data to the timeline server is not enabled
2016-06-18 19:14:33,185 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Will not try to recover. recoveryEnabled: true recoverySupportedByCommitter: false numReduceTasks: 1 shuffleKeyValidForRecovery: true ApplicationAttemptID: 2
2016-06-18 19:14:33,210 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system is set solely by core-default.xml therefore - ignoring
2016-06-18 19:14:33,212 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Previous history file is at hdfs://namenode:9000/user/hdfs/.staging/job_1466277178029_0001/job_1466277178029_0001_1.jhist
2016-06-18 19:14:33,621 INFO [main] org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.mapreduce.v2.app.job.event.JobFinishEvent$Type for class org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler
2016-06-18 19:14:33,640 WARN [main] org.apache.hadoop.metrics2.impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-mrappmaster.properties,hadoop-metrics2.properties
2016-06-18 19:14:33,689 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2016-06-18 19:14:33,689 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MRAppMaster metrics system started
2016-06-18 19:14:33,708 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: nodeBlacklistingEnabled:true
2016-06-18 19:14:33,708 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: maxTaskFailuresPerNode is 3
2016-06-18 19:14:33,708 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: blacklistDisablePercent is 33
2016-06-18 19:14:33,739 INFO [main] org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at resource-manager/172.19.0.2:8030
2016-06-18 19:14:33,814 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: maxContainerCapability: <memory:4096, vCores:4>
2016-06-18 19:14:33,814 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: queue: root.hdfs
2016-06-18 19:14:33,837 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system is set solely by core-default.xml therefore - ignoring
2016-06-18 19:14:33,840 INFO [main] org.apache.hadoop.mapreduce.jobhistory.JobHistoryCopyService: History file is at hdfs://namenode:9000/user/hdfs/.staging/job_1466277178029_0001/job_1466277178029_0001_1.jhist
2016-06-18 19:14:33,894 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Event Writer setup for JobId: job_1466277178029_0001, File: hdfs://namenode:9000/user/hdfs/.staging/job_1466277178029_0001/job_1466277178029_0001_2.jhist
2016-06-18 19:14:33,959 WARN [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException: Was asked to shut down.
2016-06-18 19:14:33,959 FATAL [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
java.io.IOException: Was asked to shut down.
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$4.run(MRAppMaster.java:1546)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1540)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1473)
2016-06-18 19:14:33,962 INFO [main] org.apache.hadoop.util.ExitUtil: Exiting with status 1
A few times it says 'Cannot locate configuration' or 'Default file system is set solely by core-default.xml'. Is this significant? In case this changes anything I am using the cloudera repo to install various hadoop services instead of unpacking a .tar.gz.
My config files are:
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode:9000</value>
</property>
<property>
<name>hadoop.proxyuser.mapred.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.mapred.hosts</name>
<value>*</value>
</property>
</configuration>
yar-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>resource-manager</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>resource-manager:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>resource-manager:8030</value>
</property>
<property>
<description>Classpath for typical applications.</description>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>file:///data/1/yarn/local,file:///data/2/yarn/local,file:///data/3/yarn/local</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>file:///data/1/yarn/logs,file:///data/2/yarn/logs,file:///data/3/yarn/logs</value>
</property>
<property>
<name>yarn.log.aggregation-enable</name>
<value>true</value>
</property>
<property>
<description>Where to aggregate logs</description>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>hdfs://namenode:8020/var/log/hadoop-yarn/apps</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>resource-manager:8088</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>resource-manager:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>resource-manager:8033</value>
</property>
<property>
<name>yarn.nodemanager.delete.debug-delay-sec</name>
<value>600</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
<description>Amount of physical memory, in MB, that can be allocated for containers.</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1000</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>namenode:8021</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/user</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>history-server:10020</value>
<description>Enter your JobHistoryServer hostname.</description>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>history-server:19888</value>
<description>Enter your JobHistoryServer hostname.</description>
</property>
<property>
<name>yarn.app.mapreduce.am.job.client.port-range</name>
<value>50100-50200</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.permissions.superusergroup</name>
<value>hadoop</value>
</property>
<property>
<name>dfs.name.dir or dfs.namenode.name.dir</name>
<value>file:///data/1/dfs/nn,file:///nfsmount/dfs/nn</value>
</property>
<property>
<name>dfs.data.dir or dfs.datanode.data.dir</name>
<value>file:///data/1/dfs/dn,file:///data/2/dfs/dn,file:///data/3/dfs/dn,file:///data/4/dfs/dn</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>namenode:50070</value>
<description>
The address and the base port on which the dfs NameNode Web UI will listen.
</description>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
Thanks for reading.
答案 0 :(得分:0)
对于遇到相同问题的任何人,解决方案是将以下内容添加到hdfs-site.xml:
<property>
<name>dfs.safemode.threshold.pct</name>
<value>0</value>
</property>