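I have installed Apache Samza, which uses YARN to manage jobs. It runs on two Debian servers on virtual machines. Samza is version 0.9.1 and Hadoop is version 2.6.0. I am seeing two different problems. I'm not sure whether they are related, but in both cases YARN does not seem to be doing what it should.
In yarn-site.xml: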
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>kfk-samza01</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>128</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>3</value>
</property>
</configuration>
In the job's properties file I added the following:
yarn.container.memory.mb=256
yarn.am.container.memory.mb=256
task.opts= -Xms128M -Xmx128M
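When the job runs, I can see that the -Xms128M -Xmx128M options are ignored and the defaults are used.
I also see the following error. It looks like some memory limit prevents the job from moving from ACCEPTED to RUNNING, but I can't find out how to fix it.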
Container [pid=23007,containerID=container_1443454508386_0003_01_000001] is running beyond virtual memory limits. Current usage: 13.9 MB of 256 MB physical memory used; 1.1 GB of 537.6 MB virtual memory used. Killing container
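In fact, the job is just bare functionality, so none of my code should be introducing noise.
Any idea what the problem is?
Update: after staying in the ACCEPTED state for about 10 minutes, the application moves to the FAILED state. Here is part of what I see in the yarn-root-resourcemanager-kfk-samza01.out log: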
2015-09-30 14:08:07,000 INFO [ResourceManager Event Processor] resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(106)) - USER=root OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1443613686881_0001 CONTAINERID=container_1443613686881_0001_02_000001
2015-09-30 14:08:07,000 INFO [ResourceManager Event Processor] scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(153)) - Assigned container container_1443613686881_0001_02_000001 of capacity <memory:1024, vCores:1> on host kfk-samza01:44816, which has 1 containers, <memory:1024, vCores:1> used and <memory:7168, vCores:7> available after allocation
2015-09-30 14:08:07,001 INFO [ResourceManager Event Processor] capacity.LeafQueue (LeafQueue.java:assignContainer(1580)) - assignedContainer application attempt=appattempt_1443613686881_0001_000002 container=Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: null, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, numContainers=0 clusterResource=<memory:16384, vCores:16>
2015-09-30 14:08:07,002 INFO [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:assignContainersToChildQueues(559)) - Re-sorting assigned queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:1024, vCores:1>, usedCapacity=0.0625, absoluteUsedCapacity=0.0625, numApps=1, numContainers=1
2015-09-30 14:08:07,002 INFO [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:assignContainers(424)) - assignedContainer queue=root usedCapacity=0.0625 absoluteUsedCapacity=0.0625 used=<memory:1024, vCores:1> cluster=<memory:16384, vCores:16>
2015-09-30 14:08:07,005 INFO [AsyncDispatcher event handler] security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:createAndGetNMToken(200)) - Sending NMToken for nodeId : kfk-samza01:44816 for container : container_1443613686881_0001_02_000001
2015-09-30 14:08:07,009 INFO [AsyncDispatcher event handler] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(408)) - container_1443613686881_0001_02_000001 Container Transitioned from ALLOCATED to ACQUIRED
2015-09-30 14:08:07,009 INFO [AsyncDispatcher event handler] security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:clearNodeSetForAttempt(146)) - Clear node set for appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,010 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:storeAttempt(1830)) - Storing attempt: AppId: application_1443613686881_0001 AttemptId: appattempt_1443613686881_0001_000002 MasterContainer: Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.15.92:44816 }, ]
2015-09-30 14:08:07,010 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from SCHEDULED to ALLOCATED_SAVING
2015-09-30 14:08:07,011 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from ALLOCATED_SAVING to ALLOCATED
2015-09-30 14:08:07,012 INFO [pool-1-thread-3] amlauncher.AMLauncher (AMLauncher.java:run(253)) - Launching masterappattempt_1443613686881_0001_000002
2015-09-30 14:08:07,018 INFO [pool-1-thread-3] amlauncher.AMLauncher (AMLauncher.java:launch(106)) - Setting up container Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.15.92:44816 }, ] for AM appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,019 INFO [pool-1-thread-3] amlauncher.AMLauncher (AMLauncher.java:createAMContainerLaunchContext(191)) - Command to launch container container_1443613686881_0001_02_000001 : export SAMZA_LOG_DIR=<LOG_DIR> && ln -sfn <LOG_DIR> logs && exec ./__package/bin/run-am.sh 1>logs/stdout 2>logs/stderr
2015-09-30 14:08:07,020 INFO [pool-1-thread-3] security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createAndGetAMRMToken(195)) - Create AMRMToken for ApplicationAttempt: appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,020 INFO [pool-1-thread-3] security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createPassword(307)) - Creating password for appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,064 INFO [pool-1-thread-3] amlauncher.AMLauncher (AMLauncher.java:launch(127)) - Done launching container Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.15.92:44816 }, ] for AM appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,065 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from ALLOCATED to LAUNCHED
2015-09-30 14:08:08,001 INFO [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(408)) - container_1443613686881_0001_02_000001 Container Transitioned from ACQUIRED to RUNNING
2015-09-30 14:21:26,930 INFO [Ping Checker] util.AbstractLivelinessMonitor (AbstractLivelinessMonitor.java:run(127)) - Expired:appattempt_1443613686881_0001_000002 Timed out after 600 secs
2015-09-30 14:21:26,931 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:rememberTargetTransitionsAndStoreState(1125)) - Updating application attempt appattempt_1443613686881_0001_000002 with final state: FAILED, and exit status: -1000
2015-09-30 14:21:26,931 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from LAUNCHED to FINAL_SAVING
2015-09-30 14:21:26,932 INFO [AsyncDispatcher event handler] resourcemanager.ApplicationMasterService (ApplicationMasterService.java:unregisterAttempt(677)) - Unregistering app attempt : appattempt_1443613686881_0001_000002
2015-09-30 14:21:26,932 INFO [AsyncDispatcher event handler] security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:applicationMasterFinished(124)) - Application finished, removing password for appattempt_1443613686881_0001_000002
2015-09-30 14:21:26,933 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from FINAL_SAVING to FAILED
2015-09-30 14:21:26,933 INFO [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:transition(1208)) - The number of failed attempts is 2. The max attempts is 2
2015-09-30 14:21:26,935 INFO [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:rememberTargetTransitionsAndStoreState(995)) - Updating application application_1443613686881_0001 with final state: FAILED
2015-09-30 14:21:26,937 INFO [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(721)) - application_1443613686881_0001 State change from ACCEPTED to FINAL_SAVING
2015-09-30 14:21:26,938 INFO [ResourceManager Event Processor] capacity.CapacityScheduler (CapacityScheduler.java:doneApplicationAttempt(790)) - Application Attempt appattempt_1443613686881_0001_000002 is done. finalState=FAILED
2015-09-30 14:21:26,938 INFO [AsyncDispatcher event handler] recovery.RMStateStore (RMStateStore.java:transition(161)) - Updating info for app: application_1443613686881_0001
2015-09-30 14:21:26,939 INFO [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(408)) - container_1443613686881_0001_02_000001 Container Transitioned from RUNNING to KILLED
2015-09-30 14:21:26,939 INFO [ResourceManager Event Processor] fica.FiCaSchedulerApp (FiCaSchedulerApp.java:containerCompleted(113)) - Completed container: container_1443613686881_0001_02_000001 in state: KILLED event:KILL
2015-09-30 14:21:26,939 INFO [ResourceManager Event Processor] resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(106)) - USER=root OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1443613686881_0001 CONTAINERID=container_1443613686881_0001_02_000001
2015-09-30 14:21:26,940 INFO [ResourceManager Event Processor] scheduler.SchedulerNode (SchedulerNode.java:releaseContainer(216)) - Released container container_1443613686881_0001_02_000001 of capacity <memory:1024, vCores:1> on host kfk-samza01:44816, which currently has 0 containers, <memory:0, vCores:0> used and <memory:8192, vCores:8> available, release resources=true
2015-09-30 14:21:26,940 INFO [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:transition(945)) - Application application_1443613686881_0001 failed 2 times due to ApplicationMaster for attempt appattempt_1443613686881_0001_000002 timed out. Failing the application.
2015-09-30 14:21:26,940 INFO [ResourceManager Event Processor] capacity.LeafQueue (LeafQueue.java:releaseResource(1732)) - default used=<memory:0, vCores:0> numContainers=0 user=root user-resources=<memory:0, vCores:0>
2015-09-30 14:21:26,943 INFO [ResourceManager Event Processor] capacity.LeafQueue (LeafQueue.java:completedContainer(1683)) - completedContainer container=Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.15.92:44816 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, numContainers=0 cluster=<memory:16384, vCores:16>
2015-09-30 14:21:26,943 INFO [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:completedContainer(604)) - completedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0, vCores:0> cluster=<memory:16384, vCores:16>
2015-09-30 14:21:26,944 INFO [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:completedContainer(622)) - Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, numContainers=0
2015-09-30 14:21:26,944 INFO [ResourceManager Event Processor] capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1274)) - Application attempt appattempt_1443613686881_0001_000002 released container container_1443613686881_0001_02_000001 on node: host: kfk-samza01:44816 #containers=0 available=8192 used=0 with event: KILL
2015-09-30 14:21:26,945 INFO [ResourceManager Event Processor] scheduler.AppSchedulingInfo (AppSchedulingInfo.java:clearRequests(115)) - Application application_1443613686881_0001 requests cleared
2015-09-30 14:21:26,945 INFO [ResourceManager Event Processor] capacity.LeafQueue (LeafQueue.java:removeApplicationAttempt(682)) - Application removed - appId: application_1443613686881_0001 user: root queue: default #user-pending-applications: 0 #user-active-applications: 0 #queue-pending-applications: 0 #queue-active-applications: 0
2015-09-30 14:21:26,946 INFO [pool-1-thread-4] amlauncher.AMLauncher (AMLauncher.java:run(267)) - Cleaning master appattempt_1443613686881_0001_000002
2015-09-30 14:21:26,948 INFO [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(721)) - application_1443613686881_0001 State change from FINAL_SAVING to FAILED
2015-09-30 14:21:26,949 INFO [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:removeApplication(372)) - Application removed - appId: application_1443613686881_0001 user: root leaf-queue of parent: root #applications: 0
2015-09-30 14:21:26,951 WARN [AsyncDispatcher event handler] resourcemanager.RMAuditLogger (RMAuditLogger.java:logFailure(263)) - USER=root OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE DESCRIPTION=App failed with state: FAILED PERMISSIONS=Application application_1443613686881_0001 failed 2 times due to ApplicationMaster for attempt appattempt_1443613686881_0001_000002 timed out. Failing the application. APPID=application_1443613686881_0001
2015-09-30 14:21:26,955 INFO [AsyncDispatcher event handler] resourcemanager.RMAppManager$ApplicationSummary (RMAppManager.java:logAppSummary(179)) - appId=application_1443613686881_0001,name=flow.Router_1,user=root,queue=default,state=FAILED,trackingUrl=http://kfk-samza01:8088/cluster/app/application_1443613686881_0001,appMasterHost=N/A,startTime=1443614243319,finishTime=1443615686935,finalStatus=FAILED
Any clue about what is happening?
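For anyone reading a ResourceManager log like the one above, a minimal sketch for pulling out the state transitions and the time between them may help. The regular expression is an assumption based only on the line layout shown in this excerpt, and the names `TRANSITION` and `transitions` are hypothetical helpers, not part of YARN.

```python
import re
from datetime import datetime

# Assumed layout: "<date> <time>,<ms> LEVEL [thread] class (file) - <id> ... from A to B"
TRANSITION = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*? - (\S+) "
    r"(?:State change|Container Transitioned) from (\w+) to (\w+)"
)


def transitions(lines):
    """Yield (timestamp, entity id, old state, new state) for matching lines."""
    for line in lines:
        match = TRANSITION.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            yield stamp, match.group(2), match.group(3), match.group(4)


# Two lines copied from the log excerpt above.
sample = [
    "2015-09-30 14:08:07,065 INFO [AsyncDispatcher event handler] "
    "attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - "
    "appattempt_1443613686881_0001_000002 State change from ALLOCATED to LAUNCHED",
    "2015-09-30 14:21:26,931 INFO [AsyncDispatcher event handler] "
    "attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - "
    "appattempt_1443613686881_0001_000002 State change from LAUNCHED to FINAL_SAVING",
]

events = list(transitions(sample))
elapsed = (events[-1][0] - events[0][0]).total_seconds()
for stamp, entity, old, new in events:
    print(stamp.time(), entity, old, "->", new)
print("seconds between first and last transition:", elapsed)
```

Here the attempt sits in LAUNCHED for well over ten minutes before it is failed, which matches the "Timed out after 600 secs" line in the log.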
Answer 0 (score: 1)
Try using the following job configuration properties to limit the container memory allocation:
mapreduce.map.memory.mb
mapreduce.reduce.memory.mb
In your case, both of these properties can be set to 256 MB.
Also configure the following two properties:
mapreduce.map.java.opts
mapreduce.reduce.java.opts
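In your case, the heap size in both of these properties should be 128 MB (for example, -Xmx128m).
[Note: the two *.java.opts values above must be somewhat lower than the corresponding *.memory.mb properties.]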
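As a rough way to check the rule in the note, the sketch below parses a JVM options string and compares the -Xmx heap with the container allocation. The helper names `heap_mb` and `heap_fits` are hypothetical, and the parser only handles sizes written with a k, m, or g suffix; it is an illustration, not something YARN or Samza provides.

```python
import re

# Conversion factors from the JVM size suffix to megabytes.
_UNITS = {"k": 1 / 1024, "m": 1, "g": 1024}


def heap_mb(jvm_opts, flag="-Xmx"):
    """Return the size given by `flag` in MB, or None if it is absent.

    Sizes without a k/m/g suffix (plain bytes) are not handled and give None.
    """
    match = re.search(re.escape(flag) + r"(\d+)([kKmMgG])", jvm_opts)
    if match is None:
        return None
    value, unit = match.groups()
    return int(value) * _UNITS[unit.lower()]


def heap_fits(jvm_opts, container_mb):
    """True when the -Xmx heap is strictly below the container allocation."""
    xmx = heap_mb(jvm_opts)
    return xmx is not None and xmx < container_mb


# Values from the question: 128 MB heap inside a 256 MB container.
print(heap_fits("-Xms128M -Xmx128M", 256))
# A heap equal to the container size leaves no room for non-heap memory.
print(heap_fits("-Xmx256m", 256))
```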
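If you still run into virtual memory problems, try changing the virtual-to-physical memory ratio by configuring the following property:
yarn.nodemanager.vmem-pmem-ratio
It defaults to 2.1. The virtual memory limit is the container's physical memory allocation multiplied by this ratio, so raising the value (not lowering it) gives the container more virtual memory headroom.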
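For reference, the numbers in the error message from the question follow directly from this ratio. The sketch below reproduces them; the function names are hypothetical, and the 1.1 GB figure is the rounded value from the log, so the resulting ratio is only approximate.

```python
def vmem_limit_mb(container_mb, vmem_pmem_ratio=2.1):
    """Virtual memory limit for a container: physical allocation times the ratio."""
    return container_mb * vmem_pmem_ratio


def min_ratio(container_mb, vmem_used_mb):
    """Smallest ratio at which the observed virtual memory usage fits."""
    return vmem_used_mb / container_mb


# 256 MB container with the default ratio: the "537.6 MB" in the error message.
limit = vmem_limit_mb(256)
print(f"limit: {limit:.1f} MB")

# The container used about 1.1 GB of virtual memory (rounded in the log).
needed = min_ratio(256, 1.1 * 1024)
print(f"ratio needed: about {needed:.1f}")
```

With a 256 MB container, the ratio would need to be roughly 4.4 or higher for 1.1 GB of virtual memory to fit; a larger container allocation raises the limit in the same way.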
Once these properties are set correctly, your containers should be cleaned up after completing successfully.
Hope this helps.
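Answer 1 (score: 1)
In the end, I had two problems in parallel. One was the memory limit, which hserus has already explained and which is now solved.
The other was a communication problem with the Kafka servers, which had caused the topics to become corrupted, so the job could not run.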