纱线java进程未被杀死

时间:2015-09-29 07:55:26

标签: hadoop yarn apache-samza

我安装了Apache Samza,它使用Yarn来管理作业。它在虚拟机上的两个Debian服务器上运行。 Samza是0.9.1版。 Hadoop是2.6.0版。我看到两个不同的问题,我不确定它们是否相关,但看起来Yarn都没有做它应该做的事情。

  • 当我尝试使用samza(kill-yarn-job.sh)提供的脚本杀死作业时,我在Web界面中看到作业将状态从Running或Accepted更改为Killed,但java进程仍在运行。经过很长一段时间,杀死他们的唯一方法就是这么做:杀死-9。
  • 虽然我一直在改变yarn-site.xml值,但我不能只运行一个作业。 我的机器有4 Gb内存和4个cpu内核。这是
  • 的内容

纱-site.xml中:

<configuration>
<property>
 <name>yarn.resourcemanager.hostname</name>
 <value>kfk-samza01</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>128</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>3</value>
</property>
</configuration>

在我配置的作业选项文件中添加了以下内容:

yarn.container.memory.mb=256
yarn.am.container.memory.mb=256

task.opts= -Xms128M -Xmx128M

当作业运行时,我可以看到-Xms128M -Xmx128M选项被忽略并使用默认值。

我看到以下错误。看起来一些内存限制阻止了从Accepted到Running的工作,但是我找不到如何解决它。

Container [pid=23007,containerID=container_1443454508386_0003_01_000001] is running beyond virtual memory limits. Current usage: 13.9 MB of 256 MB physical memory used; 1.1 GB of 537.6 MB virtual memory used. Killing container

实际上,作业只是干净的功能,所以我的代码都不应该引入噪音。

知道问题是什么?

更新: 在ACCEPTED状态下停留约10分钟后,它将进入失败状态。 以下是我在yarn-root-resourcemanager-kfk-samza01.out日志中看到的内容的一部分:

2015-09-30 14:08:07,000 INFO  [ResourceManager Event Processor] resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(106)) - USER=root  OPERATION=AM Allocated Container     TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1443613686881_0001    CONTAINERID=container_1443613686881_0001_02_000001
2015-09-30 14:08:07,000 INFO  [ResourceManager Event Processor] scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(153)) - Assigned container container_1443613686881_0001_02_000001 of capacity <memory:1024, vCores:1> on host kfk-samza01:44816, which has 1 containers, <memory:1024, vCores:1> used and <memory:7168, vCores:7> available after allocation
2015-09-30 14:08:07,001 INFO  [ResourceManager Event Processor] capacity.LeafQueue (LeafQueue.java:assignContainer(1580)) - assignedContainer application attempt=appattempt_1443613686881_0001_000002 container=Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: null, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, numContainers=0 clusterResource=<memory:16384, vCores:16>
2015-09-30 14:08:07,002 INFO  [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:assignContainersToChildQueues(559)) - Re-sorting assigned queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:1024, vCores:1>, usedCapacity=0.0625, absoluteUsedCapacity=0.0625, numApps=1, numContainers=1
2015-09-30 14:08:07,002 INFO  [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:assignContainers(424)) - assignedContainer queue=root usedCapacity=0.0625 absoluteUsedCapacity=0.0625 used=<memory:1024, vCores:1> cluster=<memory:16384, vCores:16>
2015-09-30 14:08:07,005 INFO  [AsyncDispatcher event handler] security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:createAndGetNMToken(200)) - Sending NMToken for nodeId : kfk-samza01:44816 for container : container_1443613686881_0001_02_000001
2015-09-30 14:08:07,009 INFO  [AsyncDispatcher event handler] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(408)) - container_1443613686881_0001_02_000001 Container Transitioned from ALLOCATED to ACQUIRED
2015-09-30 14:08:07,009 INFO  [AsyncDispatcher event handler] security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:clearNodeSetForAttempt(146)) - Clear node set for appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,010 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:storeAttempt(1830)) - Storing attempt: AppId: application_1443613686881_0001 AttemptId: appattempt_1443613686881_0001_000002 MasterContainer: Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.15.92:44816 }, ]
2015-09-30 14:08:07,010 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from SCHEDULED to ALLOCATED_SAVING
2015-09-30 14:08:07,011 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from ALLOCATED_SAVING to ALLOCATED
2015-09-30 14:08:07,012 INFO  [pool-1-thread-3] amlauncher.AMLauncher (AMLauncher.java:run(253)) - Launching masterappattempt_1443613686881_0001_000002
2015-09-30 14:08:07,018 INFO  [pool-1-thread-3] amlauncher.AMLauncher (AMLauncher.java:launch(106)) - Setting up container Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.15.92:44816 }, ] for AM appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,019 INFO  [pool-1-thread-3] amlauncher.AMLauncher (AMLauncher.java:createAMContainerLaunchContext(191)) - Command to launch container container_1443613686881_0001_02_000001 : export SAMZA_LOG_DIR=<LOG_DIR> && ln -sfn <LOG_DIR> logs && exec ./__package/bin/run-am.sh 1>logs/stdout 2>logs/stderr
2015-09-30 14:08:07,020 INFO  [pool-1-thread-3] security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createAndGetAMRMToken(195)) - Create AMRMToken for ApplicationAttempt: appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,020 INFO  [pool-1-thread-3] security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createPassword(307)) - Creating password for appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,064 INFO  [pool-1-thread-3] amlauncher.AMLauncher (AMLauncher.java:launch(127)) - Done launching container Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.15.92:44816 }, ] for AM appattempt_1443613686881_0001_000002
2015-09-30 14:08:07,065 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from ALLOCATED to LAUNCHED
2015-09-30 14:08:08,001 INFO  [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(408)) - container_1443613686881_0001_02_000001 Container Transitioned from ACQUIRED to RUNNING
2015-09-30 14:21:26,930 INFO  [Ping Checker] util.AbstractLivelinessMonitor (AbstractLivelinessMonitor.java:run(127)) - Expired:appattempt_1443613686881_0001_000002 Timed out after 600 secs
2015-09-30 14:21:26,931 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:rememberTargetTransitionsAndStoreState(1125)) - Updating application attempt appattempt_1443613686881_0001_000002 with final state: FAILED, and exit status: -1000
2015-09-30 14:21:26,931 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from LAUNCHED to FINAL_SAVING
2015-09-30 14:21:26,932 INFO  [AsyncDispatcher event handler] resourcemanager.ApplicationMasterService (ApplicationMasterService.java:unregisterAttempt(677)) - Unregistering app attempt : appattempt_1443613686881_0001_000002
2015-09-30 14:21:26,932 INFO  [AsyncDispatcher event handler] security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:applicationMasterFinished(124)) - Application finished, removing password for appattempt_1443613686881_0001_000002
2015-09-30 14:21:26,933 INFO  [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(762)) - appattempt_1443613686881_0001_000002 State change from FINAL_SAVING to FAILED
2015-09-30 14:21:26,933 INFO  [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:transition(1208)) - The number of failed attempts is 2. The max attempts is 2
2015-09-30 14:21:26,935 INFO  [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:rememberTargetTransitionsAndStoreState(995)) - Updating application application_1443613686881_0001 with final state: FAILED
2015-09-30 14:21:26,937 INFO  [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(721)) - application_1443613686881_0001 State change from ACCEPTED to FINAL_SAVING
2015-09-30 14:21:26,938 INFO  [ResourceManager Event Processor] capacity.CapacityScheduler (CapacityScheduler.java:doneApplicationAttempt(790)) - Application Attempt appattempt_1443613686881_0001_000002 is done. finalState=FAILED
2015-09-30 14:21:26,938 INFO  [AsyncDispatcher event handler] recovery.RMStateStore (RMStateStore.java:transition(161)) - Updating info for app: application_1443613686881_0001
2015-09-30 14:21:26,939 INFO  [ResourceManager Event Processor] rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(408)) - container_1443613686881_0001_02_000001 Container Transitioned from RUNNING to KILLED
2015-09-30 14:21:26,939 INFO  [ResourceManager Event Processor] fica.FiCaSchedulerApp (FiCaSchedulerApp.java:containerCompleted(113)) - Completed container: container_1443613686881_0001_02_000001 in state: KILLED event:KILL
2015-09-30 14:21:26,939 INFO  [ResourceManager Event Processor] resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(106)) - USER=root  OPERATION=AM Released Container      TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1443613686881_0001    CONTAINERID=container_1443613686881_0001_02_000001
2015-09-30 14:21:26,940 INFO  [ResourceManager Event Processor] scheduler.SchedulerNode (SchedulerNode.java:releaseContainer(216)) - Released container container_1443613686881_0001_02_000001 of capacity <memory:1024, vCores:1> on host kfk-samza01:44816, which currently has 0 containers, <memory:0, vCores:0> used and <memory:8192, vCores:8> available, release resources=true
2015-09-30 14:21:26,940 INFO  [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:transition(945)) - Application application_1443613686881_0001 failed 2 times due to ApplicationMaster for attempt appattempt_1443613686881_0001_000002 timed out. Failing the application.
2015-09-30 14:21:26,940 INFO  [ResourceManager Event Processor] capacity.LeafQueue (LeafQueue.java:releaseResource(1732)) - default used=<memory:0, vCores:0> numContainers=0 user=root user-resources=<memory:0, vCores:0>
2015-09-30 14:21:26,943 INFO  [ResourceManager Event Processor] capacity.LeafQueue (LeafQueue.java:completedContainer(1683)) - completedContainer container=Container: [ContainerId: container_1443613686881_0001_02_000001, NodeId: kfk-samza01:44816, NodeHttpAddress: kfk-samza01:8042, Resource: <memory:1024, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.15.92:44816 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, numContainers=0 cluster=<memory:16384, vCores:16>
2015-09-30 14:21:26,943 INFO  [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:completedContainer(604)) - completedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0, vCores:0> cluster=<memory:16384, vCores:16>
2015-09-30 14:21:26,944 INFO  [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:completedContainer(622)) - Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, numContainers=0
2015-09-30 14:21:26,944 INFO  [ResourceManager Event Processor] capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1274)) - Application attempt appattempt_1443613686881_0001_000002 released container container_1443613686881_0001_02_000001 on node: host: kfk-samza01:44816 #containers=0 available=8192 used=0 with event: KILL
2015-09-30 14:21:26,945 INFO  [ResourceManager Event Processor] scheduler.AppSchedulingInfo (AppSchedulingInfo.java:clearRequests(115)) - Application application_1443613686881_0001 requests cleared
2015-09-30 14:21:26,945 INFO  [ResourceManager Event Processor] capacity.LeafQueue (LeafQueue.java:removeApplicationAttempt(682)) - Application removed - appId: application_1443613686881_0001 user: root queue: default #user-pending-applications: 0 #user-active-applications: 0 #queue-pending-applications: 0 #queue-active-applications: 0
2015-09-30 14:21:26,946 INFO  [pool-1-thread-4] amlauncher.AMLauncher (AMLauncher.java:run(267)) - Cleaning master appattempt_1443613686881_0001_000002
2015-09-30 14:21:26,948 INFO  [AsyncDispatcher event handler] rmapp.RMAppImpl (RMAppImpl.java:handle(721)) - application_1443613686881_0001 State change from FINAL_SAVING to FAILED
2015-09-30 14:21:26,949 INFO  [ResourceManager Event Processor] capacity.ParentQueue (ParentQueue.java:removeApplication(372)) - Application removed - appId: application_1443613686881_0001 user: root leaf-queue of parent: root #applications: 0
2015-09-30 14:21:26,951 WARN  [AsyncDispatcher event handler] resourcemanager.RMAuditLogger (RMAuditLogger.java:logFailure(263)) - USER=root    OPERATION=Application Finished - Failed      TARGET=RMAppManager     RESULT=FAILURE  DESCRIPTION=App failed with state: FAILED       PERMISSIONS=Application application_1443613686881_0001 failed 2 times due to ApplicationMaster for attempt appattempt_1443613686881_0001_000002 timed out. Failing the application.  APPID=application_1443613686881_0001
2015-09-30 14:21:26,955 INFO  [AsyncDispatcher event handler] resourcemanager.RMAppManager$ApplicationSummary (RMAppManager.java:logAppSummary(179)) - appId=application_1443613686881_0001,name=flow.Router_1,user=root,queue=default,state=FAILED,trackingUrl=http://kfk-samza01:8088/cluster/app/application_1443613686881_0001,appMasterHost=N/A,startTime=1443614243319,finishTime=1443615686935,finalStatus=FAILED

有关正在发生的事情的任何线索?

2 个答案:

答案 0 :(得分:1)

请尝试使用以下作业配置属性来限制容器内存分配。

mapreduce.map.memory.mb
mapreduce.reduce.memory.mb

根据您的情况,这两个属性值可以是256MB

并配置以下两个属性

mapreduce.map.java.opts
mapreduce.reduce.java.opts

根据您的情况,这两个属性的值应为128MB

[注意:以上两个*.java.opts值必须略低于相应的*.memory.mb属性

如果您仍然继续遇到虚拟内存问题,请尝试通过配置以下属性来降低虚拟内存分配的配给值。

yarn.nodemanager.vmem-pmem-ratio

默认为2.1,如果您仍然遇到虚拟内存问题,请尝试减少它。

正确设置这些属性后,您将在成功完成后清除容器。

希望这有帮助。

答案 1 :(得分:1)

最后,我有两个并行的问题。一,hserus已经解释过已经解决的内存限制。

另一个是kafka服务器的通信问题,这些服务器已经引发了主题的破坏,因此作业无法运行。