I am using Apache Hadoop 2.6.0 YARN, and I am trying to test dynamically adding and removing nodes in a cluster.
The test starts a job with 2 nodes; while the job is in progress, it removes one of the nodes by killing the DataNode and NodeManager daemons* (is it OK to remove a node this way?).
*That node is definitely not running the ResourceManager / ApplicationMaster.
After the node has been removed successfully (which I can confirm from the ResourceManager log attached below), the test adds it back and waits for the job to finish.
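For reference, a minimal sketch of how such a test can watch the node state from the ResourceManager's point of view via the YarnClient API (the daemon kill/restart itself happens out of band; this is illustrative, not my exact test code):

    import java.util.List;

    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class NodeChurnWatcher {
        public static void main(String[] args) throws Exception {
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new YarnConfiguration());
            yarn.start();

            // Kill the DataNode/NodeManager daemons externally, then wait
            // for the ResourceManager to expire the node (RUNNING -> LOST).
            waitForState(yarn, "host172", NodeState.LOST);

            // Restart the daemons externally, then wait for re-registration.
            waitForState(yarn, "host172", NodeState.RUNNING);

            yarn.stop();
        }

        static void waitForState(YarnClient yarn, String host, NodeState state)
                throws Exception {
            while (true) {
                List<NodeReport> nodes = yarn.getNodeReports(state);
                for (NodeReport node : nodes) {
                    if (node.getNodeId().getHost().equals(host)) {
                        return;
                    }
                }
                Thread.sleep(5000);
            }
        }
    }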
Node removal log:
2015-08-14 11:15:56,902 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:host172:36158 Timed out after 60 secs
2015-08-14 11:15:56,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node host172:36158 as it is now LOST
2015-08-14 11:15:56,904 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: host172:36158 Node Transitioned from RUNNING to LOST
2015-08-14 11:15:56,905 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1439575616861_0001_01_000006 Container Transitioned from RUNNING to KILLED
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1439575616861_0001_01_000006 in state: KILLED event:KILL
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1439575616861_0001 CONTAINERID=container_1439575616861_0001_01_000006
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1439575616861_0001_01_000006 of capacity <memory:1024, vCores:1> on host host172:36158, which currently has 1 containers, <memory:1024, vCores:1> used and <memory:1024, vCores:7> available, release resources=true
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used=<memory:3584, vCores:3> numContainers=3 user=hadoop user-resources=<memory:3584, vCores:3>
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1439575616861_0001_01_000006, NodeId: host172:36158, NodeHttpAddress: host172:8042, Resource: <memory:1024, vCores:1>, Priority: 20, Token: Token { kind: ContainerToken, service: XX.XX.0.2:36158 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:3584, vCores:3>, usedCapacity=1.75, absoluteUsedCapacity=1.75, numApps=1, numContainers=3 cluster=<memory:2048, vCores:8>
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=1.75 absoluteUsedCapacity=1.75 used=<memory:3584, vCores:3> cluster=<memory:2048, vCores:8>
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:3584, vCores:3>, usedCapacity=1.75, absoluteUsedCapacity=1.75, numApps=1, numContainers=3
2015-08-14 11:15:56,906 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1439575616861_0001_000001 released container container_1439575616861_0001_01_000006 on node: host: host172:36158 #containers=1 available=1024 used=1024 with event: KILL
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1439575616861_0001_01_000005 Container Transitioned from RUNNING to KILLED
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1439575616861_0001_01_000005 in state: KILLED event:KILL
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1439575616861_0001 CONTAINERID=container_1439575616861_0001_01_000005
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1439575616861_0001_01_000005 of capacity <memory:1024, vCores:1> on host host172:36158, which currently has 0 containers, <memory:0, vCores:0> used and <memory:2048, vCores:8> available, release resources=true
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used=<memory:2560, vCores:2> numContainers=2 user=hadoop user-resources=<memory:2560, vCores:2>
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1439575616861_0001_01_000005, NodeId: host172:36158, NodeHttpAddress: host172:8042, Resource: <memory:1024, vCores:1>, Priority: 20, Token: Token { kind: ContainerToken, service: XX.XX.0.2:36158 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:2560, vCores:2>, usedCapacity=1.25, absoluteUsedCapacity=1.25, numApps=1, numContainers=2 cluster=<memory:2048, vCores:8>
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=1.25 absoluteUsedCapacity=1.25 used=<memory:2560, vCores:2> cluster=<memory:2048, vCores:8>
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:2560, vCores:2>, usedCapacity=1.25, absoluteUsedCapacity=1.25, numApps=1, numContainers=2
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1439575616861_0001_000001 released container container_1439575616861_0001_01_000005 on node: host: host172:36158 #containers=0 available=2048 used=0 with event: KILL
2015-08-14 11:15:56,907 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Removed node host172:36158 clusterResource: <memory:2048, vCores:8>
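(Side note on the first line above: the 60-second expiry is the NM liveness-monitor timeout; the stock default is 10 minutes, so a setup like this has presumably lowered it. Assuming the standard property name, the equivalent in code would be:)

    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class LowerNmExpiry {
        public static void main(String[] args) {
            // Equivalent to setting yarn.nm.liveness-monitor.expiry-interval-ms
            // in yarn-site.xml: a killed NodeManager is marked LOST after 60s
            // instead of the 600s stock default.
            YarnConfiguration conf = new YarnConfiguration();
            conf.setLong("yarn.nm.liveness-monitor.expiry-interval-ms", 60_000L);
            System.out.println(conf.get("yarn.nm.liveness-monitor.expiry-interval-ms"));
        }
    }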
Node addition log:
2015-08-14 11:19:43,529 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved host172 to /default-rack
2015-08-14 11:19:43,530 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node host172(cmPort: 59426 httpPort: 8042) registered with capability: <memory:2048, vCores:8>, assigned nodeId host172:59426
2015-08-14 11:19:43,533 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: host172:59426 Node Transitioned from NEW to RUNNING
2015-08-14 11:19:43,535 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Added node host172:59426 clusterResource: <memory:4096, vCores:16>
The problem is:
The job never finishes! According to the logs, the map tasks that were scheduled on the removed node are still "RUNNING" with 100% map progress, and they stay that way forever.
In the ApplicationMaster container log, I can see it keeps retrying the connection to the old node address host172/XX.XX.XX.XX:36158, even though that node has been removed and added back under a different port, host172/XX.XX.XX.XX:59426:
......
......
2015-08-14 11:25:21,662 INFO [ContainerLauncher #7] org.apache.hadoop.ipc.Client: Retrying connect to server: host172/XX.XX.XX.XX:36158. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
......
......
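(For context, the RetryUpToMaximumCountWithFixedSleep string in that log line maps to Hadoop's RetryPolicies factory in hadoop-common; a minimal sketch, just to show where the logged policy comes from:)

    import java.util.concurrent.TimeUnit;

    import org.apache.hadoop.io.retry.RetryPolicies;
    import org.apache.hadoop.io.retry.RetryPolicy;

    public class ShowRetryPolicy {
        public static void main(String[] args) {
            // The same policy the ContainerLauncher log reports: retry up
            // to 10 times, sleeping 1000 ms between attempts, before the
            // connect attempt is considered failed.
            RetryPolicy policy = RetryPolicies
                    .retryUpToMaximumCountWithFixedSleep(10, 1000, TimeUnit.MILLISECONDS);
            System.out.println(policy);
        }
    }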
P.S.: The job finishes fine without the dynamic node removal and addition, on the same cluster with the same memory settings.
Answer 0 (score: 0)
This looks like a symptom of a YARN bug in 2.6.0.
After the node is removed, ideally the AppMaster's attempts to reconnect to the NodeManager would eventually fail with a NoRouteToHostException, and the containers launched on that node would be killed and marked as failed.
With this bug, however, the AppMaster-to-NodeManager connection timeouts get retried at multiple levels, so the job looks stuck even though all of its containers have succeeded.
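A back-of-envelope sketch of why retries at multiple levels multiply the stall (all outer counts below are hypothetical, for illustration only; the JIRA below has the real details):

    public class NestedRetryMath {
        public static void main(String[] args) {
            // Inner IPC-level policy, as in the log: 10 retries x 1 s sleep.
            int ipcRetries = 10;
            long ipcSleepMs = 1_000;

            // Hypothetical outer layers that each re-drive the whole inner
            // loop (proxy-level retries, launcher-level retries, ...).
            int proxyRetries = 45;   // assumed value, for illustration only
            int launcherRetries = 3; // assumed value, for illustration only

            long worstCaseMs = launcherRetries * proxyRetries * ipcRetries * ipcSleepMs;
            System.out.printf("worst-case stall ~ %d s%n", worstCaseMs / 1000);
            // With these numbers: 3 * 45 * 10 * 1 s = 1350 s (~22 minutes)
            // per container, which is why the job appears stuck.
        }
    }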
https://issues.apache.org/jira/browse/YARN-3238
This is fixed in the Hadoop 2.7.0 release.