Extra containers appear when running Spark on YARN

Time: 2018-01-16 06:56:46

Tags: hadoop apache-spark yarn

I am trying to run PySpark on YARN with the following command:
PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master yarn --deploy-mode client --num-executors 5
I expected to get 6 containers, including one for the AM, but I actually got 8.

Here is the AM log:

18/01/16 13:40:20 INFO client.RMProxy: Connecting to ResourceManager at master/192.180.3.43:8030
18/01/16 13:40:20 INFO yarn.YarnRMClient: Registering the ApplicationMaster
18/01/16 13:40:20 INFO yarn.YarnAllocator: Will request 5 executor container(s), each with 1 core(s) and 1408 MB memory (including 384 MB of overhead)
18/01/16 13:40:20 INFO yarn.YarnAllocator: Submitted 5 unlocalized container requests.
18/01/16 13:40:20 INFO yarn.ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
18/01/16 13:40:20 INFO impl.AMRMClientImpl: Received new token for : slave5:58817
18/01/16 13:40:20 INFO yarn.YarnAllocator: Launching container container_1516011096691_0014_01_000002 on host slave5 for executor with ID 1
18/01/16 13:40:20 INFO yarn.YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
18/01/16 13:40:20 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
18/01/16 13:40:20 INFO impl.ContainerManagementProtocolProxy: Opening proxy : slave5:58817
18/01/16 13:40:20 INFO impl.AMRMClientImpl: Received new token for : slave1:2917
18/01/16 13:40:20 INFO impl.AMRMClientImpl: Received new token for : slave6:13029
18/01/16 13:40:20 INFO yarn.YarnAllocator: Launching container container_1516011096691_0014_01_000003 on host slave1 for executor with ID 2
18/01/16 13:40:20 INFO yarn.YarnAllocator: Launching container container_1516011096691_0014_01_000004 on host slave6 for executor with ID 3
18/01/16 13:40:20 INFO yarn.YarnAllocator: Received 2 containers from YARN, launching executors on 2 of them.
18/01/16 13:40:20 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
18/01/16 13:40:20 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
18/01/16 13:40:20 INFO impl.ContainerManagementProtocolProxy: Opening proxy : slave1:2917
18/01/16 13:40:20 INFO impl.ContainerManagementProtocolProxy: Opening proxy : slave6:13029
18/01/16 13:40:20 INFO yarn.YarnAllocator: Will request 1 executor container(s), each with 1 core(s) and 1408 MB memory (including 384 MB of overhead)
18/01/16 13:40:20 INFO yarn.YarnAllocator: Submitted 1 unlocalized container requests.
18/01/16 13:40:20 INFO impl.AMRMClientImpl: Received new token for : slave7:2725
18/01/16 13:40:20 INFO impl.AMRMClientImpl: Received new token for : slave2:31368
18/01/16 13:40:20 INFO yarn.YarnAllocator: Launching container container_1516011096691_0014_01_000005 on host slave7 for executor with ID 4
18/01/16 13:40:20 INFO yarn.YarnAllocator: Launching container container_1516011096691_0014_01_000006 on host slave2 for executor with ID 5
18/01/16 13:40:20 INFO yarn.YarnAllocator: Received 2 containers from YARN, launching executors on 2 of them.
18/01/16 13:40:20 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
18/01/16 13:40:20 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
18/01/16 13:40:20 INFO impl.ContainerManagementProtocolProxy: Opening proxy : slave7:2725
18/01/16 13:40:20 INFO impl.ContainerManagementProtocolProxy: Opening proxy : slave2:31368
18/01/16 13:40:21 INFO yarn.YarnAllocator: Will request 1 executor container(s), each with 1 core(s) and 1408 MB memory (including 384 MB of overhead)
18/01/16 13:40:21 INFO yarn.YarnAllocator: Submitted 1 unlocalized container requests.
18/01/16 13:40:21 INFO impl.AMRMClientImpl: Received new token for : slave4:32598
18/01/16 13:40:21 INFO impl.AMRMClientImpl: Received new token for : slave3:61485
18/01/16 13:40:21 INFO yarn.YarnAllocator: Launching container container_1516011096691_0014_01_000007 on host slave4 for executor with ID 6
18/01/16 13:40:21 INFO yarn.YarnAllocator: Launching container container_1516011096691_0014_01_000008 on host slave3 for executor with ID 7
18/01/16 13:40:21 INFO yarn.YarnAllocator: Received 3 containers from YARN, launching executors on 2 of them.
18/01/16 13:40:21 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
18/01/16 13:40:21 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
18/01/16 13:40:21 INFO impl.ContainerManagementProtocolProxy: Opening proxy : slave4:32598
18/01/16 13:40:21 INFO impl.ContainerManagementProtocolProxy: Opening proxy : slave3:61485
18/01/16 13:40:24 INFO yarn.YarnAllocator: Received 2 containers from YARN, launching executors on 0 of them.
18/01/16 13:43:20 ERROR yarn.YarnAllocator: Failed to launch executor 3 on container container_1516011096691_0014_01_000004
org.apache.spark.SparkException: Exception while starting container container_1516011096691_0014_01_000004 on host slave6
    at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:125)
    at org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:65)
    at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1$$anon$1.run(YarnAllocator.scala:523)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.NoRouteToHostException: No Route to Host from  slave4/192.180.5.19 to slave6:13029 failed on socket timeout exception: java.net.NoRouteToHostException: 没有到主机的路由; For more details see:  http://wiki.apache.org/hadoop/NoRouteToHost
    at sun.reflect.GeneratedConstructorAccessor10.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:758)
    at org.apache.hadoop.ipc.Client.call(Client.java:1479)
    at org.apache.hadoop.ipc.Client.call(Client.java:1412)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy17.startContainers(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy18.startContainers(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:201)
    at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:122)
    ... 5 more
Caused by: java.net.NoRouteToHostException: 没有到主机的路由
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
    at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
    at org.apache.hadoop.ipc.Client.call(Client.java:1451)
    ... 17 more
18/01/16 13:43:20 ERROR yarn.YarnAllocator: Failed to launch executor 4 on container container_1516011096691_0014_01_000005
org.apache.spark.SparkException: Exception while starting container container_1516011096691_0014_01_000005 on host slave7
    at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:125)
    at org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:65)
    at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1$$anon$1.run(YarnAllocator.scala:523)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.NoRouteToHostException: No Route to Host from  slave4/192.180.5.19 to slave7:2725 failed on socket timeout exception: java.net.NoRouteToHostException: 没有到主机的路由; For more details see:  http://wiki.apache.org/hadoop/NoRouteToHost
    at sun.reflect.GeneratedConstructorAccessor10.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:758)
    at org.apache.hadoop.ipc.Client.call(Client.java:1479)
    at org.apache.hadoop.ipc.Client.call(Client.java:1412)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy17.startContainers(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy18.startContainers(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:201)
    at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:122)
    ... 5 more
Caused by: java.net.NoRouteToHostException: 没有到主机的路由
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
    at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
    at org.apache.hadoop.ipc.Client.call(Client.java:1451)
    ... 17 more
18/01/16 13:43:21 INFO yarn.YarnAllocator: Completed container container_1516011096691_0014_01_000004 (state: COMPLETE, exit status: -100)
18/01/16 13:43:21 WARN yarn.YarnAllocator: Container marked as failed: container_1516011096691_0014_01_000004. Exit status: -100. Diagnostics: Container released by application
18/01/16 13:43:21 INFO yarn.YarnAllocator: Completed container container_1516011096691_0014_01_000005 (state: COMPLETE, exit status: -100)
18/01/16 13:43:21 WARN yarn.YarnAllocator: Container marked as failed: container_1516011096691_0014_01_000005. Exit status: -100. Diagnostics: Container released by application
18/01/16 13:43:24 INFO yarn.YarnAllocator: Will request 2 executor container(s), each with 1 core(s) and 1408 MB memory (including 384 MB of overhead)
18/01/16 13:43:24 INFO yarn.YarnAllocator: Submitted 2 unlocalized container requests.
18/01/16 13:43:25 INFO yarn.YarnAllocator: Launching container container_1516011096691_0014_01_000012 on host slave3 for executor with ID 8
18/01/16 13:43:25 INFO yarn.YarnAllocator: Launching container container_1516011096691_0014_01_000013 on host slave5 for executor with ID 9
18/01/16 13:43:25 INFO yarn.YarnAllocator: Received 2 containers from YARN, launching executors on 2 of them.
18/01/16 13:43:25 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
18/01/16 13:43:25 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
18/01/16 13:43:25 INFO impl.ContainerManagementProtocolProxy: Opening proxy : slave5:58817
18/01/16 13:43:25 INFO impl.ContainerManagementProtocolProxy: Opening proxy : slave3:61485

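It seems the AM requested containers from several slaves, and the launches on slave6 and slave7 failed, so the AM abandoned the containers on those two hosts. At that point the application had 5 containers. But after marking the slave6 and slave7 containers as failed, yarn.YarnAllocator launched two more containers on slave3 and slave5, so two extra containers were allocated to the application.
How can I avoid these extra executors when some nodes in the cluster are faulty?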

1 Answer:

Answer 0 (score: 0)

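You requested 5 executors. The attempts to start 2 of them, on slave6 and slave7, failed because of a network problem (probably name resolution) as seen from slave4:

  No Route to Host from slave4/192.180.5.19 to slave7:2725 failed on socket timeout exception

  No Route to Host from slave4/192.180.5.19 to slave6:13029 failed on socket timeout exception

So two more executors were started to reach the requested 5. This is normal behavior. Check name resolution for slave6 and slave7 on slave4.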
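
One way to check this from slave4 is a small Python script that first resolves the host name and then tries a plain TCP connection to the NodeManager port. This is only a sketch: the function name check_endpoint and the 3-second timeout are illustrative choices, not part of Spark or YARN. The ports 13029 and 2725 come from the log above and may change if the NodeManagers are restarted.

```python
import socket


def check_endpoint(host, port, timeout=3.0):
    """Resolve host, then try a TCP connect to host:port.

    Returns (resolved_ip_or_None, reachable, detail).
    Hypothetical helper for illustration only.
    """
    try:
        # Step 1: name resolution, as the local resolver (/etc/hosts, DNS) sees it.
        ip = socket.gethostbyname(host)
    except socket.gaierror as exc:
        return None, False, "name resolution failed: %s" % exc
    try:
        # Step 2: TCP connect; a routing or firewall problem surfaces here as OSError.
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True, "connected"
    except OSError as exc:
        return ip, False, "connect failed: %s" % exc
```

Run on slave4, check_endpoint("slave6", 13029) and check_endpoint("slave7", 2725) would show whether each name resolves to the address you expect and whether the port is reachable. If the name resolves but the connection fails, the cause is more likely routing or a firewall than name resolution.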