Connection to external shuffle service idle for more than 120 seconds while there are outstanding requests

Date: 2018-07-02 07:06:21

Tags: amazon-web-services apache-spark yarn

I am running a Spark job on YARN. The job runs fine on Amazon EMR (1 master and 2 slave nodes, m4.xlarge instances).

I set up a similar infrastructure on AWS EC2 machines using the HDP 2.6 distribution. However, the Spark job gets stuck at one particular stage, and after a while I get the following error in the container logs. The main error appears to be the shuffle service connection going idle.


18/06/25 07:15:31 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@10.210.150.150:44343)
18/06/25 07:15:31 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 9, fetching them
18/06/25 07:15:31 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 9, fetching them
18/06/25 07:15:31 INFO spark.MapOutputTrackerWorker: Got the output locations
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Getting 5 non-empty blocks out of 1000 blocks
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Started 1 remote fetches in 0 ms
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Getting 5 non-empty blocks out of 1000 blocks
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Getting 5 non-empty blocks out of 1000 blocks
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Started 1 remote fetches in 0 ms
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Getting 5 non-empty blocks out of 1000 blocks
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Started 1 remote fetches in 1 ms
18/06/25 07:15:31 INFO codegen.CodeGenerator: Code generated in 4.822611 ms
18/06/25 07:15:31 INFO codegen.CodeGenerator: Code generated in 8.430244 ms
18/06/25 07:17:31 ERROR server.TransportChannelHandler: Connection to ip-10-210-150-180.********/10.210.150.180:7447 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
18/06/25 07:17:31 ERROR client.TransportResponseHandler: Still have 307 requests outstanding when connection from ip-10-210-150-180.********/10.210.150.180:7447 is closed
18/06/25 07:17:31 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3) for 197 outstanding blocks after 5000 ms
18/06/25 07:17:31 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection from ip-10-210-150-180.********/10.210.150.180:7447 closed
    at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
    at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:278)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:748)
18/06/25 07:17:31 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3) for 166 outstanding blocks after 5000 ms
18/06/25 07:17:31 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection from ip-10-210-150-180.********/10.210.150.180:7447 closed
    at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
    at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:278)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:748)

I am currently running Spark on the YARN cluster with the following spark-defaults configuration:

spark.eventLog.dir=hdfs:///user/spark/applicationHistory
spark.eventLog.enabled=true
spark.yarn.historyServer.address=ppv-qa12-tenant8-spark-cluster-master.periscope-solutions.local:18080
spark.shuffle.service.enabled=true
spark.dynamicAllocation.enabled=true
spark.driver.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.executor.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.driver.maxResultSize=0
spark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.executor.memory=5g
spark.driver.memory=1g
spark.executor.cores=4
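For context, the 120000 ms in the error corresponds to the spark.network.timeout default (120s), and the error text itself suggests adjusting it. A commonly tried first mitigation is to raise the network timeout and give shuffle fetches more retry headroom in spark-defaults.conf; the values below are illustrative assumptions, not settings from my actual configuration:

```
# Illustrative values only, not part of the configuration above.
# spark.network.timeout is the default for most network-related timeouts.
spark.network.timeout=600s
# Retry shuffle fetches more times, with a longer wait between attempts.
spark.shuffle.io.maxRetries=10
spark.shuffle.io.retryWait=30s
```

(As the answer below shows, in this case the root cause was packet loss rather than a timeout that was genuinely too short, so raising these values alone would not have fixed the job.)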

I have the following settings in yarn-site.xml on the NodeManagers of the slave machines:

<configuration>
  <property>
    <name>yarn.application.classpath</name>
    <value>/usr/hdp/current/spark2-client/aux/*,/etc/hadoop/conf,/usr/hdp/current/hadoop-client/*,/usr/hdp/current/hadoop-client/lib/*,/usr/hdp/current/hadoop-hdfs-client/*,/usr/hdp/current/hadoop-hdfs-client/lib/*,/usr/hdp/current/hadoop-yarn-client/*,/usr/hdp/current/hadoop-yarn-client/lib/*</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>spark2_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark2_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>
  <property>
    <name>yarn.nodemanager.container-manager.thread-count</name>
    <value>64</value>
  </property>
  <property>
    <name>yarn.nodemanager.localizer.client.thread-count</name>
    <value>20</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>5</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>************</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.client.thread-count</name>
    <value>64</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.client.thread-count</name>
    <value>64</value>
  </property>
  <property>
    <name>yarn.scheduler.increment-allocation-mb</name>
    <value>32</value>
  </property>
  <property>
    <name>yarn.scheduler.increment-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>128</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>32</value>
  </property>
  <property>
    <name>yarn.timeline-service.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>11520</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>11520</value>
  </property>
  <property>
    <name>yarn.nodemanager.hostname</name>
    <value>*************</value>
  </property>
</configuration>

Edit: Through some network debugging, I found that the port opened by the container to connect to the shuffle service was actively refusing connections (telnet throws an error immediately).
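The telnet check above can also be scripted, so the same probe can be run against every NodeManager in one pass. A minimal sketch; the host and port in the example call are placeholders for a NodeManager address and its shuffle service port, not values confirmed from this cluster:

```python
import socket

def shuffle_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds,
    False if it is refused or times out (like the telnet check)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers ConnectionRefusedError and socket.timeout
        return False

# Example (placeholder host/port -- substitute your NodeManager and shuffle port):
# print(shuffle_port_open("10.210.150.180", 7447))
```

A loop over the cluster's worker hostnames would quickly show whether the refusal is specific to one node or cluster-wide.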

1 Answer:

Answer 0 (score: 1):

While going through the kernel and system activity logs, we found the following issue in /var/log/messages:

xen_netfront: xennet: skb rides the rocket: 19 slots

This means our AWS EC2 machines were dropping network packets.

Data transfer between the containers and the shuffle service happens via RPC calls (ChunkFetchRequest, ChunkFetchSuccess and ChunkFetchFailure), and those RPC calls were being suppressed by the network.

More information about this log message can be found in the following post:

http://www.brendangregg.com/blog/2014-09-11/perf-kernel-line-tracing.html

The log message means we exceeded the maximum number of packet buffers that fit in the driver's ring buffer queue (16 slots), and those SKBs were lost.

Scatter-gather collects multiple responses and sends them as a single response, which in turn increases the SKB size.

So we turned off the scatter-gather feature with the following command:

sudo ethtool -K eth0 sg off

After that, there was no more packet loss.
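If you apply the same workaround, it is worth confirming that the offload flag actually changed and that drops have stopped. A sketch, assuming the interface is eth0 as in the command above:

```shell
# Should now report "scatter-gather: off"
ethtool -k eth0 | grep scatter-gather

# Interface-level error/drop counters should stop increasing
ip -s link show eth0
```

Note that ethtool settings set with -K are not persistent across reboots; they have to be reapplied through your distribution's network configuration scripts.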

Performance was also similar to what we saw on EMR.