I am using Spark version 1.6.3 and YARN version 2.7.1.2.3, which ships with HDP-2.3.0.0-2557. Because the Spark version bundled with my HDP distribution is too old, I prefer to use a different Spark installation remotely, in yarn mode.
Here is how I run the spark-shell:
./spark-shell --master yarn-client
Everything seems fine: sparkContext is initialized, sqlContext is initialized, and I can even access my Hive tables. But in some cases it gets into trouble when it tries to connect to the block managers.

I am not an expert, but I think the block managers are running on my YARN cluster when I run it in yarn mode. At first this looked like a network problem to me, and I did not want to ask about it here. However, it happens in some cases that I cannot figure out, which makes me think it might not be a network problem after all.
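For reference, the block manager addresses that the driver has registered can be listed from the shell (a quick diagnostic sketch using the standard SparkContext API, added here only for illustration):

sc.getExecutorMemoryStatus.keys.foreach(println)  // prints the "host:port" each block manager registered with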
Here is the code:
def df = sqlContext.sql("select * from city_table")
The following works fine:
df.limit(10).count()
But when the size is over 10 (I do not know the exact threshold; it changes with every run):
df.count()
it throws this exception:
16/12/30 07:31:04 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 2 is 157 bytes
16/12/30 07:31:19 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 8, 172.27.247.204): FetchFailed(BlockManagerId(2, 172.27.247.204, 56093), shuffleId=2, mapId=0, reduceId=0, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to /172.27.247.204:56093
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:504)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:686)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to /172.27.247.204:56093
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Caused by: java.net.ConnectException: Connection refused: /172.27.247.204:56093
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
)
I have noticed that this happens when there is more than one task to shuffle.

What is the problem here: is it a performance issue, or another network problem that I cannot see? What is that shuffle? If it is a network problem, is it between my Spark and YARN, or a problem within YARN itself?

Thanks.

I have just seen something in the logs:
17/01/02 06:45:17 INFO DAGScheduler: Executor lost: 2 (epoch 13)
17/01/02 06:45:17 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
17/01/02 06:45:17 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, 172.27.247.204, 51809)
17/01/02 06:45:17 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
17/01/02 06:45:17 INFO YarnScheduler: Removed TaskSet 3.0, whose tasks have all completed, from pool
17/01/02 06:45:24 INFO BlockManagerMasterEndpoint: Registering block manager 172.27.247.204:51809 with 511.1 MB RAM, BlockManagerId(2, 172.27.247.204, 51809)
Sometimes retrying it on another block manager works, but because the maximum number of allowed attempts, which defaults to 4, gets exceeded, most of the time it never finishes at all.
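As a stopgap, the retry limits can be raised when launching the shell; these are standard Spark settings (spark.task.maxFailures is the one whose default of 4 is mentioned above), and raising them only works around the symptom, it does not fix the connectivity problem:

./spark-shell --master yarn-client \
  --conf spark.task.maxFailures=8 \
  --conf spark.shuffle.io.maxRetries=6 \
  --conf spark.shuffle.io.retryWait=10s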
YARN is really silent about all of this, but I still think it is a network issue, and I can narrow the problem down to this:

This Spark is deployed outside of the HDP environment. When Spark submits an application to YARN, YARN informs the Spark driver about the block managers and executors. The executors are data nodes in the HDP cluster, and each has a different IP on the cluster's private network. But when it comes to informing a Spark driver outside the cluster, YARN reports the same single IP for all executors, because every node in the HDP cluster goes out through a router and shares that IP. Say this IP is 150.150.150.150: when the Spark driver needs to connect to an executor and ask it for something, it tries this IP. But this IP is really the external IP address of the whole cluster, not the IP of an individual data node.
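For what it is worth, the advertised address from the exception can be probed from the driver machine to confirm the port really is unreachable there (assuming netcat is available; the address and port are the ones reported in the FetchFailedException above):

nc -vz 172.27.247.204 56093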
Is there a way to make YARN report the executors (block managers) by their private IPs? Their private IPs are reachable from the machine this Spark driver is running on as well.
Answer 0 (score: 5):
A FetchFailedException is thrown when a reducer task (for a ShuffleDependency) could not fetch shuffle blocks. It usually means that the executor (with the BlockManager for the shuffle blocks) died, hence the exception:
Caused by: java.io.IOException: Failed to connect to /172.27.247.204:56093
The executor could have been OOMed (i.e. an OutOfMemoryError was thrown) or YARN decided to kill it for excessive memory usage.

You should review the logs of the Spark application using the yarn logs command and find out the root cause of the issue:
yarn logs -applicationId <application ID> [options]
You can also review the status of your Spark application's executors in the Executors tab of the web UI.
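If the logs do point at memory pressure, a hedged starting point is to give the executors more heap and off-heap overhead when launching the shell (the values below are illustrative, not tuned for this cluster):

./spark-shell --master yarn-client \
  --executor-memory 4g \
  --conf spark.yarn.executor.memoryOverhead=1024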
Spark usually recovers from a FetchFailedException by re-running the affected tasks. Use the web UI to see how your Spark application performs. A FetchFailedException could also be due to a temporary memory "hiccup".
Answer 1 (score: 1):