Spark & HBase: java.io.IOException: Connection reset by peer

Asked: 2018-04-28 08:51:06

Tags: scala apache-spark hbase

Any help would be greatly appreciated.

While running a Spark Streaming job from Kafka to HBase (code attached), we hit the error "java.io.IOException: Connection reset by peer" (full log attached).

The problem occurs only when we write to HBase with dynamic allocation enabled in the Spark settings. If we write the data to HDFS (a Hive table) instead of HBase, or if we turn dynamic allocation off, the error does not appear.

We have tried changing the ZooKeeper connection, the Spark executor idle timeout, and the network timeout. We have also tried switching the shuffle block transfer service (NIO), but the error persists. If we cap the min/max executors for dynamic allocation below 80, there is also no problem.
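For reference, the tuning attempts described above might look like the following configuration sketch (the specific values and the executor cap of 64 are illustrative assumptions, not from the original post):

```scala
import org.apache.spark.SparkConf

// Sketch of the settings the post says were tried (values are illustrative):
val conf = new SparkConf()
  .set("spark.dynamicAllocation.executorIdleTimeout", "120s") // executor idle timeout
  .set("spark.network.timeout", "600s")                       // network timeout
  .set("spark.dynamicAllocation.minExecutors", "8")           // capping max executors
  .set("spark.dynamicAllocation.maxExecutors", "64")          //   below 80 avoided the error
```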

What could the problem be? There are many nearly identical issues on Jira and Stack Overflow, but nothing has helped.

Versions:

HBase 1.2.0-cdh5.14.0
Kafka  3.0.0-1.3.0.0.p0.40
SPARK2 2.2.0.cloudera2-1.cdh5.12.0.p0.232957
hbase-client/hbase-spark(org.apache.hbase) 1.2.0-cdh5.11.1

Spark settings:

--num-executors=80
--conf spark.sql.shuffle.partitions=200
--conf spark.driver.memory=32g
--conf spark.executor.memory=32g
--conf spark.executor.cores=4

Cluster: 1 + 8 nodes, 70 CPUs, 755 GB RAM, 10x HDD

Log:

18/04/09 13:51:56 INFO cluster.YarnClusterScheduler: Executor 717 on lang32.ca.sbrf.ru killed by driver.
18/04/09 13:51:56 INFO storage.BlockManagerMaster: Removed 717 successfully in removeExecutor
18/04/09 13:51:56 INFO spark.ExecutorAllocationManager: Existing executor 717 has been removed (new total is 26)
18/04/09 13:51:56 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 705.
18/04/09 13:51:56 INFO scheduler.DAGScheduler: Executor lost: 705 (epoch 45)
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 705 from BlockManagerMaster.
18/04/09 13:51:56 INFO cluster.YarnClusterScheduler: Executor 705 on lang32.ca.sbrf.ru killed by driver.
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(705, lang32.ca.sbrf.ru, 22805, None)
18/04/09 13:51:56 INFO spark.ExecutorAllocationManager: Existing executor 705 has been removed (new total is 25)
18/04/09 13:51:56 INFO storage.BlockManagerMaster: Removed 705 successfully in removeExecutor
18/04/09 13:51:56 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 716.
18/04/09 13:51:56 INFO scheduler.DAGScheduler: Executor lost: 716 (epoch 45)
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 716 from BlockManagerMaster.
18/04/09 13:51:56 INFO cluster.YarnClusterScheduler: Executor 716 on lang32.ca.sbrf.ru killed by driver.
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(716, lang32.ca.sbrf.ru, 28678, None)
18/04/09 13:51:56 INFO spark.ExecutorAllocationManager: Existing executor 716 has been removed (new total is 24)
18/04/09 13:51:56 INFO storage.BlockManagerMaster: Removed 716 successfully in removeExecutor
18/04/09 13:51:56 WARN server.TransportChannelHandler: Exception in connection from /10.116.173.65:57542
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:748)
18/04/09 13:51:56 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from /10.116.173.65:57542 is closed
18/04/09 13:51:56 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 548.

2 Answers:

Answer 0 (score 0):

Please see my related answer here: What are possible reasons for receiving TimeoutException: Futures timed out after [n seconds] when working with Spark

I also spent some time understanding why Cloudera states the following:

    Dynamic Allocation and Spark Streaming

    If you are using Spark Streaming, Cloudera recommends that you disable dynamic allocation by setting spark.dynamicAllocation.enabled to false when running streaming applications.

Reference: https://www.cloudera.com/documentation/spark2/latest/topics/spark2_known_issues.html#ki_dynamic_allocation_streaming
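Following that recommendation, disabling dynamic allocation for a streaming job comes down to one setting; a minimal sketch (the fixed executor count of 80 is taken from the question's own `--num-executors` value, the rest is assumed):

```scala
import org.apache.spark.SparkConf

// Per Cloudera's recommendation for Spark Streaming: turn dynamic
// allocation off and fall back to a fixed number of executors.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.executor.instances", "80") // equivalent to --num-executors=80
```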

Answer 1 (score 0):

Try setting these two parameters. Also, before writing to HBase, try caching the pa_foobar DataFrame.

spark.network.timeout
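A minimal sketch of that suggestion, assuming a DataFrame like the answer's `pa_foobar` (the session setup, the timeout value, and the write helper are assumptions; the actual HBase write call depends on the connector in use, e.g. hbase-spark):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.storage.StorageLevel

// Raise the network timeout before the job starts (default is 120s).
val spark = SparkSession.builder()
  .appName("kafka-to-hbase")
  .config("spark.network.timeout", "800s")
  .getOrCreate()

def writeToHBase(paFoobar: DataFrame): Unit = {
  // Cache so the HBase write does not recompute the Kafka-sourced lineage.
  val cached = paFoobar.persist(StorageLevel.MEMORY_AND_DISK)
  cached.count() // materialize the cache before the write
  // ... write `cached` to HBase via your connector (e.g. hbase-spark) ...
  cached.unpersist()
}
```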