I'm running a Spark job that hits the exception below. Every run gets stuck at the same stage, which is a SQL query. I don't see any other exception in either the driver or the executor logs.
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:748)
This exception is wrapped in the following error:
ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from hostname.domain.com/ip is closed
The only other thing I can find in the executor logs is:
INFO memory.TaskMemoryManager: Memory used in task 12302
INFO memory.TaskMemoryManager: Acquired by org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@462e08e3: 32.0 MB
INFO memory.TaskMemoryManager: Acquired by org.apache.spark.unsafe.map.BytesToBytesMap@41bed570: 2.4 GB
INFO memory.TaskMemoryManager: 0 bytes of memory were used by task 12302 but are not associated with specific consumers
INFO memory.TaskMemoryManager: 2634274570 bytes of memory are used for execution and 1826540 bytes of memory are used for storage
INFO sort.UnsafeExternalSorter: Thread 197 spilling sort data of 512.0 MB to disk (0 time so far)
But I don't believe this is a memory problem: the job completes successfully in a different environment with the same volume of data.
Here is my spark-submit:
spark-submit --master yarn-cluster \
--conf spark.speculation=true \
--conf spark.default.parallelism=200 \
--conf spark.executor.memory=16G \
--conf spark.memory.storageFraction=0.3 \
--conf spark.executor.cores=5 \
--conf spark.driver.memory=2G \
--conf spark.driver.cores=4 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.initialExecutors=10 \
--conf spark.yarn.executor.memoryOverhead=1638 \
--conf spark.driver.maxResultSize=1G \
--conf spark.sql.autoBroadcastJoinThreshold=-1 \
--class com.test.TestClass Test.jar
I have read a few posts here and there about similar exceptions that point to increasing the heartbeat interval and the network timeout, but I haven't been able to find a definitive answer.
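For context, the settings those posts refer to are spark.network.timeout and spark.executor.heartbeatInterval. A sketch of how they would be added to the spark-submit above (the values are illustrative, and spark.executor.heartbeatInterval must stay well below spark.network.timeout):

--conf spark.network.timeout=600s \
--conf spark.executor.heartbeatInterval=60s \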
How do I get this job to run successfully?
Answer 0 (score: 0)
This was caused by a data issue.
The driving table of all the left joins had empty strings '' as data in one of the columns used to join to another table. Similarly, the other table also had a lot of empty strings in that particular column.
This effectively turned into a cross join on the empty-string key: if, say, 10,000 rows on each side carry '', the join emits 10,000 × 10,000 = 100 million rows for that key alone, which is why the job hung indefinitely on the excessive row count.
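One quick way to confirm this kind of skew is to count rows per join key on each side; a sketch against the right table (same table and column names as the query further down):

SELECT PROBLEMATIC_COLUMN, COUNT(*) AS CNT
FROM RIGHT_TABLE
GROUP BY PROBLEMATIC_COLUMN
ORDER BY CNT DESC
LIMIT 10

An exploding key such as '' shows up at the top of this list.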
Adding a filter on the right table solved the problem (because this is a LEFT JOIN, filtering only the right side still preserves every left-side row):
SELECT *
FROM LEFT_TABLE LT
LEFT JOIN (
    SELECT *
    FROM RIGHT_TABLE
    WHERE LENGTH(TRIM(PROBLEMATIC_COLUMN)) <> 0
) RT
ON LT.PROBLEMATIC_COLUMN = RT.PROBLEMATIC_COLUMN
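If rows with an empty join key aren't needed in the output at all, the same predicate can also be pushed to the left table, which shrinks the shuffle further. A sketch under that assumption (note this changes the LEFT JOIN semantics by dropping those left-side rows):

SELECT *
FROM (
    SELECT *
    FROM LEFT_TABLE
    WHERE LENGTH(TRIM(PROBLEMATIC_COLUMN)) <> 0
) LT
LEFT JOIN (
    SELECT *
    FROM RIGHT_TABLE
    WHERE LENGTH(TRIM(PROBLEMATIC_COLUMN)) <> 0
) RT
ON LT.PROBLEMATIC_COLUMN = RT.PROBLEMATIC_COLUMN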