I'm hitting random occurrences of shuffle files that apparently were never written by Spark:
15/12/29 17:30:26 ERROR server.TransportRequestHandler: Error sending result
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=347837678000, chunkIndex=0},
buffer=FileSegmentManagedBuffer{file=/data/24/hadoop/yarn/local/usercache/root/appcache
/application_1451416375261_0032/blockmgr-c2e951bb-856d-487f-a5be-2b3194fdfba6/1a/
shuffle_0_35_0.data, offset=1088736267, length=8082368}}
to /10.7.230.74:42318; closing connection
java.io.FileNotFoundException:
/data/24/hadoop/yarn/local/usercache/root/appcache/application_1451416375261_0032/
blockmgr-c2e951bb-856d-487f-a5be-2b3194fdfba6/1a/shuffle_0_35_0.data
(No such file or directory)
at java.io.FileInputStream.open0(Native Method)
...
It seems that most of the shuffle files are written successfully, but not all of them.
This happens in the shuffle stage, i.e. the "read shuffle files" phase.
At first, all executors are able to read the files. Eventually, and inevitably, one of the executors throws the exception above and is removed. All the others then start failing because they can no longer fetch that executor's shuffle files.
I have 40 GB of RAM on each executor, and I have 8 executors. The extra one shown in the list is there because an executor was removed after it failed. My data is large, but I'm not seeing any out-of-memory issues.
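For reference, the relevant executor settings look roughly like this (a sketch only: the app name is a placeholder, and only the memory and executor count match what I described above):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: the app name is a placeholder; 8 executors with 40 GB each match the setup above.
val conf = new SparkConf()
  .setAppName("shuffle-heavy-job")        // placeholder
  .set("spark.executor.instances", "8")   // 8 executors on YARN
  .set("spark.executor.memory", "40g")    // 40 GB per executor
val sc = new SparkContext(conf)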
Any ideas?
Update: I changed my repartition call from 1000 partitions to 100000 partitions, and I'm now getting a new stack trace.
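The change itself was roughly the following (a minimal sketch only: df and the sort column are placeholders for my actual pipeline; the partition count is the only thing I changed):

import org.apache.spark.sql.DataFrame

// Sketch only: "df" and "key" are placeholders; the real change is 1000 -> 100000 partitions.
def repartitionAndSort(df: DataFrame): DataFrame = {
  df.repartition(100000)   // was df.repartition(1000)
    .sort("key")           // the sort whose shuffle read now fails
}

With 100000 partitions, the failure looks like this: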
Job aborted due to stage failure: Task 71 in stage 9.0 failed 4 times, most recent failure: Lost task 71.3 in stage 9.0 (TID 2831, dev-node1): java.io.IOException: FAILED_TO_UNCOMPRESS(5)
at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
at org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:135)
at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:92)
at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58)
at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159)
at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1179)
at org.apache.spark.shuffle.hash.HashShuffleReader$$anonfun$3.apply(HashShuffleReader.scala:53)
at org.apache.spark.shuffle.hash.HashShuffleReader$$anonfun$3.apply(HashShuffleReader.scala:52)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:173)
at org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$executePartition$1(sort.scala:160)
at org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
at org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$4.apply(sort.scala:169)
at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
...