I am using pyspark-2.4.0, and a large job keeps crashing with the following error message (either when saving to Parquet or when trying to collect results):
    py4j.protocol.Py4JJavaError: An error occurred while calling o2495.collectToPython.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 184 in stage 290.0 failed 4 times, most recent failure: Lost task 184.3 in stage 290.0 (TID 17345, 53.62.154.250, executor 5): org.xerial.snappy.SnappyIOException: [EMPTY_INPUT] Cannot decompress empty stream
        at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:94)
        at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:59)
        at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:164)
        at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:163)
        at org.apache.spark.serializer.SerializerManager.dataDeserializeStream(SerializerManager.scala:209)
        at org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:698)
        at org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:696)
        at scala.Option.map(Option.scala:146)
        at org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:696)
        at org.apache.spark.storage.BlockManager.get(BlockManager.scala:820)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:875)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
My problem is that I have no idea which operation is causing this. The error message gives no hint, and the stack trace does not contain any of my own code.
Any ideas what could cause this, or how I can pinpoint where the job keeps failing?
Answer 0 (score: 1)
The gist of it is:

Spark 2.4 uses snappy-java 1.1.7.x, whose behavior differs from the 1.1.2.x version used in Spark 2.0.x. SnappyOutputStream in 1.1.2.x always writes a snappy header whether or not any value is written, but SnappyOutputStream in 1.1.7.x does not generate a header if you never write a value into it. So in Spark 2.4, if an RDD caches an empty value, the MemoryStore ends up caching zero bytes (no snappy header), and reading that block back throws the EMPTY_INPUT error.
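If the trigger really is an empty cached block going through snappy, one mitigation that avoids downgrading Spark is to switch the internal compression codec away from snappy. Below is a minimal sketch, assuming the codec is not pinned elsewhere in your cluster configuration and that lz4 is acceptable for your workload; spark.io.compression.codec controls compression of internal data such as cached RDD blocks and shuffle outputs, and in Spark 2.4 accepts lz4, lzf, snappy, or zstd. The app name is only illustrative, and this is a workaround suggestion, not a confirmed fix for this particular job.

    from pyspark.sql import SparkSession

    # Sketch: steer Spark's internal block compression away from snappy so that
    # empty cached blocks are no longer written through SnappyOutputStream.
    # "lz4" is one of the codecs accepted by spark.io.compression.codec in 2.4;
    # adjust to whatever your cluster allows.
    spark = (
        SparkSession.builder
        .appName("snappy-empty-input-workaround")  # hypothetical app name
        .config("spark.io.compression.codec", "lz4")
        .getOrCreate()
    )

The same setting can also be passed at submit time, e.g. spark-submit --conf spark.io.compression.codec=lz4, if you prefer not to touch the application code.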
Also, if you find a solution (other than downgrading Spark to 2.0), please let us know here.