Job keeps failing with error message: "org.xerial.snappy.SnappyIOException: [EMPTY_INPUT] Cannot decompress empty stream" - how to debug?

Time: 2019-03-04 17:54:52

Tags: python apache-spark pyspark

I am using pyspark-2.4.0, and a large job keeps crashing with the following error message (either when saving to parquet or when trying to collect results):


py4j.protocol.Py4JJavaError: An error occurred while calling o2495.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 184 in stage 290.0 failed 4 times, most recent failure: Lost task 184.3 in stage 290.0 (TID 17345, 53.62.154.250, executor 5): org.xerial.snappy.SnappyIOException: [EMPTY_INPUT] Cannot decompress empty stream
    at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:94)
    at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:59)
    at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:164)
    at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:163)
    at org.apache.spark.serializer.SerializerManager.dataDeserializeStream(SerializerManager.scala:209)
    at org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:698)
    at org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:696)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:696)
    at org.apache.spark.storage.BlockManager.get(BlockManager.scala:820)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:875)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

My problem is that I have no idea which operation is causing this. The error message gives no indication, and the stack trace does not contain any of my own code.

Any ideas what could cause this, or how I can find out where the job keeps failing?

1 Answer:

Answer 0 (score: 1)

While searching online I came across this link: http://mail-archives.apache.org/mod_mbox/spark-issues/201903.mbox/%3CJIRA.13223720.1553507879000.125409.1553586300107@Atlassian.JIRA%3E

Its summary is:

Spark 2.4 uses snappy-java 1.1.7.x, whose behavior differs from the 1.1.2.x version used in Spark 2.0.x. SnappyOutputStream in 1.1.2.x always writes a snappy header whether or not a value is written, but SnappyOutputStream in 1.1.7.x does not generate a header if no value is written into it. So in Spark 2.4, if an RDD caches an empty value, the memory store will not cache any bytes (no snappy header), and it will then throw the empty-stream error.
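
To make that mechanism concrete, here is a minimal PySpark sketch (not taken from the original job) of the scenario the summary describes: an RDD with empty partitions is cached and then read back. On Spark 2.4 with the default snappy block codec this is the situation that can hit the EMPTY_INPUT path; note that a single-node run may not trigger it, since the failing frame in the trace is a remote block fetch. The commented-out spark.io.compression.codec setting is only a possible workaround to experiment with, not something the mailing-list thread confirms.

    # Hedged sketch: cache an RDD that contains empty partitions, then read it back.
    # Per the summary above, snappy-java 1.1.7.x writes no header for an empty cached
    # block, so fetching such a block can raise SnappyIOException [EMPTY_INPUT].
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("snappy-empty-block-demo")
        # Possible workaround to try (assumption, not confirmed by the thread):
        # switch block compression away from snappy so SnappyInputStream is never used.
        # .config("spark.io.compression.codec", "lz4")
        .getOrCreate()
    )
    sc = spark.sparkContext

    # 2 elements spread over 10 partitions -> at least 8 partitions are empty.
    rdd = sc.parallelize([1, 2], numSlices=10).cache()

    rdd.count()           # materializes the cache, including the empty blocks
    print(rdd.collect())  # reading the cached blocks back is where the error surfaced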

Also, if you find a solution (other than downgrading Spark to 2.0), please let us know here.