Why do my Apache Spark tasks fail? I thought that, thanks to the DAG, tasks could be recomputed even without caching? I am in fact caching, and I get either a FileNotFoundException
or the following:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9238.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9238.0 (TID 17337, ip-XXX-XXX-XXX.compute.internal): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_299_piece0 of broadcast_299
org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:930)
org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:155)
sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:160)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Which is strange, because when I run the same program on a smaller instance I don't get the FileNotFoundException - no space left on device; instead I get the error above. And when I doubled the instance size, it tells me there is no space left on the device about an hour into the job - same program, more memory, and it runs out of space! What gives?
Answer 0 (score: 2)
As described in the SPARK-751 JIRA issue:
Right now we create M * R temporary files on each machine for a shuffle, where M = number of map tasks and R = number of reduce tasks. This can be pretty high when there are lots of mappers and reducers (e.g. 1k map * 1k reduce = 1 million files for a single shuffle). The high number can cripple the file system and significantly slow the system down. We should cut this number down to O(R) instead of O(M * R).
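To make that arithmetic concrete, here is a minimal Scala sketch (with hypothetical partition counts) of the kind of wide shuffle the ticket describes. With M = 1,000 map tasks and R = 1,000 reduce tasks, the older hash-based shuffle writes on the order of M * R = 1,000,000 small intermediate files to each executor's local disk, which can exhaust inodes long before the raw byte capacity is used up:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical job illustrating how a wide shuffle multiplies temp files:
// each of the M map tasks writes one file per reduce partition, so the
// hash-based shuffle leaves roughly M * R files on local disk.
object ShuffleFileSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-file-sketch"))

    val mapPartitions    = 1000 // M: number of map tasks
    val reducePartitions = 1000 // R: number of reduce tasks

    val counts = sc.parallelize(1 to 10000000, mapPartitions) // M input partitions
      .map(i => (i % 100000, 1))                              // key-value pairs
      .reduceByKey(_ + _, reducePartitions)                   // wide shuffle into R partitions

    counts.count() // force the shuffle to materialize
    sc.stop()
  }
}

If this is what is filling the disk, executors fail with "no space left on device" even though df -h still reports free space; checking inode usage with df -i on the worker nodes will confirm it.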
So, if you do find that your disk has run out of inodes, you can try the following to fix the problem: