FileNotFoundException on shuffle files in an Apache Spark (1.6) job

Time: 2016-07-14 07:14:30

Tags: apache-spark

I am working with Spark 1.6, and it fails my job with the following error:


    java.io.FileNotFoundException: /data/05/dfs/dn/yarn/nm/usercache/willir31/appcache/application_1413512480649_0108/spark-local-20141028214722-43f1/26/shuffle_0_312_0.index (No such file or directory)
        java.io.FileOutputStream.open(Native Method)
        java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
        org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
        org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4$$anonfun$apply$2.apply(ExternalSorter.scala:733)
        org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4$$anonfun$apply$2.apply(ExternalSorter.scala:732)
        scala.collection.Iterator$class.foreach(Iterator.scala:727)
        org.apache.spark.util.collection.ExternalSorter$IteratorForPartition.foreach(ExternalSorter.scala:790)
        org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4.apply(ExternalSorter.scala:732)
        org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4.apply(ExternalSorter.scala:728)
        scala.collection.Iterator$class.foreach(Iterator.scala:727)
        scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:728)
        org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:70)
        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

I am performing a join operation. When I looked at the error closely and checked my code, I found that it fails while writing the DataFrame back to CSV. But I am not able to get rid of it. I am not using HDP; I installed all the components separately.
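For context, a minimal sketch of the shape of such a job on Spark 1.6 is shown below. The input paths, column name, and output path are hypothetical, and writing CSV in 1.6 assumes the external spark-csv package is on the classpath:

    // Hypothetical reproduction of the failing job shape: join two DataFrames,
    // then write the result out as CSV (Spark 1.6 + databricks spark-csv).
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("join-then-csv"))
    val sqlContext = new SQLContext(sc)

    // "orders" and "customers" stand in for whatever the real inputs are.
    val orders = sqlContext.read.parquet("/data/orders")
    val customers = sqlContext.read.parquet("/data/customers")

    // The join triggers a shuffle; this is the stage where the
    // shuffle_*.index file from the stack trace goes missing.
    val joined = orders.join(customers, "customer_id")

    // Writing back to CSV via the spark-csv data source.
    joined.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("/data/output/joined_csv")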

1 Answer:

Answer 0 (score: 2):

Errors like this usually happen when there is a deeper problem with some tasks, such as significant data skew. Since you do not provide enough details (be sure to read How To Ask and How to create a Minimal, Complete, and Verifiable example) or job statistics, the only approach I can think of is to significantly increase the number of shuffle partitions:

sqlContext.setConf("spark.sql.shuffle.partitions", "2048")
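As a rough sketch (assuming the Scala API, where the value is passed as a String), the setting has to be in place before the shuffling stage runs; the same property can also be supplied at submit time without changing code:

    // Raise the number of partitions used for shuffles triggered by
    // DataFrame joins/aggregations; set this before running the job.
    sqlContext.setConf("spark.sql.shuffle.partitions", "2048")

    // Equivalent at submit time:
    //   spark-submit --conf spark.sql.shuffle.partitions=2048 ...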