FileNotFound error in Spark

Asked: 2015-10-27 18:08:13

Tags: scala hadoop apache-spark hdfs

I am running a simple Spark program on a cluster:

import org.apache.spark.{SparkConf, SparkContext}

val logFile = "/home/hduser/README.md" // Should be some file on your system
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()

// blank lines so the result stands out in the console output
println()
println()
println()
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
println()
println()
println()
println()
println()

I get the following error:

 15/10/27 19:44:01 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 6) on executor 192.168.0.19: java.io.FileNotFoundException (File file:/home/hduser/README.md does not exist.) [duplicate 6]
 15/10/27 19:44:01 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
 15/10/27 19:44:01 INFO TaskSetManager: Lost task 1.3 in stage 0.0 (TID 7) on executor 192.168.0.19: java.io.FileNotFoundException (File file:/home/hduser/README.md does not exist.) [duplicate 7]
 15/10/27 19:44:01 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
 15/10/27 19:44:01 INFO TaskSchedulerImpl: Cancelling stage 0
 15/10/27 19:44:01 INFO DAGScheduler: ResultStage 0 (count at SimpleApp.scala:55) failed in 7.636 s
 15/10/27 19:44:01 INFO DAGScheduler: Job 0 failed: count at SimpleApp.scala:55, took 7.810387 s
 Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 192.168.0.19): java.io.FileNotFoundException: File file:/home/hduser/README.md does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:125)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:78)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:51)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

The file is in the correct location. If I replace README.md with README.txt, it works fine. Can someone help?

Thanks

2 Answers:

Answer 0 (score: 3)

If you are running a multi-node cluster, make sure every node has the file at the same path on its own local filesystem. Or, you know, just use HDFS.

In the multi-node case, the path "/home/hduser/README.md" is shipped to the worker nodes as well, but README.md probably exists only on the master node. When the workers try to access the file, they do not look at the master's filesystem; each one looks for it on its own local filesystem. The code works only if the same file exists at the same path on every node, so to go this route, copy the file to every node's filesystem at that path.

As you have already noticed, the solution above is cumbersome. Hadoop's distributed filesystem, HDFS, solves this problem (among others); you should look into it.
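For illustration, a minimal sketch of the HDFS approach, assuming the file has already been uploaded (for example with hdfs dfs -put /home/hduser/README.md /user/hduser/README.md); the NameNode host/port and the HDFS path below are placeholders for your cluster's actual values:

import org.apache.spark.{SparkConf, SparkContext}

object HdfsRead {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)

    // Placeholder NameNode address and path; every executor reads the
    // same copy from HDFS, so no node-local file is needed.
    val logFile = "hdfs://master:9000/user/hduser/README.md"
    val logData = sc.textFile(logFile).cache()
    println("Line count: " + logData.count())

    sc.stop()
  }
}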

Answer 1 (score: 2)

This is simply because a file with an .md extension contains plain text along with formatting information. When you save the file with a .txt extension, that formatting information is removed or ignored. sc.textFile() works with plain text.
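A minimal usage sketch, reusing the SparkContext sc from the question's code and assuming a plain-text copy of the file exists at the hypothetical path below:

// sc.textFile treats the input as plain text, one record per line
val lines = sc.textFile("/home/hduser/README.txt")
lines.take(5).foreach(println)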