How can I load a local file with sc.textFile in Spark?

Asked: 2016-12-27 04:46:02

Tags: scala file apache-spark

I have been trying to load a local file in Spark with sc.textFile().

I have already read this question: How to load local file in sc.textFile, instead of HDFS

I have a local file at /home/spark/data.txt on CentOS 7.0.

When I run val data = sc.textFile("file:///home/spark/data.txt").collect, I get the following error:

  

16/12/27 12:15:56 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 36, ): java.io.FileNotFoundException: File file:/home/spark/data.txt does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
        at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
        at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:246)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:209)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

16/12/27 12:15:56 ERROR TaskSetManager: Task 0 in stage 5.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 42, ): java.io.FileNotFoundException: File file:/home/spark/data.txt does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
        at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
        at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:246)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:209)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:892)
        ... 48 elided
Caused by: java.io.FileNotFoundException: File file:/home/spark/data.txt does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
        at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
        at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:246)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:209)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

There clearly is a file at this path. If I use a wrong path instead, the error looks different, as shown below:

 val data = sc.textFile("file:///data.txt").collect
  

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/data.txt
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:892)

I have no idea why it does not work.

Any ideas?

4 answers:

Answer 0 (score: 1)

Copy the file into the $SPARK_HOME folder and use the following command: val data = sc.textFile("data.txt").collect
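
A minimal sketch of that approach (assuming spark-shell is started from $SPARK_HOME with a local master, so the relative path resolves against that directory):

    // The file was copied into $SPARK_HOME beforehand, e.g.:
    //   cp /home/spark/data.txt $SPARK_HOME/
    // A path without a scheme is resolved against the default filesystem, so this
    // is only reliable when running with a local master.
    val data = sc.textFile("data.txt").collect()
    data.take(5).foreach(println)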

Answer 1 (score: 0)

Use val data = sc.textFile("/home/spark/data.txt"); this should work. Also set the master to local.
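
In a standalone application, setting the master to local might look like the following sketch (with spark-shell you would instead pass --master local[*] on the command line; the app name here is just an example):

    import org.apache.spark.{SparkConf, SparkContext}

    object LocalFileExample {
      def main(args: Array[String]): Unit = {
        // Run the driver and executors on this machine, so the file only needs to exist locally.
        val conf = new SparkConf().setMaster("local[*]").setAppName("local-file-example")
        val sc = new SparkContext(conf)

        val data = sc.textFile("/home/spark/data.txt").collect()
        data.take(5).foreach(println)

        sc.stop()
      }
    }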

Answer 2 (score: 0)

Your data file has to exist at '/home/spark/data.txt' on every executor node, not just on the driver. I know that sounds a bit absurd. To work around it, you can use one of the following options:

  1. Move the data file to HDFS.
  2. Copy the data file to the same path on all executor nodes.
  3. Load the file in plain Scala (not Spark) and then create an RDD from it with sc.parallelize(), as in the sketch after this list.
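
A minimal sketch of option 3, assuming the file fits comfortably in the driver's memory (scala.io.Source is standard Scala; the variable names are only illustrative):

    import scala.io.Source

    // Read the file on the driver only, so it does not have to exist on the executors.
    val source = Source.fromFile("/home/spark/data.txt")
    val lines  = try source.getLines().toList finally source.close()

    // Distribute the in-memory lines across the cluster as an RDD.
    val rdd = sc.parallelize(lines)
    println(rdd.count())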

Answer 3 (score: 0)

The problem is that your local machine and what Spark considers "local" are not the same thing. So when you run pyspark you have to state explicitly that your code should run on the local machine, especially when using AWS EC2. Just run ./pyspark --master local[n]; after that, your local machine and Spark's "local" will be the same. And don't forget to use the file:/// prefix.
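
For example (a sketch; local[4] uses 4 worker threads and is an arbitrary choice, and the same --master flag works for spark-shell):

    // Started with a local master so the driver and executors share one machine, e.g.:
    //   ./bin/spark-shell --master local[4]     (or ./bin/pyspark --master local[4])
    // The file:// prefix tells Spark to read from the local filesystem instead of HDFS.
    val data = sc.textFile("file:///home/spark/data.txt")
    println(data.count())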