Spark FileNotFound exception when writing a file and then reading it back

Date: 2019-10-31 15:22:34

Tags: java apache-spark

I am trying to perform the following steps within a single job: 1) write a new file, and 2) read the newly created file into a Dataset in Spark.

    import java.io.FileNotFoundException;
    import java.io.PrintWriter;
    import java.io.UnsupportedEncodingException;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // 1) Write the new file on the local filesystem.
    try (PrintWriter writer = new PrintWriter("/tmp/fileName.txt", "UTF-8")) {
        writer.println("The,first,line");
        writer.println("The,second,line");
    } catch (FileNotFoundException | UnsupportedEncodingException e) {
        e.printStackTrace();
    }

    // 2) Read the newly created file back as a Dataset.
    Dataset<Row> data = sparkSession.sqlContext().read()
            .format("com.databricks.spark.csv")
            .load("file:///tmp/fileName.txt");
    data.show();

I am running into the following error:

    java.io.FileNotFoundException: File file:/tmp/fileName.txt does not exist
    It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:157)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

I want to complete both operations within a single job. The job works fine when I run it in local mode, but it fails in standalone mode.
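
If the standalone executors run on machines that do not share the driver's local /tmp, a file:///tmp path written on the driver may not resolve for them. As a point of comparison, here is a minimal sketch that builds the same two lines as an in-memory Dataset instead of reading them back from a local file; the class name InMemoryCsvExample and the column names c0/c1/c2 are invented for illustration only:

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    public class InMemoryCsvExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("in-memory-rows")
                    .getOrCreate();

            // The same two comma-separated lines, kept in driver memory
            // instead of a node-local file under /tmp.
            List<Row> rows = Arrays.asList(
                    RowFactory.create("The", "first", "line"),
                    RowFactory.create("The", "second", "line"));

            // Hypothetical column names; the original CSV has no header.
            StructType schema = DataTypes.createStructType(new StructField[] {
                    DataTypes.createStructField("c0", DataTypes.StringType, false),
                    DataTypes.createStructField("c1", DataTypes.StringType, false),
                    DataTypes.createStructField("c2", DataTypes.StringType, false)});

            // createDataFrame ships the rows with the job itself, so executors
            // never have to find /tmp/fileName.txt on their own filesystems.
            Dataset<Row> data = spark.createDataFrame(rows, schema);
            data.show();
        }
    }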

0 Answers:

No answers yet.