Question

The problem

I have a file saved in HDFS, and all I want to do is run my Spark application, compute a result JavaRDD, and use saveAsTextFile() to store the new "file" in HDFS.

However, Spark's saveAsTextFile() does not work if the file already exists. It does not overwrite it.

What I tried

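So I searched for a solution and found that one possible way to make it work is to delete the file through the HDFS API before trying to save the new one.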

I added this code:

FileSystem hdfs = FileSystem.get(new Configuration());
Path newFolderPath = new Path("hdfs://node1:50050/hdfs/" +filename);

if(hdfs.exists(newFolderPath)){
    System.out.println("EXISTS");
    hdfs.delete(newFolderPath, true);
}

filerdd.saveAsTextFile("/hdfs/" + filename);

When I run my Spark application, the file is deleted, but I get a FileNotFoundException.

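Given that this exception is thrown when something tries to read a file from a path where no file exists, this makes no sense to me, because after the file is deleted there is no code that tries to read it.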

Part of my code

 JavaRDD<String> filerdd = sc.textFile("/hdfs/" + filename);    // load the file here
 ...
 ...
 // Transformations here
 filerdd = filerdd.map(....);
 ...
 ...

 // Delete old file here
 FileSystem hdfs = FileSystem.get(new Configuration());
 Path newFolderPath = new Path("hdfs://node1:50050/hdfs/" +filename);

 if(hdfs.exists(newFolderPath)){
    System.out.println("EXISTS");
    hdfs.delete(newFolderPath, true);
 }

 // Write new file here
 filerdd.saveAsTextFile("/hdfs/" + filename);

I am trying to do the simplest possible thing here, but I have no idea why it does not work. Maybe filerdd is somehow tied to the path?

Answer 1

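The problem is that you use the same path for input and output. Spark RDDs are evaluated lazily: nothing is read or computed until you call saveAsTextFile. By that point you have already deleted newFolderPath, so the read behind filerdd fails because its input no longer exists.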
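The ordering problem can be reproduced without a cluster. The following is a plain-Java analogy, not Spark code: a `Supplier` stands in for the lazy RDD, and a local temp file stands in for the HDFS input. The class and method names (`LazyReadDemo`, `lazyUpperCase`, `deleteThenRun`) are made up for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;

public class LazyReadDemo {

    // Describes a "read and transform" step without running it.
    // The file is only opened when get() is called, much like an RDD
    // is only computed when an action such as saveAsTextFile runs.
    static Supplier<List<String>> lazyUpperCase(Path input) {
        return () -> {
            try {
                return Files.readAllLines(input).stream()
                        .map(String::toUpperCase)
                        .collect(Collectors.toList());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
    }

    // Mirrors the order of operations in the question:
    // define the pipeline, delete the input, then trigger the work.
    static String deleteThenRun() throws IOException {
        Path input = Files.createTempFile("lazy-demo", ".txt");
        Files.write(input, List.of("a", "b"));

        Supplier<List<String>> pipeline = lazyUpperCase(input); // nothing is read yet
        Files.delete(input);                                    // the "delete old file" step

        try {
            pipeline.get();                                     // the read happens only now
            return "ok";
        } catch (UncheckedIOException e) {
            return e.getCause() instanceof NoSuchFileException ? "missing input" : "other error";
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(deleteThenRun()); // prints "missing input"
    }
}
```

Defining the pipeline looks like it "loads the file", but the read is deferred until `get()`, which is why deleting the input first makes the later step fail.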

In any case, you should not use the same path for input and output.
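If the result really must end up at the original location, one possible approach is to write to a separate path first and swap it in only after the job has finished. This is a rough sketch, not a tested recipe: the `.tmp` staging suffix, the `trim()` placeholder transformation, and taking the file name from `args[0]` are assumptions made for illustration, and the paths are resolved against the cluster's default file system.

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.IOException;

public class OverwriteViaTempPath {

    public static void main(String[] args) throws IOException {
        // Assumption: the file name under /hdfs/ is passed as the first argument.
        String filename = args[0];
        Path target = new Path("/hdfs/" + filename);
        Path staging = new Path("/hdfs/" + filename + ".tmp"); // hypothetical scratch location

        SparkConf conf = new SparkConf().setAppName("overwrite-via-temp-path");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            FileSystem fs = target.getFileSystem(sc.hadoopConfiguration());

            // Remove leftovers from an earlier failed run so saveAsTextFile does not refuse to write.
            if (fs.exists(staging)) {
                fs.delete(staging, true);
            }

            // Lazy: this only records where the input lives and what to do with it.
            JavaRDD<String> filerdd = sc.textFile(target.toString())
                    .map(line -> line.trim()); // placeholder for the real transformations

            // Action: the input is read here, while it still exists, and written to the staging path.
            filerdd.saveAsTextFile(staging.toString());

            // Only after the job has finished is it safe to replace the old data.
            fs.delete(target, true);
            if (!fs.rename(staging, target)) {
                throw new IOException("Could not rename " + staging + " to " + target);
            }
        }
    }
}
```

The delete and rename are two separate steps, so a crash between them leaves only the staging copy. Writing each run to a fresh output path avoids that window entirely.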

Overwriting an HDFS file/directory through Spark
