Question

I'm new to Spark. I have a file, TrainDataSpark.java, in which I process some data. At the end I save the Spark-processed data to a directory named Predictions with the following code:

predictions.saveAsTextFile("Predictions");

In the same TrainDataSpark.java, I added the following code right after the line above.

OutputGeneratorOptimized ouputGenerator = new OutputGeneratorOptimized();
final Path predictionFilePath = Paths.get("/Predictions/part-00000");
final Path outputHtml = Paths.get("/outputHtml.html");
ouputGenerator.getFormattedHtml(input,predictionFilePath,outputHtml);

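I get a NoSuchFile exception for /Predictions/part-00000. I have tried every path I can think of, but it still fails. I think the Java code is looking for the file on my local system rather than on the HDFS cluster. Is there a way to get the file path from the cluster so that I can pass it in? Or is there a way to write my Predictions files locally instead of on the cluster, so that the Java part runs without errors?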

Answer 1

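This happens if you are running Spark on a cluster. Paths.get looks for the file in the local file system of each individual node, whereas the file actually lives on HDFS. You can load the file with sc.textFile("hdfs:/Predictions") (or sc.textFile("Predictions")).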
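
To see why the lookup fails, here is a small check using only the JDK. It is just a sketch: the class name LocalPathCheck and the helper isLocal are made up for illustration. It shows that a Path built with Paths.get is bound to the JVM's default (local) file system, so it never consults HDFS.

```java
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LocalPathCheck {

    /** Returns true when the path belongs to the JVM's default (local) file system. */
    public static boolean isLocal(Path path) {
        return path.getFileSystem() == FileSystems.getDefault();
    }

    public static void main(String[] args) {
        // Same path as in the question; Paths.get always uses the default file system.
        Path predictionFilePath = Paths.get("/Predictions/part-00000");

        // The URI scheme shows which file system the path is resolved against.
        System.out.println("scheme: " + predictionFilePath.toUri().getScheme());
        System.out.println("local file system: " + isLocal(predictionFilePath));

        // On a machine without a local /Predictions directory this reports false,
        // even if the same path exists on HDFS.
        System.out.println("exists locally: " + Files.exists(predictionFilePath));
    }
}
```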

On the other hand, if you want to save to the local file system, you first need to collect the RDD and then save it with regular Java IO.
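
A minimal sketch of the "collect, then plain Java IO" route, assuming predictions is an RDD of strings (for example a JavaRDD<String>), so that predictions.collect() gives a List<String> on the driver. The class LocalPredictionWriter, the method writeLines, the sample records and the Predictions-local directory are hypothetical names used only for this illustration. The Spark call itself is left as a comment so the snippet runs without a cluster.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class LocalPredictionWriter {

    /** Writes each collected record as one line of a local text file. */
    public static Path writeLines(List<String> lines, Path target) throws IOException {
        // Create the parent directory first, since Files.write does not do that.
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        return Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the driver-side result of: predictions.collect()
        // (assumption: the RDD elements are strings; map them to strings first otherwise).
        List<String> lines = Arrays.asList("1.0,0.0", "0.0,1.0");

        Path written = writeLines(lines, Paths.get("Predictions-local", "part-00000"));
        System.out.println("wrote " + written.toAbsolutePath());
    }
}
```

Keep in mind that collect pulls the whole RDD into the driver's memory, so this only makes sense when the predictions are small enough to fit there. The resulting local file can then be passed to code that expects a java.nio.file.Path.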

Answer 2

This is how I figured it out...

String predictionFilePath ="hdfs://pathToHDFS/user/username/Predictions/part-00000";
String outputHtml = "hdfs://pathToHDFS/user/username/outputHtml.html";

URI uriRead = URI.create(predictionFilePath);
URI uriOut = URI.create(outputHtml);

Configuration conf = new Configuration ();

FileSystem fileRead = FileSystem.get (uriRead, conf);
FileSystem fileWrite = FileSystem.get (uriOut, conf);

FSDataInputStream in = fileRead.open(new org.apache.hadoop.fs.Path(uriRead));
FSDataOutputStream out = fileWrite.append(new org.apache.hadoop.fs.Path(uriOut));

/*Java code that uses stream objects to write and read*/
OutputGeneratorOptimized ouputGenerator = new OutputGeneratorOptimized();
ouputGenerator.getFormattedHtml(input,in,out);
