Question

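I have to compare CSV files and then remove all duplicate rows. My situation is this: I have a folder in which I must keep every filtered result. When new files arrive, I have to compare the files already in the folder with the new ones and, finally, write the result back to the same folder.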

eg: /data/ingestion/file1.csv

   a1 b1 c1
   a2 b2 c2
   a3 b3 c3

/data/ingestion/file2.csv

   a4 b4 c4
   a5 b5 c5
   a6 b6 c6

new upcoming file (upcoming_file.csv):

   a1 b1 c1
   a5 b5 c5
   a7 b7 c7

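My current approach is to create one DataFrame from all the files in /data/ingestion/*, then create a DataFrame from stored_file.csv and append the two with a union operation. Finally, I apply a distinct transformation. Now I have to write the result back to /data/ingestion, making sure it contains no duplicates, so I chose the overwrite mode.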

deleted_duplicate.write
  .format("csv")
  .mode("overwrite")
  .save("hdfs://localhost:8020/data/ingestion/")

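But then I ended up deleting everything in the /data/ingestion folder. Even the new DataFrame was not written out as CSV files.

I have tried other options as well, but none of them achieved what I described above!

Thanks in advance!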

Answer 1

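I suggest writing the output to a new directory on HDFS. If processing fails, you can always discard what was processed and restart from scratch with the original data. It is safe and easy. :)

Once processing has finished, just delete the old directory and rename the new one to the old name.

Update: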

deleted_duplicate.write
  .format("csv")
  .mode("overwrite")
  .save("hdfs://localhost:8020/data/ingestion_tmp/")

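    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
    conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
    FileSystem hdfs = FileSystem.get(URI.create("hdfs://<namenode-hostname>:<port>"), conf);
    // Delete the old directory recursively (second argument), then move the temporary output into its place.
    hdfs.delete(new Path("hdfs://localhost:8020/data/ingestion"), true);
    hdfs.rename(new Path("hdfs://localhost:8020/data/ingestion_tmp"), new Path("hdfs://localhost:8020/data/ingestion"));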

See the HDFS FileSystem API documentation for details on these calls.
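
As a side note, the union-then-distinct step from the question can be sanity-checked without Spark. What follows is only a minimal plain-Java sketch of the expected result, not Spark code: the class name `UnionDistinctSketch` and the method `unionDistinct` are hypothetical, and each CSV row is assumed to be compared as a whole string.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; it is not part of Spark or Hadoop.
class UnionDistinctSketch {

    // Appends newRows to existingRows and keeps only the first occurrence of each row.
    // Assumption: two rows are duplicates when their full text is identical.
    static List<String> unionDistinct(List<String> existingRows, List<String> newRows) {
        // LinkedHashSet drops duplicates while preserving insertion order.
        Set<String> seen = new LinkedHashSet<>(existingRows);
        seen.addAll(newRows);
        return new ArrayList<>(seen);
    }
}
```

With the six rows of file1.csv and file2.csv as `existingRows` and the three rows of the upcoming file as `newRows`, the result holds seven rows, a1 through a7, each exactly once. That is the content the temporary directory should end up with before it is swapped into place.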

Choosing options for writing CSV files in Spark (HDFS)?
