Question

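I need to compute the set difference of two files in Java. Each file has about 50 million lines, so I cannot load them completely into memory. I could do it in stages, but I was planning to use the Linux comm command, which works efficiently.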

Is there a library in Java that does this efficiently?
Is calling shell commands from a program bad design?

Details

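I have file1 and file2, each with more than 40 million lines. I do not want to hold them in memory. I need the set difference file1 - file2, i.e. the lines that are in file1 but not in file2. In general, I would follow this algorithm: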

 1. Read file1 line by line and save each line in a HashSet.
 2. Read file2 line by line.
 3. Remove each line of file2 from the HashSet if present.
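For reference, the three steps above could be sketched roughly as follows. The class and method names (`HashSetDifference`, `difference`) are made up for illustration, UTF-8 input is an assumption, and the whole of file1 still has to fit in the heap, which is exactly the limitation the question is about.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

// Hypothetical helper, not part of any library.
class HashSetDifference {

    /** Returns the distinct lines of file1 that never occur in file2 (order is not preserved). */
    static Set<String> difference(Stream<String> file1Lines, Stream<String> file2Lines) {
        Set<String> remaining = new HashSet<>();
        file1Lines.forEach(remaining::add);     // step 1: keep every line of file1
        file2Lines.forEach(remaining::remove);  // steps 2 and 3: drop lines seen in file2
        return remaining;
    }

    /** Same idea, reading both files lazily line by line (assumes UTF-8 text). */
    static Set<String> difference(Path file1, Path file2) {
        try (Stream<String> a = Files.lines(file1, StandardCharsets.UTF_8);
             Stream<String> b = Files.lines(file2, StandardCharsets.UTF_8)) {
            return difference(a, b);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Only file2 is streamed here; the set built from file1 is what dominates memory use.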

Is there any way to do this without saving file1 in a HashSet?

Edit: My solution

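I finally decided to use a Bloom filter for this. I know a Bloom filter gives approximate answers, but I have made the bitset long enough *(14 * the size of file1, i.e. 10 million)*, which should give me a false-positive rate of 10^-9. The algorithm is as follows: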

 1. Read each line of file2 and add to Bloom Filter.
 2. Now, file2 is compressed from 300MB+ to 40MB+
 3. Read each line of file1, if not present in filter print the line
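A minimal sketch of these steps, assuming a hand-rolled filter over `java.util.BitSet`. The class `LineBloomFilter`, the FNV-1a hash and the double-hashing scheme are illustrative choices, not what the poster actually used. A false positive here means a line of file1 is wrongly dropped from the output; a line that really is in file2 is never printed.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.function.Consumer;

// Hypothetical, simplified Bloom filter for text lines.
class LineBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    LineBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // 64-bit FNV-1a over the UTF-8 bytes of the line.
    private static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Double hashing: the i-th bit index is derived from the two 32-bit halves.
    private int index(long hash, int i) {
        int h1 = (int) hash;
        int h2 = (int) (hash >>> 32);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String line) {
        long h = fnv1a64(line);
        for (int i = 0; i < numHashes; i++) bits.set(index(h, i));
    }

    boolean mightContain(String line) {
        long h = fnv1a64(line);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(h, i))) return false;  // definitely never added
        }
        return true;                                   // added, or a false positive
    }

    // Textbook estimate (1 - e^(-k*n/m))^k for n items, m bits, k hash functions.
    static double estimatedFalsePositiveRate(long n, long m, int k) {
        return Math.pow(1.0 - Math.exp(-((double) k * n) / m), k);
    }

    // Steps 1 and 3 of the list above: load file2 into the filter, then scan file1.
    static void difference(Iterable<String> file1, Iterable<String> file2,
                           LineBloomFilter filter, Consumer<String> out) {
        for (String line : file2) filter.add(line);
        for (String line : file1) {
            if (!filter.mightContain(line)) out.accept(line);
        }
    }
}
```

It is worth running `estimatedFalsePositiveRate` on the real sizes before trusting the output: with 14 bits per line and 10 hash functions, for instance, this formula gives roughly 1.2e-3. A library implementation such as Guava's `BloomFilter` could replace the hand-rolled class.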

Answer 1

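Using a shell script adds an extra dependency to your application and may also make it platform-dependent, for example on operating systems that do not have comm.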

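Have you tried processing the files with an InputStream? It does not load the whole content into memory. If comm does what you need, then you only want a line-by-line difference, and you can try doing that with an InputStream.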

Note that if you are going to use comm, you should make sure your files are already sorted.
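If both files are already sorted, the same line-by-line difference can be done in plain Java with a single merge pass that holds only one line per file in memory. The sketch below is one possible way to do it; `SortedDifference` and its methods are hypothetical names, and it assumes both inputs are sorted in the order of `String.compareTo` (for ASCII data this matches a byte-order sort). Unlike comm, it uses set semantics: every copy of a line that occurs in file2 is dropped.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.function.Consumer;
import java.util.stream.Stream;

// Hypothetical helper, not part of any library.
class SortedDifference {

    /** Emits every line of sorted1 that does not occur in sorted2. */
    static void difference(Iterator<String> sorted1, Iterator<String> sorted2,
                           Consumer<String> out) {
        String b = sorted2.hasNext() ? sorted2.next() : null;
        while (sorted1.hasNext()) {
            String a = sorted1.next();
            // Advance file2 until its current line is >= the current line of file1.
            while (b != null && b.compareTo(a) < 0) {
                b = sorted2.hasNext() ? sorted2.next() : null;
            }
            if (b == null || !b.equals(a)) {
                out.accept(a);  // a is not in file2
            }
        }
    }

    /** File-based wrapper; lines are read lazily and written as they are found. */
    static void differenceOfFiles(Path file1, Path file2, Path result) {
        try (Stream<String> a = Files.lines(file1);
             Stream<String> b = Files.lines(file2);
             PrintWriter writer = new PrintWriter(Files.newBufferedWriter(result))) {
            difference(a.iterator(), b.iterator(), writer::println);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

`PrintWriter` swallows write errors, so a stricter version would call `checkError()` on the writer or write through a `BufferedWriter` directly.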

Answer 2

 1. You can invoke shell (cmd) commands with a ProcessBuilder object.
 2. In my opinion, there are more efficient ways (batch files, etc.).
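As a rough illustration of the ProcessBuilder route, the sketch below prepares a `comm -23` call, which prints only the lines unique to the first file. `CommRunner` and its method names are invented for this example. It assumes comm is on the PATH and that both inputs were sorted under the same locale; setting `LC_ALL=C` is an assumption that they were sorted in byte order.

```java
import java.io.IOException;
import java.nio.file.Path;

// Hypothetical helper, not part of any library.
class CommRunner {

    /** Prepares "comm -23 sorted1 sorted2" with its output redirected to result. */
    static ProcessBuilder commDifference(Path sorted1, Path sorted2, Path result) {
        ProcessBuilder pb = new ProcessBuilder(
                "comm", "-23", sorted1.toString(), sorted2.toString());
        pb.environment().put("LC_ALL", "C");                // assumed sort order of the inputs
        pb.redirectOutput(result.toFile());                 // write the difference to a file
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);  // surface comm's error messages
        return pb;
    }

    /** Starts the process and returns its exit code (0 means success). */
    static int run(ProcessBuilder pb) throws IOException, InterruptedException {
        return pb.start().waitFor();
    }
}
```

Passing the arguments as separate strings avoids shell quoting problems, but the program still only works where comm is installed.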
