Question

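I have the following scenario:

URL text file A
URL text file B

Each file is about 4 GB in size.

I need to compute:

All URLs in A that are not in B
All URLs in B that are not in A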

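All the Java diff examples I have found online load the entire list into memory (either with a Map or with a memory-mapped (MMap) solution). My system has no swap and does not have enough memory to do this without using external memory.

Does anyone know of a solution?

This project can sort huge files without using much memory: https://github.com/lemire/externalsortinginjava

I am looking for something similar, but one that produces diffs. I will start by trying to implement this using that project as a baseline.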
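
For reference, the external-sort idea that such a project relies on can be sketched with the standard library alone: cut the input into sorted runs that fit in memory, write each run to a temporary file, then merge the runs with a priority queue. This is only an illustrative sketch and not the library's implementation; the class name ExternalSortSketch and the maxLinesInMemory parameter are hypothetical, and it assumes UTF-8 input sorted by String.compareTo.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalSortSketch {

  /** One open run plus its smallest unread line. */
  private static final class Run implements Comparable<Run> {
    final BufferedReader reader;
    String head;

    Run(BufferedReader reader) throws IOException {
      this.reader = reader;
      this.head = reader.readLine();
    }

    @Override
    public int compareTo(Run other) {
      return head.compareTo(other.head);
    }
  }

  /** Sorts input into output while buffering at most maxLinesInMemory lines. */
  public static void sort(Path input, Path output, int maxLinesInMemory) throws IOException {
    List<Path> runs = new ArrayList<>();
    // Phase 1: split the input into sorted runs stored in temporary files.
    try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
      List<String> buffer = new ArrayList<>();
      String line;
      while ((line = in.readLine()) != null) {
        buffer.add(line);
        if (buffer.size() >= maxLinesInMemory) {
          runs.add(writeRun(buffer));
          buffer.clear();
        }
      }
      if (!buffer.isEmpty()) {
        runs.add(writeRun(buffer));
      }
    }
    // Phase 2: k-way merge; the queue always yields the run with the smallest head line.
    PriorityQueue<Run> queue = new PriorityQueue<>();
    try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
      for (Path run : runs) {
        Run r = new Run(Files.newBufferedReader(run, StandardCharsets.UTF_8));
        if (r.head != null) queue.add(r); else r.reader.close();
      }
      while (!queue.isEmpty()) {
        Run r = queue.poll();
        out.write(r.head);
        out.newLine();
        r.head = r.reader.readLine();
        if (r.head != null) queue.add(r); else r.reader.close();
      }
    } finally {
      // Close any readers left open by an error, then remove the temporary runs.
      for (Run r : queue) r.reader.close();
      for (Path run : runs) Files.deleteIfExists(run);
    }
  }

  /** Sorts the buffered lines and writes them to a new temporary file. */
  private static Path writeRun(List<String> buffer) throws IOException {
    Collections.sort(buffer);
    Path run = Files.createTempFile("sort-run-", ".txt");
    Files.write(run, buffer, StandardCharsets.UTF_8);
    return run;
  }
}
```

Calling ExternalSortSketch.sort(input, output, 1_000_000) would keep roughly a million lines in memory at a time; the right value depends on line length and available heap.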

Answer 1

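If the system has enough storage, you can do this with a database. For example:

Create an H2 or SQLite database (data stored on disk; allocate as much cache as the system can afford).
Load the text files into tables A and B (create an index on the "url" column).
Then run: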

select url from A where URL not in (select distinct url from B)
select url from B where URL not in (select distinct url from A)

Answer 2

Here is the gist of the solution I came up with: https://gist.github.com/nddipiazza/16cb2a0d23ee60a07121893c26065de4

import com.google.common.collect.Sets;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;

import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class DiffTextFilesUtil {
  static public int CHUNK_SIZE = 100000;

  static public class DiffResult {
    public Set<String> addedVals = new HashSet<>();
    public Set<String> removedVals = new HashSet<>();
  }

  /**
   * Computes the diff of two sorted files.
   * Both files must be sorted in String.compareTo order and contain no duplicate lines.
   * @param lhs left hand file - sort this using com.google.code.externalsortinginjava:externalsortinginjava:0.2.5
   * @param rhs right hand file - sort this using com.google.code.externalsortinginjava:externalsortinginjava:0.2.5
   * @return DiffResult.addedVals holds lines present in rhs but not in lhs. DiffResult.removedVals holds lines present in lhs but not in rhs.
   * @throws IOException if either file cannot be read
   */
  public static DiffResult diff(File lhs, File rhs) throws IOException {

    DiffResult diffResult = new DiffResult();

    LineIterator lhsIter = FileUtils.lineIterator(lhs);
    LineIterator rhsIter = FileUtils.lineIterator(rhs);

    try {
      String lhsTop = null;
      // Look-ahead rhs line that has been read but not yet assigned to a chunk.
      String rhsTop = null;
      while (lhsIter.hasNext()) {
        int ct = CHUNK_SIZE;

        // Read up to CHUNK_SIZE lhs lines; lhsTop ends up as the largest line of the chunk.
        Set<String> setLhs = Sets.newHashSet();
        Set<String> setRhs = Sets.newHashSet();
        while (lhsIter.hasNext() && ct-- > 0) {
          lhsTop = lhsIter.nextLine();
          setLhs.add(lhsTop);
        }
        // Pull in every rhs line that sorts at or before lhsTop.
        while (rhsTop != null || rhsIter.hasNext()) {
          if (rhsTop == null) {
            rhsTop = rhsIter.nextLine();
          }
          if (rhsTop.compareTo(lhsTop) > 0) {
            break; // belongs to a later chunk, so keep it as look-ahead
          }
          setRhs.add(rhsTop);
          rhsTop = null;
        }
        Sets.difference(setLhs, setRhs).copyInto(diffResult.removedVals);
        Sets.difference(setRhs, setLhs).copyInto(diffResult.addedVals);
      }
      // lhs is exhausted, so every remaining rhs line is an addition.
      if (rhsTop != null) {
        diffResult.addedVals.add(rhsTop);
      }
      while (rhsIter.hasNext()) {
        diffResult.addedVals.add(rhsIter.nextLine());
      }
    } finally {
      lhsIter.close();
      rhsIter.close();
    }
    return diffResult;
  }
}
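
Note that DiffTextFilesUtil still collects every difference in two in-memory sets, which may itself be too large when the files differ heavily. A minimal sketch of a fully streaming variant follows, assuming both inputs are UTF-8 and already sorted in String.compareTo order (if the files were sorted with a different comparator, the comparison here would have to match it). It uses only the standard library; the class name SortedFileDiffSketch and the output parameters onlyInLhs and onlyInRhs are hypothetical and not part of the gist.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SortedFileDiffSketch {

  /**
   * Walks two sorted files once. Lines found only in lhs are written to onlyInLhs,
   * lines found only in rhs are written to onlyInRhs. Only the two current lines
   * (plus I/O buffers) are held in memory.
   */
  public static void diff(Path lhs, Path rhs, Path onlyInLhs, Path onlyInRhs) throws IOException {
    try (BufferedReader left = Files.newBufferedReader(lhs, StandardCharsets.UTF_8);
         BufferedReader right = Files.newBufferedReader(rhs, StandardCharsets.UTF_8);
         BufferedWriter outLeft = Files.newBufferedWriter(onlyInLhs, StandardCharsets.UTF_8);
         BufferedWriter outRight = Files.newBufferedWriter(onlyInRhs, StandardCharsets.UTF_8)) {
      String l = left.readLine();
      String r = right.readLine();
      while (l != null || r != null) {
        // An exhausted side is treated as larger than any remaining line.
        int cmp = (l == null) ? 1 : (r == null) ? -1 : l.compareTo(r);
        if (cmp < 0) {
          // l sorts before r, so rhs cannot contain it any more.
          outLeft.write(l);
          outLeft.newLine();
          l = skipDuplicates(left, l);
        } else if (cmp > 0) {
          // r sorts before l, so lhs cannot contain it any more.
          outRight.write(r);
          outRight.newLine();
          r = skipDuplicates(right, r);
        } else {
          // Same line on both sides: advance both without reporting anything.
          l = skipDuplicates(left, l);
          r = skipDuplicates(right, r);
        }
      }
    }
  }

  /** Returns the next line that differs from current, or null at end of file. */
  private static String skipDuplicates(BufferedReader reader, String current) throws IOException {
    String next = reader.readLine();
    while (next != null && next.equals(current)) {
      next = reader.readLine();
    }
    return next;
  }
}
```

Because the results go straight to disk, memory use stays constant regardless of how many lines differ, and repeated lines within one file are reported at most once.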

Comparing two large text files of URLs in Java using only external memory?
