Compare two large text files of URLs in Java using only external memory?

Asked: 2018-12-03 17:39:58

Tags: java sorting diff

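I have the following scenario:

  • Text file A of URLs
  • Text file B of URLs

Each file is about 4 GB in size.

I need to compute:

  • All URLs in A that are not in B
  • All URLs in B that are not in A

All the Java diff examples I have found online load the entire list into memory (using a Map or a memory-mapped file). My system has no swap and too little RAM for that, so the comparison has to rely on external memory (disk).

Does anyone know of a solution?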

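This project can sort very large files without using much memory: https://github.com/lemire/externalsortinginjava

I am looking for something similar that produces a diff instead. I will start by trying to implement this using that project as a baseline.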
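
For readers unfamiliar with the technique, here is a minimal sketch of how an external sort keeps memory bounded: cut the input into sorted runs that each fit in RAM, then merge the runs with a priority queue. This is not the API of the project linked above. `ExternalSortSketch`, `sort`, and `maxLinesInMemory` are hypothetical names chosen for illustration, and the sketch assumes UTF-8 input and that dropping duplicate lines is acceptable (which simplifies a later diff).

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of an external merge sort; not a library API.
final class ExternalSortSketch {

  /** One open sorted run plus the line currently at its head. */
  private static final class Run {
    final BufferedReader reader;
    String head;

    Run(BufferedReader reader) throws IOException {
      this.reader = reader;
      this.head = reader.readLine();
    }
  }

  /** Sorts input into output in String.compareTo order, dropping duplicate lines. */
  static void sort(Path input, Path output, int maxLinesInMemory) throws IOException {
    List<Path> runs = new ArrayList<>();

    // Phase 1: cut the input into sorted runs of at most maxLinesInMemory lines.
    try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
      List<String> buffer = new ArrayList<>();
      String line;
      while ((line = in.readLine()) != null) {
        buffer.add(line);
        if (buffer.size() >= maxLinesInMemory) {
          runs.add(writeRun(buffer));
          buffer.clear();
        }
      }
      if (!buffer.isEmpty()) {
        runs.add(writeRun(buffer));
      }
    }

    // Phase 2: k-way merge; only one line per run is held in memory.
    PriorityQueue<Run> heap = new PriorityQueue<>((x, y) -> x.head.compareTo(y.head));
    try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
      for (Path p : runs) {
        // Runs are never empty, so head is non-null here.
        heap.add(new Run(Files.newBufferedReader(p, StandardCharsets.UTF_8)));
      }
      String last = null;
      while (!heap.isEmpty()) {
        Run r = heap.poll();
        if (!r.head.equals(last)) { // skip duplicates of the line just written
          out.write(r.head);
          out.newLine();
          last = r.head;
        }
        r.head = r.reader.readLine();
        if (r.head != null) {
          heap.add(r);
        } else {
          r.reader.close();
        }
      }
    } finally {
      for (Run r : heap) {
        r.reader.close(); // only non-empty if the merge failed part-way
      }
      for (Path p : runs) {
        Files.deleteIfExists(p);
      }
    }
  }

  /** Sorts the buffered lines in memory and writes them to a temporary run file. */
  private static Path writeRun(List<String> lines) throws IOException {
    Collections.sort(lines);
    Path run = Files.createTempFile("run", ".txt");
    Files.write(run, lines, StandardCharsets.UTF_8);
    return run;
  }
}
```

A call such as `ExternalSortSketch.sort(Paths.get("a.txt"), Paths.get("a.sorted.txt"), 1_000_000)` would keep at most about a million lines in memory at a time. All runs are opened at once during the merge, so a very small `maxLinesInMemory` on a 4 GB file could hit the operating system's open-file limit.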

2 Answers:

Answer 0: (score: 1)

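If the system has enough disk space, you can do this with a database. For example:

  • Create an H2 or SQLite database (data stored on disk, with as much cache allocated as the system can afford)
  • Load the text files into tables A and B (create an index on the "URL" column)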

select url from A where URL not in (select distinct url from B)
select url from B where URL not in (select distinct url from A)

Answer 1: (score: 0)

Here is the gist of the solution I came up with: https://gist.github.com/nddipiazza/16cb2a0d23ee60a07121893c26065de4

import com.google.common.collect.Sets;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;

import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class DiffTextFilesUtil {
  static public int CHUNK_SIZE = 100000;

  static public class DiffResult {
    public Set<String> addedVals = new HashSet<>();
    public Set<String> removedVals = new HashSet<>();
  }

  /**
   * Computes the diff of two sorted text files.
   * Both files must be sorted in String.compareTo order and contain no duplicate lines.
   * @param lhs left hand file - sort this using com.google.code.externalsortinginjava:externalsortinginjava:0.2.5
   * @param rhs right hand file - sort this using com.google.code.externalsortinginjava:externalsortinginjava:0.2.5
   * @return DiffResult.addedVals holds lines present in rhs but not in lhs. DiffResult.removedVals holds lines present in lhs but not in rhs.
   * @throws IOException if either file cannot be read
   */
  public static DiffResult diff(File lhs, File rhs) throws IOException {
    DiffResult diffResult = new DiffResult();

    LineIterator lhsIter = FileUtils.lineIterator(lhs);
    try {
      LineIterator rhsIter = FileUtils.lineIterator(rhs);
      try {
        // Line read from rhs but not yet assigned to a chunk; null means "read the next one".
        String rhsTop = null;
        while (lhsIter.hasNext()) {
          Set<String> setLhs = new HashSet<>();
          Set<String> setRhs = new HashSet<>();

          // Read up to CHUNK_SIZE lines from lhs; lhsTop ends up as the largest line of the chunk.
          String lhsTop = null;
          for (int ct = 0; ct < CHUNK_SIZE && lhsIter.hasNext(); ct++) {
            lhsTop = lhsIter.nextLine();
            setLhs.add(lhsTop);
          }

          // Collect every rhs line that sorts at or before lhsTop.
          while (rhsTop != null || rhsIter.hasNext()) {
            if (rhsTop == null) {
              rhsTop = rhsIter.nextLine();
            }
            if (rhsTop.compareTo(lhsTop) > 0) {
              break; // belongs to a later lhs chunk, so keep it buffered
            }
            setRhs.add(rhsTop);
            rhsTop = null;
          }

          Sets.difference(setLhs, setRhs).copyInto(diffResult.removedVals);
          Sets.difference(setRhs, setLhs).copyInto(diffResult.addedVals);
        }

        // lhs is exhausted: everything still left in rhs was added.
        if (rhsTop != null) {
          diffResult.addedVals.add(rhsTop);
        }
        while (rhsIter.hasNext()) {
          diffResult.addedVals.add(rhsIter.nextLine());
        }
      } finally {
        rhsIter.close();
      }
    } finally {
      lhsIter.close();
    }
    return diffResult;
  }
}
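
The class above collects its results in in-memory sets, which can itself run out of memory when the two files differ a lot. A possible variation, sketched below using only the JDK, walks both sorted files in a single merge-style pass and writes the differences straight to two output files, so only one line per input is held at a time. `SortedFileDiff` and its parameter names are hypothetical, and the sketch assumes both inputs are UTF-8 and already sorted in `String.compareTo` order.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration of a streaming diff over two sorted files.
final class SortedFileDiff {

  /**
   * Writes lines found only in a to onlyInA and lines found only in b to onlyInB.
   * Assumes both inputs are sorted in String.compareTo order.
   */
  static void diff(Path a, Path b, Path onlyInA, Path onlyInB) throws IOException {
    try (BufferedReader ra = Files.newBufferedReader(a, StandardCharsets.UTF_8);
         BufferedReader rb = Files.newBufferedReader(b, StandardCharsets.UTF_8);
         BufferedWriter wa = Files.newBufferedWriter(onlyInA, StandardCharsets.UTF_8);
         BufferedWriter wb = Files.newBufferedWriter(onlyInB, StandardCharsets.UTF_8)) {
      String la = ra.readLine();
      String lb = rb.readLine();
      while (la != null || lb != null) {
        // An exhausted side compares as "larger" so the other side drains.
        int cmp = (la == null) ? 1 : (lb == null) ? -1 : la.compareTo(lb);
        if (cmp < 0) {
          // la sorts before everything left in b, so it cannot appear there.
          wa.write(la);
          wa.newLine();
          la = ra.readLine();
        } else if (cmp > 0) {
          // lb sorts before everything left in a, so it cannot appear there.
          wb.write(lb);
          wb.newLine();
          lb = rb.readLine();
        } else {
          // Same line on both sides: skip it, including any repeats.
          String same = la;
          while (la != null && la.equals(same)) {
            la = ra.readLine();
          }
          while (lb != null && lb.equals(same)) {
            lb = rb.readLine();
          }
        }
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // Usage: java SortedFileDiff a.sorted b.sorted only-in-a.txt only-in-b.txt
    diff(Paths.get(args[0]), Paths.get(args[1]), Paths.get(args[2]), Paths.get(args[3]));
  }
}
```

A line that is repeated in one file and absent from the other is written once per occurrence; deduplicating during the sort step avoids that.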