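I have the following situation:
Each file is about 4 GB in size.
I need to compute:
All the Java diff examples I found online load the entire list into memory (using a Map or an MMap solution). My system has no swap, and it does not have enough memory to do this without external storage.
Does anyone know of a solution?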
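This project can sort very large files without using much memory: https://github.com/lemire/externalsortinginjava
I am looking for something similar, but one that produces a diff. I will start by trying to implement this using that project as a baseline.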
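Answer 0 (score: 1)
If the system has enough storage, you can do this with a DB. For example:
Create an H2 or SQLite DB (data stored on disk; allocate as much cache as the system can afford).
Load the text files into tables A and B (create an index on the "URL" column).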
select url from A where URL not in (select distinct url from B)
select url from B where URL not in (select distinct url from A)
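The loading step could be sketched in plain JDBC roughly as follows. This is only an illustration: the class name UrlDiffDb, the batch size, and the varchar length are assumptions, and the caller is expected to supply a Connection from whichever driver (H2, SQLite) is on the classpath. The file is streamed line by line, so only one batch of rows is held in memory at a time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper, not part of any library.
class UrlDiffDb {
    static final int BATCH_SIZE = 10_000; // assumed value, tune to the available memory

    /** Builds the query from the answer: URLs in `left` that are absent from `right`. */
    static String onlyInQuery(String left, String right) {
        return "select url from " + left
                + " where url not in (select distinct url from " + right + ")";
    }

    /** Streams a text file into a one-column table, then indexes the column. */
    static void load(Connection conn, String table, Path file)
            throws SQLException, IOException {
        try (Statement st = conn.createStatement()) {
            st.execute("create table " + table + " (url varchar(2048))");
        }
        conn.setAutoCommit(false);
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8);
             PreparedStatement ps = conn.prepareStatement(
                     "insert into " + table + " (url) values (?)")) {
            int pending = 0;
            for (String line; (line = in.readLine()) != null; ) {
                ps.setString(1, line);
                ps.addBatch();
                if (++pending == BATCH_SIZE) { // flush a full batch to disk
                    ps.executeBatch();
                    conn.commit();
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch();
            }
            conn.commit();
        }
        // Index after the bulk load so inserts do not pay for index maintenance.
        try (Statement st = conn.createStatement()) {
            st.execute("create index idx_" + table + "_url on " + table + " (url)");
        }
        conn.commit();
    }
}
```

With H2 on the classpath, a connection could be opened with DriverManager.getConnection("jdbc:h2:./urldiff") (the database name is an assumption). Call load for tables A and B, then run onlyInQuery("A", "B") and onlyInQuery("B", "A") to get the two result sets.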
Answer 1 (score: 0)
Here is a gist of the solution I came up with: https://gist.github.com/nddipiazza/16cb2a0d23ee60a07121893c26065de4
import com.google.common.collect.Sets;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;
import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
public class DiffTextFilesUtil {
    /** Number of lhs lines compared per pass; bounds the size of the per-chunk lhs set. */
    static public int CHUNK_SIZE = 100000;

    static public class DiffResult {
        public Set<String> addedVals = new HashSet<>();
        public Set<String> removedVals = new HashSet<>();
    }

    /**
     * Gets diff result of two sorted files with each other.
     * Both files must be sorted in String.compareTo order and contain no duplicate lines.
     * @param lhs left hand file - sort this using com.google.code.externalsortinginjava:externalsortinginjava:0.2.5
     * @param rhs right hand file - sort this using com.google.code.externalsortinginjava:externalsortinginjava:0.2.5
     * @return DiffResult.addedVals holds lines present in rhs but not in lhs. DiffResult.removedVals holds lines present in lhs but not in rhs.
     * @throws IOException if either file cannot be read
     */
    public static DiffResult diff(File lhs, File rhs) throws IOException {
        DiffResult diffResult = new DiffResult();
        LineIterator lhsIter = FileUtils.lineIterator(lhs);
        LineIterator rhsIter = FileUtils.lineIterator(rhs);
        try {
            // rhsTop is the next unconsumed rhs line, or null once rhs is exhausted.
            String rhsTop = rhsIter.hasNext() ? rhsIter.nextLine() : null;
            while (lhsIter.hasNext()) {
                Set<String> setLhs = Sets.newHashSet();
                Set<String> setRhs = Sets.newHashSet();
                String lhsTop = null;
                // Read the next chunk of lhs; lhsTop ends up as the largest line in the chunk.
                for (int ct = 0; ct < CHUNK_SIZE && lhsIter.hasNext(); ct++) {
                    lhsTop = lhsIter.nextLine();
                    setLhs.add(lhsTop);
                }
                // Consume every rhs line that sorts at or before lhsTop. A larger line
                // stays in rhsTop so that it is compared against the next lhs chunk.
                while (rhsTop != null && rhsTop.compareTo(lhsTop) <= 0) {
                    setRhs.add(rhsTop);
                    rhsTop = rhsIter.hasNext() ? rhsIter.nextLine() : null;
                }
                Sets.difference(setLhs, setRhs).copyInto(diffResult.removedVals);
                Sets.difference(setRhs, setLhs).copyInto(diffResult.addedVals);
            }
            // lhs is exhausted, so every remaining rhs line counts as added.
            while (rhsTop != null) {
                diffResult.addedVals.add(rhsTop);
                rhsTop = rhsIter.hasNext() ? rhsIter.nextLine() : null;
            }
        } finally {
            lhsIter.close();
            rhsIter.close();
        }
        return diffResult;
    }
}
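
Because both inputs are already sorted, the same diff could also be done as a line-by-line merge that streams its output instead of collecting results in sets. The following is a minimal sketch of that alternative, not the code from the gist. The class name SortedDiff and the writer parameters are hypothetical, and it assumes both inputs are sorted in String.compareTo order and contain no duplicate lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;

// Hypothetical helper: holds only one line per input in memory at any time.
class SortedDiff {
    /**
     * Writes lines found only in lhs to `removed` and lines found only in rhs
     * to `added`, one per line.
     */
    static void diff(BufferedReader lhs, BufferedReader rhs,
                     Writer removed, Writer added) throws IOException {
        String l = lhs.readLine();
        String r = rhs.readLine();
        while (l != null && r != null) {
            int cmp = l.compareTo(r);
            if (cmp < 0) {            // l sorts before r, so rhs cannot contain l
                removed.write(l);
                removed.write('\n');
                l = lhs.readLine();
            } else if (cmp > 0) {     // r sorts before l, so lhs cannot contain r
                added.write(r);
                added.write('\n');
                r = rhs.readLine();
            } else {                  // present in both files, skip it
                l = lhs.readLine();
                r = rhs.readLine();
            }
        }
        // Whatever is left in one input has no counterpart in the other.
        for (; l != null; l = lhs.readLine()) {
            removed.write(l);
            removed.write('\n');
        }
        for (; r != null; r = rhs.readLine()) {
            added.write(r);
            added.write('\n');
        }
    }
}
```

For the 4 GB files, the readers could come from Files.newBufferedReader on the two sorted files and the writers from Files.newBufferedWriter, so that neither the inputs nor the diff output has to fit in memory.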