Question

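In my case, I need to compare two large HashSets and find the differences using removeAll. To do that, I have to bring all the data from the different data sources into memory and then compare. When each HashSet may contain more than 3 million records, this causes an out-of-memory problem. Is there any approach or library that reduces memory consumption while still giving the same result?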

Answer 1

Note that if the data is sorted, you can process it in a single pass using very little extra memory:

i <- 0
j <- 0
while i < list1.size() and j < list2.size():
    if list1[i] == list2[j]:        // present in both lists, skip it
        i <- i+1
        j <- j+1
    else if list1[i] < list2[j]:    // list1[i] is definitely not in list2
        yield list1[i]
        i <- i+1
    else:                           // list2[j] is not in list1
        yield list2[j]
        j <- j+1
yield all elements in list1 from i to list1.size(), if any remain
yield all elements in list2 from j to list2.size(), if any remain
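The pseudocode above could be translated to Java roughly as follows. This is only a sketch under some assumptions that are not in the original answer: both sources are exposed as Iterators that return elements in ascending order, contain no duplicates and no nulls, and the class name SortedDiff and the sink callback are made up for illustration.

```java
import java.util.Iterator;
import java.util.function.Consumer;

class SortedDiff {
    /**
     * Emits every element that occurs in exactly one of the two inputs.
     * Assumes both iterators are sorted ascending, duplicate-free and null-free.
     */
    static <T extends Comparable<? super T>> void symmetricDifference(
            Iterator<T> first, Iterator<T> second, Consumer<? super T> sink) {
        T a = next(first);
        T b = next(second);
        while (a != null && b != null) {
            int cmp = a.compareTo(b);
            if (cmp == 0) {          // in both inputs: skip
                a = next(first);
                b = next(second);
            } else if (cmp < 0) {    // a cannot appear later in second
                sink.accept(a);
                a = next(first);
            } else {                 // b cannot appear later in first
                sink.accept(b);
                b = next(second);
            }
        }
        // Whatever is left in either input has no counterpart.
        for (; a != null; a = next(first)) sink.accept(a);
        for (; b != null; b = next(second)) sink.accept(b);
    }

    // Returns the next element, or null when the iterator is exhausted.
    private static <T> T next(Iterator<T> it) {
        return it.hasNext() ? it.next() : null;
    }
}
```

Only one element per source is held at a time, so memory use does not depend on the input size. If the data comes from a database, the sorting can usually be pushed into the query with ORDER BY.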

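An alternative that uses hashing needs only one of the lists to be loaded (assuming the data are sets, as stated in the question, so no duplicate handling is needed):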

load list1 as hash1
for each x in list2:
    if x is in hash1:
         hash1.remove(x)
    else:
         yield x
yield all remaining elements in hash1
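In Java this might look like the sketch below. It assumes the first source has already been loaded into a HashSet and the second is available as an Iterator; the class name StreamingHashDiff and the sink callback are hypothetical. Note that the loaded set is modified in place.

```java
import java.util.Iterator;
import java.util.Set;
import java.util.function.Consumer;

class StreamingHashDiff {
    /**
     * Emits every element that occurs in exactly one of the two inputs.
     * Only 'loaded' is held in memory; 'streamed' is read once, element by element.
     * WARNING: 'loaded' is mutated (matched elements are removed from it).
     */
    static <T> void symmetricDifference(
            Set<T> loaded, Iterator<T> streamed, Consumer<? super T> sink) {
        while (streamed.hasNext()) {
            T x = streamed.next();
            // remove() returns false when x was not in the loaded set,
            // which means x exists only in the streamed source.
            if (!loaded.remove(x)) {
                sink.accept(x);
            }
        }
        // Elements never matched exist only in the loaded source.
        loaded.forEach(sink);
    }
}
```

Compared with calling removeAll on two fully loaded sets, peak memory is roughly halved, because the second source never has to be materialized.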

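Note that if even a single list does not fit in memory, you can split the data and apply the second approach iteratively, one chunk at a time.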

Answer 2

What you describe calls for a hash join, as used in databases: What is the difference between a hash join and a merge join (Oracle RDBMS )?

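In short, to reduce memory consumption you can partition the data by hash value. A very basic example: take one hash frame from both sets, i.e. the objects whose hash lies between some values h1 and h2, and compare them. Then compare the objects whose hash lies between h2 and h3, and so on. These h1, h2, ... hN may be easy to find, like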

h[i] = i * ((long) Integer.MAX_VALUE - Integer.MIN_VALUE) / N;

or not; it depends on the data you have and on the hash function.

This solution needs O(DB_SIZE / N) memory and O(DB_SIZE * N) record fetch operations. So with N = 4 it scans the database 4 times and cuts memory consumption by a factor of 4.
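A minimal Java sketch of this partitioned comparison follows. It is not the answerer's code: it assigns partitions with Math.floorMod(hashCode, n) instead of the range boundaries h[i], which is a swapped-in simplification. It also assumes each source can be re-read from the start through a Supplier of Streams; the class name PartitionedDiff and the sink callback are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Supplier;
import java.util.stream.Stream;

class PartitionedDiff {
    /**
     * Emits every element that occurs in exactly one of the two sources,
     * holding only about 1/n of the first source in memory at a time.
     * Each source is scanned n times, once per partition.
     */
    static <T> void symmetricDifference(
            Supplier<Stream<T>> first, Supplier<Stream<T>> second,
            int n, Consumer<? super T> sink) {
        for (int p = 0; p < n; p++) {
            final int part = p;
            Set<T> bucket = new HashSet<>();
            // Pass over the first source: keep only this partition's elements.
            try (Stream<T> s = first.get()) {
                s.filter(x -> Math.floorMod(x.hashCode(), n) == part)
                 .forEach(bucket::add);
            }
            // Pass over the second source: equal elements share a hash code,
            // so any match must be in the same partition.
            try (Stream<T> s = second.get()) {
                s.filter(x -> Math.floorMod(x.hashCode(), n) == part)
                 .forEach(x -> { if (!bucket.remove(x)) sink.accept(x); });
            }
            // Unmatched elements of this partition exist only in the first source.
            bucket.forEach(sink);
        }
    }
}
```

How evenly the partitions are filled depends on the hash function, as noted above. With a database source, the filter could instead be pushed into the query so that each scan fetches only one partition.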

Answer 3

You could first filter out the records whose MyRecord.hashCode() is not common to both tables, using a Set<Integer>, and then use a Set<MyRecord>.

// Determine common hashCodes:

Set<Integer> hashCodes = new HashSet<>();
for (MyRecord record : readFirstTable()) {
    hashCodes.add(record.hashCode());
}

Set<Integer> commonHashCodes = new HashSet<>();
for (MyRecord record : readSecondTable()) {
    int hashCode = record.hashCode();
    if (hashCodes.remove(hashCode)) {
        commonHashCodes.add(hashCode);
    }
}
hashCodes = null;

// Determine common records:

Set<MyRecord> records = new HashSet<>();
for (MyRecord record : readFirstTable()) {
    if (commonHashCodes.contains(record.hashCode())) {
        records.add(record);
    }
}
Set<MyRecord> commonRecords = new HashSet<>();
for (MyRecord record : readSecondTable()) {
    if (records.remove(record)) {
        commonRecords.add(record);
    }
}
commonHashCodes = null;
records = null;
return commonRecords;

An efficient way to find the difference between two large sets in Java
