Question

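I have two very large ArrayLists, each containing millions of items. I want to filter out of List1 the data that does not exist in List2, and/or vice versa.

I have tried Apache CollectionUtils and the Java 8 Stream API, without success.

Java 8 parallel streams consume all the CPU, and CollectionUtils keeps comparing the data sets without producing any output.

POJO example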

public class DataVO {
 private String id;
 private String value;
 ...
 // getters / setters

 @Override
 public int hashCode() {
  final int prime = 31;
  int result = 1;
  result = (prime * result) + ((id == null) ? 0 : id.hashCode());
  return result;
 }

 @Override
 public boolean equals(final Object obj) {
  ...
  ...
  final DataVO other = (DataVO) obj;
  if (id == null) {
   if (other.id != null) {
    return false;
   }
  }
  else if (!id.equals(other.id)) {
   return false;
  }
  return true;
 }
}

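hashCode() / equals() may cover more fields; I am keeping it simple for now.

I also tried splitting List1 into smaller chunks and comparing each chunk against List2, with no result. I have looked at other questions, but none of them deal with data at this scale.

Please let me know if you have any pointers.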

Answer 1

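You can read large chunks of the ArrayList into a HashSet, for example 10k elements at a time. Make sure to set the size in the HashSet constructor. Then, for each chunk, call HashSet#removeAll with the other ArrayList. The remaining entries are your answer. This can even be parallelized with a ThreadPoolExecutor.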

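// Assumes list1 and list2 are List<DataVO>; needs java.util.ArrayList, HashSet, List, Set.
List<DataVO> missing = new ArrayList<>(); // answer: elements of list1 that are not in list2

for (int i = 0; i < list1.size(); ) {
    int offset = i;
    i += 16 * 1024; // chunk size
    if (i > list1.size()) i = list1.size();
    // The copy constructor sizes the set to hold the whole chunk.
    Set<DataVO> chunk = new HashSet<>(list1.subList(offset, i));

    // Remove everything that also occurs in list2; what remains is missing from list2.
    for (int j = list2.size(); --j >= 0; ) chunk.remove(list2.get(j));
    missing.addAll(chunk);
}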
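
The ThreadPoolExecutor idea mentioned above could look roughly like the sketch below. It is only an illustration under some assumptions: the class name `ChunkedDiff`, the method `missingFrom`, and the `threads` parameter are hypothetical names, and both lists are assumed not to be modified while the tasks run. It keeps the explicit `remove` loop, because `AbstractSet.removeAll` falls back to calling `contains` on its argument when the set is the smaller collection, and `contains` is a linear scan on an ArrayList.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper class; not part of the original answer.
class ChunkedDiff {

    // Same chunk size as the loop above.
    static final int CHUNK_SIZE = 16 * 1024;

    /**
     * Returns the elements of list1 that have no equal element in list2.
     * Duplicates are collapsed within a chunk, but not across chunks.
     */
    static <T> List<T> missingFrom(List<T> list1, List<T> list2, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Set<T>>> futures = new ArrayList<>();
            for (int start = 0; start < list1.size(); start += CHUNK_SIZE) {
                final int from = start;
                final int to = Math.min(start + CHUNK_SIZE, list1.size());
                // One task per chunk; tasks only read the two lists.
                futures.add(pool.submit(() -> {
                    // The copy constructor allocates enough capacity for the chunk.
                    Set<T> chunk = new HashSet<>(list1.subList(from, to));
                    for (T item : list2) {
                        chunk.remove(item);
                    }
                    return chunk;
                }));
            }

            // Collect the per-chunk leftovers in chunk order.
            List<T> missing = new ArrayList<>();
            for (Future<Set<T>> future : futures) {
                try {
                    missing.addAll(future.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return missing;
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling `ChunkedDiff.missingFrom(list1, list2, 4)` gives the entries of list1 that are absent from list2; swapping the two arguments gives the opposite direction. Each task still walks all of list2, so the total work is the same as in the sequential loop and only the wall-clock time is spread over the pool.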

Comparing large lists and extracting the missing data