Question

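What is the best method or algorithm for comparing two large lists of email addresses in a short time?

The idea is to detect how many addresses from list A can also be found in list B.

The lists are not the same size. I tried a fuzzy checksum, but that only helps when the lists are the same size (in my case they are not).

I thought about a Hadoop solution, but unfortunately I am a Hadoop beginner. Does anyone have ideas, examples, solutions, or tutorials?

Thanks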

Answer 1

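If you treat each list as a set, the common addresses are given by the set intersection. The 'unique' addresses (those that appear in only one list) are given by: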

(set1 ∪ set2) \ (set1 ∩ set2)

This is easy to do in any high-level language such as Java; see, for example, Apache CollectionUtils.intersection().
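To make the two set expressions concrete, here is a minimal sketch that uses only java.util rather than the Apache helper. The class name EmailSetOps and the method names common and unique are made up for illustration, and it assumes both lists fit in memory.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, not part of any library.
class EmailSetOps {
    // Addresses present in both lists: the intersection of set1 and set2.
    static Set<String> common(Collection<String> a, Collection<String> b) {
        Set<String> result = new HashSet<>(a);
        result.retainAll(new HashSet<>(b)); // keep only elements that are also in b
        return result;
    }

    // Addresses present in exactly one list: the union minus the intersection.
    static Set<String> unique(Collection<String> a, Collection<String> b) {
        Set<String> result = new HashSet<>(a);
        result.addAll(b);               // union of both lists
        result.removeAll(common(a, b)); // drop everything the lists share
        return result;
    }

    public static void main(String[] args) {
        List<String> l1 = Arrays.asList("a@b.com", "1@2.com");
        List<String> l2 = Arrays.asList("1@2.com", "asd@f.com", "qwer@ty.com");
        System.out.println(common(l1, l2).size() + " common address(es)");
        System.out.println(unique(l1, l2).size() + " address(es) in only one list");
    }
}
```

Calling common(l1, l2).size() answers the "how many addresses of A are in B" part of the question directly.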

If the lists are not too large (they fit in memory), it can be done in memory as follows (Java code):

    import java.util.*;

    public class CommonMails {
        public static void main(String[] args) {
            // The first two lines are just test data, not part of the algorithm.
            List<String> l1 = Arrays.asList("a@b.com", "1@2.com");
            List<String> l2 = Arrays.asList("1@2.com", "asd@f.com", "qwer@ty.com");
            // Index the first list in a hash set for O(1) average lookups.
            Set<String> s1 = new HashSet<>(l1);
            for (String s : l2) {
                if (s1.contains(s)) System.out.println(s); // address is in both lists
            }
        }
    }

If you want to use Hadoop, the common addresses can be found like this:

    map(set):
        for each mail in list:
            emit(mail, '1')
    reduce(mail, list<1>):
        if size(list) > 1:
            emit(mail)
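By calling map on both sets and reducing the mappers' output, you get the common elements. This relies on each input being a set: an address duplicated within a single list would also produce a count greater than 1.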
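The map/reduce idea above can be tried out without a cluster. The following is a plain-Java simulation, not Hadoop API code: a TreeMap stands in for the shuffle step, and the class and method names are hypothetical. As one deliberate change from the pseudocode, the mapper emits a list tag ("A" or "B") instead of '1', so that duplicates inside a single list are not mistaken for common addresses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// In-memory stand-in for a map/reduce job; names are illustrative only.
class CommonMailsMapReduce {
    // Map phase: emit one (mail, listId) pair per address.
    // Assumption: tagging with listId replaces the '1' of the pseudocode.
    static void map(List<String> list, String listId, Map<String, List<String>> shuffle) {
        for (String mail : list) {
            shuffle.computeIfAbsent(mail, k -> new ArrayList<>()).add(listId);
        }
    }

    // Reduce phase: keep a mail only if it was emitted from more than one list.
    static List<String> reduce(Map<String, List<String>> shuffle) {
        List<String> common = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : shuffle.entrySet()) {
            Set<String> sources = new HashSet<>(entry.getValue());
            if (sources.size() > 1) {
                common.add(entry.getKey());
            }
        }
        return common;
    }

    static List<String> commonMails(List<String> l1, List<String> l2) {
        // Groups values by key, as the shuffle/sort step would between map and reduce.
        Map<String, List<String>> shuffle = new TreeMap<>();
        map(l1, "A", shuffle);
        map(l2, "B", shuffle);
        return reduce(shuffle);
    }

    public static void main(String[] args) {
        List<String> l1 = Arrays.asList("a@b.com", "a@b.com", "1@2.com");
        List<String> l2 = Arrays.asList("1@2.com", "asd@f.com", "qwer@ty.com");
        System.out.println(commonMails(l1, l2));
    }
}
```

In a real Hadoop job the grouping by key would be done by the framework; only the bodies of map and reduce would carry over.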

Answer 2

Would this do the job for you? It should be O(n).

Create an empty hash set for the intersection with a hash function that doesn't collide over email addresses
Create an empty hash set for the first difference hash set with a similar hash function
Create an empty hash set for the second difference hash set with a similar hash function
Iterate through the first list:
    Add the current element to the first difference hash set
End Iterate
Iterate through the second list:
    If the current element exists in the intersection hash set:
        Remove the current element from the first difference hash set
        Remove the current element from the second difference hash set
    Else If the current element exists in the first difference hash set:
        Remove the current element from the first difference hash set
        Remove the current element from the second difference hash set
        Add the current element to the intersection hash set
    Else:
        Add the current element to the second difference hash set
    End If
End Iterate
Process the intersection hash set as the solution

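The nice thing about it is that it gives you both the intersection and the differences. It can be extended to track the differences between any number of lists.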
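The steps above could be sketched in Java roughly as follows. The class name ListComparison and its field names are hypothetical, and java.util.HashSet with String's built-in hash is assumed to be good enough in place of the collision-free hash function the pseudocode asks for.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative class: holds the intersection and both differences of two lists.
class ListComparison {
    final Set<String> intersection = new HashSet<>();
    final Set<String> onlyInFirst = new HashSet<>();
    final Set<String> onlyInSecond = new HashSet<>();

    ListComparison(List<String> first, List<String> second) {
        // Start by treating every address of the first list as unique to it.
        onlyInFirst.addAll(first);
        for (String mail : second) {
            if (intersection.contains(mail)) {
                // Already known to be common (a repeat within the second list).
                continue;
            }
            if (onlyInFirst.remove(mail)) {
                // Seen in both lists: move it into the intersection.
                onlyInSecond.remove(mail); // mirrors the pseudocode; a no-op for two lists
                intersection.add(mail);
            } else {
                onlyInSecond.add(mail);
            }
        }
    }

    public static void main(String[] args) {
        List<String> l1 = Arrays.asList("a@b.com", "1@2.com");
        List<String> l2 = Arrays.asList("1@2.com", "asd@f.com", "qwer@ty.com");
        ListComparison cmp = new ListComparison(l1, l2);
        System.out.println("common: " + cmp.intersection);
        System.out.println("only in first: " + cmp.onlyInFirst);
        System.out.println("only in second: " + cmp.onlyInSecond);
    }
}
```

Each element of either list is touched a constant number of times with average O(1) set operations, which is where the O(n) estimate comes from.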
