Question

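I have a List of Person objects and I want to find duplicate entries, taking all fields except id into account. So I cannot use the equals() method (and, as a result, List.contains()), because they take id into account.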

public class Person {
    private String firstname, lastname;
    private int age;
    private long id;
}

Modifying the equals() and hashCode() methods to ignore the id field is not an option, because other parts of the code rely on them.

What is the most efficient way in Java to find the duplicates if I want to ignore the id field?

Answer 1

Build a Comparator<Person> that implements your natural-key ordering, and then use binary-search-based deduplication. TreeSet gives you this ability out of the box.

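Note that Comparator<T>.compare(a, b) must fulfil the usual antisymmetry, transitivity, consistency and reflexivity requirements, or the binary-search ordering will fail. You should also make it null-aware (for example, if the firstname field of one, the other, or both is null).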
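As a small illustration of how easily these requirements are broken, the sketch below compares ints by subtraction, which can overflow and report the wrong sign. The class and field names are made up for the example. For realistic ages the subtraction is harmless, but Integer.compare avoids the problem entirely.

```java
import java.util.Comparator;

class SubtractionPitfall {
    // Tempting but unsafe: a - b can overflow the int range.
    static final Comparator<Integer> BY_SUBTRACTION = (a, b) -> a - b;

    // Safe: Integer.compare never overflows.
    static final Comparator<Integer> BY_COMPARE = Integer::compare;

    public static void main(String[] args) {
        int small = Integer.MIN_VALUE;
        int big = 1;
        // small - big wraps around to Integer.MAX_VALUE, so small is reported as greater.
        System.out.println(BY_SUBTRACTION.compare(small, big) > 0); // true, which is wrong
        System.out.println(BY_COMPARE.compare(small, big) < 0);     // true, which is correct
    }
}
```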

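A simple natural-key comparator for your Person class follows. (It is a static member class because you haven't shown whether you have accessors for each field.)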

import java.util.Comparator;

public class Person {
    public static class NkComparator implements Comparator<Person>
    {
        public int compare(Person p1, Person p2)
        {
            if (p1 == null || p2 == null) throw new NullPointerException();
            if (p1 == p2) return 0;
            int i = nullSafeCompareTo(p1.firstname, p2.firstname);
            if (i != 0) return i;
            i = nullSafeCompareTo(p1.lastname, p2.lastname);
            if (i != 0) return i;
            return Integer.compare(p1.age, p2.age); // avoids the overflow risk of subtraction
        }
        private static int nullSafeCompareTo(String s1, String s2)
        {
            return (s1 == null)
                    ? (s2 == null) ? 0 : -1
                    : (s2 == null) ? 1 : s1.compareTo(s2);
        }
    }
    private String firstname, lastname;
    private int age;
    private long id;
}

You can then use it to generate a unique list. Use the add method, which returns true if and only if the element was not already in the set:

List<Person> newList = new ArrayList<Person>();
TreeSet<Person> nkIndex = new TreeSet<Person>(new Person.NkComparator());
for (Person p : originalList)
    if (nkIndex.add(p)) newList.add(p); // to generate a unique list

or swap the last line for this one to output the duplicates instead

    if (!nkIndex.add(p)) newList.add(p); // to generate a list of duplicates

Whatever you do, don't call remove on your original list while you are enumerating it. That is why these methods add your unique elements to a new list.

If you are only interested in a unique list and want to use as few lines as possible:

TreeSet<Person> set = new TreeSet<Person>(new Person.NkComparator());
set.addAll(originalList);
List<Person> newList = new ArrayList<Person>(set);
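Putting the pieces of this answer together, here is a self-contained sketch that splits a list into unique entries and duplicates in one pass. It assumes Java 8 or later and builds the null-aware ordering with the Comparator combinators instead of the hand-written class above. The nested Person class, its constructor, and the names NaturalKeyDedup, NATURAL_KEY and split are stand-ins invented for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class NaturalKeyDedup {
    // Hypothetical stand-in for the question's Person class.
    static class Person {
        final String firstname, lastname;
        final int age;
        final long id;

        Person(String firstname, String lastname, int age, long id) {
            this.firstname = firstname;
            this.lastname = lastname;
            this.age = age;
            this.id = id;
        }
    }

    // Null-aware String ordering: null sorts before any non-null value.
    static final Comparator<String> NULLS_FIRST =
            Comparator.nullsFirst(Comparator.naturalOrder());

    // Natural-key ordering on firstname, lastname, age; id is ignored.
    static final Comparator<Person> NATURAL_KEY =
            Comparator.comparing((Person p) -> p.firstname, NULLS_FIRST)
                    .thenComparing(p -> p.lastname, NULLS_FIRST)
                    .thenComparingInt(p -> p.age);

    // First occurrence of each natural key goes to 'unique', later ones to 'duplicates'.
    static void split(List<Person> original, List<Person> unique, List<Person> duplicates) {
        TreeSet<Person> seen = new TreeSet<>(NATURAL_KEY);
        for (Person p : original) {
            if (seen.add(p)) unique.add(p);
            else duplicates.add(p);
        }
    }

    public static void main(String[] args) {
        List<Person> original = new ArrayList<>();
        original.add(new Person("Ann", "Lee", 30, 1L));
        original.add(new Person("Ann", "Lee", 30, 2L)); // same natural key as id 1
        original.add(new Person(null, "Lee", 30, 3L));  // null firstname is handled
        original.add(new Person("Bob", "Ray", 41, 4L));

        List<Person> unique = new ArrayList<>();
        List<Person> duplicates = new ArrayList<>();
        split(original, unique, duplicates);
        System.out.println(unique.size() + " unique, " + duplicates.size() + " duplicate");
    }
}
```

The original list is never modified, which sidesteps the remove-while-iterating problem mentioned above.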

Answer 2

As suggested in the comments by @LuiggiMendoza:

You can create a custom Comparator class that compares two Person objects for equality while ignoring their IDs.

class PersonComparator implements Comparator<Person> {

    // wraps the compareTo method to compare two Strings but also accounts for NPE
    int compareStrings(String a, String b) {
        if(a == b) {           // both strings are the same string or are null
            return 0;
        } else if(a == null) { // first string is null, result is negative
            return -1;
        } else if(b == null){  // second string is null, result is positive
            return 1;
        } else {               // no strings are null, return the result of compareTo
            return a.compareTo(b);
        }
    }

    @Override
    public int compare(Person p1, Person p2) {

        // comparisons on Person objects themselves
        if(p1 == p2) {                 // Person 1 and Person 2 are the same Person object
            return 0;
        }
        if(p1 == null && p2 != null) { // Person 1 is null and Person 2 is not, result is negative
            return -1;
        }
        if(p1 != null && p2 == null) { // Person 1 is not null and Person 2 is, result is positive
            return 1;
        }

        int result = 0;

        // comparisons on the attributes of the Persons objects
        result = compareStrings(p1.firstname, p2.firstname);
        if(result != 0) {   // Persons differ in first names, we can return the result
            return result;
        }
        result = compareStrings(p1.lastname, p2.lastname);
        if(result != 0) {  // Persons differ in last names, we can return the result
            return result;
        }

        return Integer.compare(p1.age, p2.age); // if both first name and last names are equal, the comparison difference is in their age
    }
}

Now you can use the TreeSet structure together with this custom Comparator, for example to write a simple method that eliminates duplicate values.

List<Person> getListWithoutDups(List<Person> list) {
    List<Person> newList = new ArrayList<Person>();
    TreeSet<Person> set = new TreeSet<Person>(new PersonComparator()); // use custom Comparator here

    // foreach Person in the list
    for(Person person : list) {
        // if the person isn't already in the set (meaning it's not a duplicate)
        // add it to the set and the new list
        if(!set.contains(person)) {
            set.add(person);
            newList.add(person);
        }
        // otherwise it's a duplicate so we don't do anything
    }

    return newList;
}

The contains operation in a TreeSet, as the documentation says, "provides guaranteed log(n) time cost".

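The method I suggested above takes O(n*log(n)) time, since we perform a contains operation for every list element, but it also uses O(n) space to create the new list and the TreeSet.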

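If your list is very large (so space matters a lot) but processing speed is not an issue, then instead of adding each non-duplicate to a new list, you can remove each duplicate that is found: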

List<Person> getListWithoutDups(List<Person> list) {
    TreeSet<Person> set = new TreeSet<Person>(new PersonComparator()); // use custom Comparator here

    Person person;
    // for every Person in the list
    for(int i = 0; i < list.size(); i++) {
        person = list.get(i);

        // if the person is already in the set (meaning it is a duplicate)
        // remove it from the list
        if(set.contains(person)) {
            list.remove(i);
            i--; // make sure to accommodate for the list shifting after removal
        }
        // otherwise add it to the set of non-duplicates
        else {
            set.add(person);
        }
    }
    return list;
}

Since each remove operation on the list takes O(n) time (because the list shifts every time an element is removed), and each contains operation takes log(n) time, every one of the n iterations costs at most O(n + log(n)), so this approach takes O(n^2) time in the worst case.

However, since we only create a TreeSet and not a second list, the extra space used is roughly halved.
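If Java 8 or later is available, the same in-place removal might be written with List.removeIf, which on an ArrayList (in the OpenJDK implementations I am aware of) compacts the list in a single pass instead of shifting elements on every removal. This is only a sketch: the class and method names are invented, it is generic so that it runs without the Person class, and it relies on the predicate being evaluated in list order, which holds for ArrayList in practice. With the PersonComparator above it would be called as removeDuplicates(list, new PersonComparator()).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class InPlaceDedup {
    // Removes from 'list' every element that compares equal (under 'cmp') to an earlier one.
    static <T> void removeDuplicates(List<T> list, Comparator<? super T> cmp) {
        TreeSet<T> seen = new TreeSet<>(cmp);
        // seen.add returns false when an equal element is already present, i.e. a duplicate.
        list.removeIf(element -> !seen.add(element));
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(Arrays.asList("Ann", "ann", "Bob", "ANN", "bob"));
        removeDuplicates(names, String.CASE_INSENSITIVE_ORDER);
        System.out.println(names); // [Ann, Bob]
    }
}
```

The extra space is still only the TreeSet, as in the loop above.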

Answer 3

I would advise against using a Comparator to do this. It is quite difficult to write a legitimate compare() method based on the other fields.

I think a better solution would be to create a class PersonWithoutId, like this:

public class PersonWithoutId {
  private String firstname, lastname;
  private int age;
  // no id field
  public PersonWithoutId(Person original) { /* copy fields from Person */ }
  @Override public boolean equals(Object other) { /* compare these 3 fields */ }
  @Override public int hashCode() { /* hash these 3 fields */ }
}

Then, if your List<Person> is called people, you can do this:

Set<PersonWithoutId> set = new HashSet<>();
for (Iterator<Person> i = people.iterator(); i.hasNext();) 
    if (!set.add(new PersonWithoutId(i.next())))
        i.remove();
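For completeness, here is one way the placeholders in PersonWithoutId could be filled in, together with the removal loop. This is a sketch under assumptions: the nested Person class and its constructor are stand-ins (the question does not show how the fields are accessed), and the wrapper names PersonWithoutIdDemo and removeDuplicates are made up.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Objects;
import java.util.Set;

class PersonWithoutIdDemo {
    // Hypothetical stand-in for the question's Person class.
    static class Person {
        final String firstname, lastname;
        final int age;
        final long id;

        Person(String firstname, String lastname, int age, long id) {
            this.firstname = firstname;
            this.lastname = lastname;
            this.age = age;
            this.id = id;
        }
    }

    // Key object whose equals/hashCode cover every field except id.
    static class PersonWithoutId {
        private final String firstname, lastname;
        private final int age;

        PersonWithoutId(Person original) {
            firstname = original.firstname;
            lastname = original.lastname;
            age = original.age;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof PersonWithoutId)) return false;
            PersonWithoutId other = (PersonWithoutId) o;
            return age == other.age
                    && Objects.equals(firstname, other.firstname)
                    && Objects.equals(lastname, other.lastname);
        }

        @Override public int hashCode() {
            return Objects.hash(firstname, lastname, age);
        }
    }

    // Removes later occurrences in place; Iterator.remove is safe during iteration.
    static void removeDuplicates(List<Person> people) {
        Set<PersonWithoutId> seen = new HashSet<>();
        for (Iterator<Person> i = people.iterator(); i.hasNext();)
            if (!seen.add(new PersonWithoutId(i.next())))
                i.remove();
    }

    public static void main(String[] args) {
        List<Person> people = new ArrayList<>();
        people.add(new Person("Ann", "Lee", 30, 1L));
        people.add(new Person("Ann", "Lee", 30, 2L)); // duplicate of id 1, ignoring id
        people.add(new Person("Bob", "Ray", 41, 3L));
        removeDuplicates(people);
        System.out.println(people.size()); // 2
    }
}
```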

Edit

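As others have pointed out in the comments, this solution is not ideal, because it creates a whole load of objects for the garbage collector to deal with. However, it is considerably faster than the solutions that use a Comparator and a TreeSet. Keeping a Set in order takes time, and ordering has nothing to do with the original problem. I tested this on Lists of 1,000,000 instances of Person constructed using

new Person(
    "" + rand.nextInt(500),  // firstname
    "" + rand.nextInt(500),  // lastname
    rand.nextInt(100),       // age
    rand.nextLong())         // id

and found this solution to be roughly twice as fast as a solution using a TreeSet. (Admittedly, I used System.nanoTime() rather than proper benchmarking.)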

So how can you do this efficiently without creating lots of unnecessary objects? Java doesn't make it easy. One way would be to write two new methods in Person

boolean equalsIgnoringId(Person other) { ... }
int hashCodeIgnoringId() { ... }

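and then to write a custom implementation of Set<Person>, in which you basically cut and paste the code of HashSet, except that you replace equals() and hashCode() with equalsIgnoringId() and hashCodeIgnoringId().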
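The two methods themselves could look roughly like the sketch below. The constructor is an assumption added so the example compiles on its own, and the custom Set implementation is not shown. The important property is that the two methods agree with each other: persons that are equal ignoring id must produce the same hash code ignoring id.

```java
import java.util.Objects;

class Person {
    private final String firstname, lastname;
    private final int age;
    private final long id;

    // Assumed constructor; the question does not show one.
    Person(String firstname, String lastname, int age, long id) {
        this.firstname = firstname;
        this.lastname = lastname;
        this.age = age;
        this.id = id;
    }

    // Equality on every field except id; the regular equals() is left untouched.
    boolean equalsIgnoringId(Person other) {
        return other != null
                && age == other.age
                && Objects.equals(firstname, other.firstname)
                && Objects.equals(lastname, other.lastname);
    }

    // Hash over the same three fields, so it stays consistent with equalsIgnoringId.
    int hashCodeIgnoringId() {
        return Objects.hash(firstname, lastname, age);
    }
}
```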

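In my opinion, the fact that you can create a TreeSet using a custom Comparator, but cannot create a HashSet using a custom version of equals/hashCode, is a serious flaw in the language.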

Answer 4

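You could use a Java HashMap with <K,V> pairs: Map<K,V> map = new HashMap<K,V>(). You will also need some form of Comparator implementation to go with it. If you check with the containsKey or containsValue methods and find that you already have something (i.e. you are trying to add a duplicate), keep it in your original list. Otherwise, pop it out. This way, you will end up with a list of the elements that were duplicates in your original list. TreeSet<,> would be another option, but I haven't used it yet, so I cannot offer advice.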
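One possible reading of this suggestion, sketched below, uses a HashMap keyed by the fields that matter and checks containsKey before inserting. This is an interpretation rather than the answerer's code: the key is a List built from firstname, lastname and age (List already implements equals and hashCode element by element), and the nested Person class and the names MapDuplicateFinder and findDuplicates are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapDuplicateFinder {
    // Hypothetical stand-in for the question's Person class.
    static class Person {
        final String firstname, lastname;
        final int age;
        final long id;

        Person(String firstname, String lastname, int age, long id) {
            this.firstname = firstname;
            this.lastname = lastname;
            this.age = age;
            this.id = id;
        }
    }

    // Returns every Person whose key (all fields except id) was already seen earlier in the list.
    static List<Person> findDuplicates(List<Person> people) {
        Map<List<Object>, Person> firstSeen = new HashMap<>();
        List<Person> duplicates = new ArrayList<>();
        for (Person p : people) {
            // Arrays.asList tolerates null elements, so null names are fine here.
            List<Object> key = Arrays.<Object>asList(p.firstname, p.lastname, p.age);
            if (firstSeen.containsKey(key)) duplicates.add(p);
            else firstSeen.put(key, p);
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<Person> people = new ArrayList<>();
        people.add(new Person("Ann", "Lee", 30, 1L));
        people.add(new Person("Ann", "Lee", 30, 2L));
        people.add(new Person("Bob", "Ray", 41, 3L));
        System.out.println(findDuplicates(people).size()); // 1
    }
}
```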

Finding duplicates in a list while ignoring a field
