Question

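A simple situation: I have a list of lists, almost like a table, and I am trying to find out whether any of the lists are duplicated.

Example: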

List<List<int>> list = new List<List<int>>(){
  new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
  new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
  new List<int>() {0 ,1 ,4, 2, 4, 5, 6 },
  new List<int>() {0 ,3 ,2, 5, 1, 6, 4 }
};

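I would like to learn that there are 4 items in total and that 2 of them are duplicates. I was thinking of doing something like a SQL checksum, but I don't know whether there is a better/easier way.

I care about performance, and I care about ordering.

Additional information that may help

Items inserted into this list will never be removed
Not bound to any specific collection
Don't care about the function signature
The element type is not restricted to int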

Answer 1

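Let's try to get the best performance. If n is the number of lists and m is the length of each list, then we can get O(n*m + n*log n + n), plus the cost of the occasional case where different lists produce the same hash code.

Major steps:

Calculate the hash codes*
Sort them
Walk through the sorted list to find dupes

* This is the important step. For simplicity you can calculate the hash as ... ^ (list[i] << i) ^ (list[i + 1] << (i + 1))

Edit: for those who think PLINQ can speed this up but a good algorithm cannot: PLINQ can be added here as well, because all of the steps are easily parallelizable.

My code: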

// Requires: using System; using System.Collections.Generic; using System.Linq;
static public void Main()
{
    List<List<int>> list = new List<List<int>>(){
      new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
      new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
      new List<int>() {0 ,1 ,4, 2, 4, 5, 6 },
      new List<int>() {0 ,3 ,2, 5, 1, 6, 4 }
    };
    var hashList = list.Select((l, ind) =>
    {
        uint hash = 0;
        for (int i = 0; i < l.Count; i++)
        {
            uint el = (uint)l[i];
            hash ^= (el << i) | (el >> (32 - i));
        }
        return new {hash, ind};
    }).OrderBy(l => l.hash).ToList();
    // hashList is sorted by hash, so lists with equal hashes are adjacent.
    if (hashList.Count == 0)
        return;
    uint prevHash = hashList[0].hash;
    int firstInd = 0;
    for (int i = 1; i <= hashList.Count; i++)
    {
        if (i == hashList.Count || hashList[i].hash != prevHash)
        {
            for (int n = firstInd; n < i; n++)
                for (int m = n + 1; m < i; m++)
                {
                    List<int> x = list[hashList[n].ind];
                    List<int> y = list[hashList[m].ind];
                    if (x.Count == y.Count && x.SequenceEqual(y))
                        Console.WriteLine("Dupes: {0} and {1}", hashList[n].ind, hashList[m].ind);
                }                    
        }
        if (i == hashList.Count)
            break;
        if (hashList[i].hash != prevHash)
        {
            firstInd = i;
            prevHash = hashList[i].hash;
        }
    }
}

Answer 2

Unless you are doing some seriously heavy lifting, the following simple code will probably work for you:

var lists = new List<List<int>>()
{
   new List<int>() {0 ,1, 2, 3, 4, 5, 6 },
   new List<int>() {0 ,1, 2, 3, 4, 5, 6 },
   new List<int>() {0 ,1, 4, 2, 4, 5, 6 },
   new List<int>() {0 ,3, 2, 5, 1, 6, 4 }
};

var duplicates = from list in lists
                 where lists.Except(new[] { list }).Any(l => l.SequenceEqual(list))
                 select list;

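Obviously you could get better performance by hand-tuning an algorithm so that you don't have to scan all the lists on every iteration, but there is something to be said for writing declarative, simpler code.

(Plus, thanks to the Awesomeness of LINQ®, adding an .AsParallel() call to the code above makes the algorithm run on multiple cores, so it could potentially run faster than the complex, hand-tuned solutions mentioned in this thread.)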
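The .AsParallel() remark could look roughly like the sketch below. The class and method names (ParallelDuplicates, Find) are made up for illustration, and the Except call is swapped for a ReferenceEquals check so the query reads the shared list without building a temporary set per item. It is still O(n^2) comparisons; PLINQ only spreads that work across cores.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original answer.
public static class ParallelDuplicates
{
    // Returns every list that has at least one other list (a different
    // object) with the same elements in the same order.
    public static List<List<int>> Find(List<List<int>> lists)
    {
        return lists
            .AsParallel()
            .AsOrdered() // keep the results in input order
            .Where(list => lists.Any(other =>
                !ReferenceEquals(other, list) && other.SequenceEqual(list)))
            .ToList();
    }
}
```

Concurrent reads of a List<T> are safe as long as nothing modifies it while the query runs, which matches the "never removed" constraint only if no inserts happen during the scan.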

Answer 3

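You will have to iterate over each index of each list at least once, but you can potentially speed up the process by creating a custom hash table, so that you can quickly reject non-duplicate lists without comparing them item by item.

Algorithm: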

Create a custom hashtable (dictionary: hash -> list of lists)
For each list
  Take a hash of the list (one that takes order into account)
  Search in hashtable
  If you find matches for the hash
    For each list in the hash entry, re-compare the tables
      If you find a duplicate, return true
    If none of them match, append the current list to the hash entry
  Else if you don't find matches for the hash
    Create a temp list
    Append the current list to our temp list
    Add the temp list to the dictionary as a new hash entry
You didn't find any duplicates, so return false

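If you have a strong enough hashing algorithm for your input data, you might not even need the sub-comparisons, since there would be no hash collisions.

I have some example code. The bits that are missing are:

Optimization so that we do only one dictionary lookup per list (for both search and insert). You might have to write your own Dictionary / Hash Table class to do this?
A better hashing algorithm, which you find by profiling a bunch of them against your data

Here is the code: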

public bool ContainsDuplicate(List<List<int>> input)
{
    var encounteredLists = new Dictionary<int, List<EnumerableWrapper>>();

    foreach (List<int> currentList in input)
    {
        var currentListWrapper = new EnumerableWrapper(currentList);
        int hash = currentListWrapper.GetHashCode();

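        List<EnumerableWrapper> bucket;
        if (encounteredLists.TryGetValue(hash, out bucket))
        {
            foreach (EnumerableWrapper currentEncounteredEntry in bucket)
            {
                if (currentListWrapper.Equals(currentEncounteredEntry))
                    return true;
            }

            // Hash collision without a match: remember this list as well,
            // otherwise a later copy of it would go undetected.
            bucket.Add(currentListWrapper);
        }
        else
        {
            var newEntry = new List<EnumerableWrapper>();
            newEntry.Add(currentListWrapper);
            encounteredLists[hash] = newEntry;
        }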
    }

    return false;
}

sealed class EnumerableWrapper
{
    public EnumerableWrapper(IEnumerable<int> list)
    {
        if (list == null)
            throw new ArgumentNullException("list");
        this.List = list;
    }

    public IEnumerable<int> List { get; private set; }

    public override bool Equals(object obj)
    {
        bool result = false;

        var other = obj as EnumerableWrapper;
        if (other != null)
            result = Enumerable.SequenceEqual(this.List, other.List);

        return result;
    }

    public override int GetHashCode()
    {
        // Todo: Implement your own hashing algorithm here.
        // The separator keeps {1, 23} and {12, 3} from producing the same string.
        var sb = new StringBuilder();
        foreach (int value in List)
            sb.Append(value.ToString()).Append(',');
        return sb.ToString().GetHashCode();
    }
}

Answer 4

Something like this will give you the correct results:

List<List<int>> list = new List<List<int>>(){
  new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
  new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
  new List<int>() {0 ,1 ,4, 2, 4, 5, 6 },
  new List<int>() {0 ,3 ,2, 5, 1, 6, 4 }
};

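// Group the lists by a string key built from their elements; any group with
// more than one member consists of duplicates.
var duplicates = list.ToLookup(l => String.Join(",", l.Select(i => i.ToString()).ToArray()))
    .Where(lk => lk.Count() > 1)
    .SelectMany(group => group);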

Answer 5

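Here is a potential idea (this assumes that the values are numerical):

Implement a comparer that multiplies each member of each collection by its index, then sums the whole thing: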

Value:    0  5  8  3  2  0  5  3  5  1
Index:    1  2  3  4  5  6  7  8  9  10
Multiple: 0  10 24 12 10 0  35 24 45 10

Member CheckSum: 170

So the whole "row" gets a number that changes with both the members and their ordering. It is fast to compute and to compare.
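A minimal sketch of that checksum, assuming int values and a 1-based index as in the table above. PositionalChecksum and Compute are names invented here. Note that different rows can share a checksum (for example {2, 0} and {0, 1} both give 2), so equal checksums only mark candidates that still need an element-by-element comparison.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating the index-weighted checksum.
public static class PositionalChecksum
{
    // Sum of value * (1-based index). Arithmetic is unchecked, so very long
    // rows wrap around instead of throwing on overflow.
    public static long Compute(IReadOnlyList<int> row)
    {
        long sum = 0;
        for (int i = 0; i < row.Count; i++)
            sum = unchecked(sum + (long)row[i] * (i + 1));
        return sum;
    }
}
```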

Answer 6

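If they are all single digits and every list has the same number of elements, you can concatenate the digits, so the first list becomes 123456, and then check whether the resulting numbers are equal.

Then you would have a list {123456, 123456, 142456, 325164},

which is easier to check for duplicates. If an individual member can be 10 or more, you would have to modify this.

Edit: added sample code. It can be optimized; it is just a quick example to explain what I meant.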

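// Assumes every element is a single digit (0-9) and all lists have the same length.
var combined = new List<long>();
for (int i = 0; i < list.Count; i++)
{
    List<int> tempList = list[i];
    long temp = 0;
    for (int j = 0; j < tempList.Count; j++)
    {
        temp = temp * 10 + tempList[j];
    }
    combined.Add(temp);
}

for (int i = 0; i < combined.Count; i++)
{
    // Start at i + 1 so that an entry is never compared with itself.
    for (int j = i + 1; j < combined.Count; j++)
    {
        if (combined[i] == combined[j])
        {
            return true;
        }
    }
}
return false;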

Answer 7

You could also try a probabilistic algorithm if duplicates are either very rare or very common, e.g. a Bloom filter.
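To make the Bloom filter idea concrete, here is a rough sketch under several assumptions: int elements, a fixed bit count chosen by the caller, and two simple order-sensitive hashes combined as h1 + k*h2. The class name ListBloomFilter and its AddAndCheck method are invented for this example. A "false" answer means the list is definitely new; a "true" answer only means "possibly seen before" and should be confirmed with SequenceEqual against the stored lists.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical sketch of a Bloom filter keyed on whole lists.
public sealed class ListBloomFilter
{
    private readonly BitArray bits;
    private readonly int hashCount;

    public ListBloomFilter(int bitCount, int hashCount)
    {
        bits = new BitArray(bitCount);
        this.hashCount = hashCount;
    }

    // Records the list and reports whether it was possibly seen before.
    // false: definitely not seen. true: maybe seen (can be a false positive).
    public bool AddAndCheck(IEnumerable<int> list)
    {
        uint h1 = 17u;
        uint h2 = 2166136261u;
        foreach (int x in list)
        {
            unchecked
            {
                h1 = h1 * 31u + (uint)x;          // order-sensitive polynomial hash
                h2 = (h2 ^ (uint)x) * 16777619u;  // FNV-style hash
            }
        }

        bool possiblySeen = true;
        for (int k = 0; k < hashCount; k++)
        {
            uint combined = unchecked(h1 + (uint)k * h2);
            int index = (int)(combined % (uint)bits.Length);
            if (!bits[index])
            {
                possiblySeen = false;
                bits[index] = true;
            }
        }
        return possiblySeen;
    }
}
```

With rare duplicates, most lists are rejected by the filter alone and the expensive comparison runs only on the few "maybe" hits.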

Answer 8

How about writing your own list comparer:

class ListComparer:IEqualityComparer<List<int>>
{
     public bool Equals(List<int> x, List<int> y)
     {
        if(x.Count != y.Count)
          return false;

        for(int i = 0; i < x.Count; i++)
          if(x[i] != y[i])
             return false;

       return true;
     }

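     public int GetHashCode(List<int> obj)
     {
        // Hash the contents so that equal lists get equal hash codes.
        // base.GetHashCode() would return the same value for every list,
        // forcing Distinct to compare the lists pairwise.
        int hash = 17;
        foreach (int value in obj)
            hash = unchecked(hash * 31 + value);
        return hash;
     }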
}

Then just:

var nonDuplicatedList = list.Distinct(new ListComparer());
var distinctCount = nonDuplicatedList.Count();
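Since the question says the element type is not limited to int, the same comparer idea can be sketched generically. SequenceComparer<T> and DuplicateCounter below are hypothetical names; the hash is a plain 31-multiplier combination, chosen only for brevity. CountDuplicated returns how many lists belong to a group of two or more equal lists, which is the "2 of 4" figure asked for in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical generic, order-sensitive comparer for lists.
public sealed class SequenceComparer<T> : IEqualityComparer<List<T>>
{
    private readonly IEqualityComparer<T> itemComparer = EqualityComparer<T>.Default;

    public bool Equals(List<T> x, List<T> y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.Count == y.Count && x.SequenceEqual(y, itemComparer);
    }

    public int GetHashCode(List<T> list)
    {
        if (list == null) return 0;
        int hash = 17;
        foreach (T item in list)
            hash = unchecked(hash * 31 + (item == null ? 0 : itemComparer.GetHashCode(item)));
        return hash;
    }
}

public static class DuplicateCounter
{
    // Number of lists that have at least one equal sibling.
    public static int CountDuplicated<T>(List<List<T>> lists)
    {
        return lists
            .GroupBy(l => l, new SequenceComparer<T>())
            .Where(g => g.Count() > 1)
            .Sum(g => g.Count());
    }
}
```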

Answer 9

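There are a number of good solutions here already, but I believe this one will consistently run the fastest, unless there is some structure to the data that you have not told us about yet.

Create a map from integer key to List, and a map from key to List<List<int>>
For each List<int>, compute a hash using some simple function such as (...((x0)*a + x1)*a + ...)*a + xN), which you can compute recursively; a should be something like 1367130559 (i.e. some large prime that is randomly not close to any interesting power of 2).
Add the hash and the list it came from as a key-value pair to the first map, if the key does not exist yet. If it does exist, look in the second map. If the second map has that key, append the new List<int> to the accumulating list. If not, take the List<int> you looked up in the first map and the List<int> you are testing, and add a new entry to the second map containing a list of those two items.
Repeat until you have gone through the entire original list. Now you have a hash map with lists of potential collisions (the second map), and a hash map with one list per key (the first map).
Iterate through the second map. For each entry, take the List<List<int>> in it and sort it lexicographically. Now just walk through it doing equality comparisons to count the number of different blocks.
Your total number of items is equal to the length of your original list.
Your number of distinct items is equal to the size of your first hash map plus the sum of (number of blocks - 1) for each entry in your second hash map.
Your number of duplicate items is the difference between those two numbers (and you can find out all sorts of other things if you like).

If you have N non-duplicate items, and M entries that are duplicates drawn from a set of K items, then it will take O(N + M + 2K) to create the initial hash maps, at the very worst O(M log M) to do the sorting (probably more like O(M log(M/K))), and O(M) to do the final equality test.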
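The two-map procedure might be sketched as follows, assuming int elements. TwoMapDuplicateCounter, Hash, CompareLex and CountDistinct are names introduced here for illustration; the multiplier is the value suggested in the answer, and the arithmetic is unchecked so the hash wraps on overflow.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the two-map approach.
public static class TwoMapDuplicateCounter
{
    private const int A = 1367130559; // multiplier suggested in the answer

    // (...((x0)*a + x1)*a + ...)*a + xN, computed iteratively.
    private static int Hash(List<int> list)
    {
        int h = 0;
        foreach (int x in list)
            h = unchecked(h * A + x);
        return h;
    }

    // Lexicographic comparison; shorter lists sort first on a shared prefix.
    private static int CompareLex(List<int> x, List<int> y)
    {
        int n = Math.Min(x.Count, y.Count);
        for (int i = 0; i < n; i++)
        {
            int c = x[i].CompareTo(y[i]);
            if (c != 0) return c;
        }
        return x.Count.CompareTo(y.Count);
    }

    public static int CountDistinct(List<List<int>> lists)
    {
        var first = new Dictionary<int, List<int>>();            // hash -> first list seen
        var collisions = new Dictionary<int, List<List<int>>>(); // hash -> all lists sharing it

        foreach (var list in lists)
        {
            int h = Hash(list);
            if (!first.TryGetValue(h, out var earlier))
                first[h] = list;
            else if (collisions.TryGetValue(h, out var bucket))
                bucket.Add(list);
            else
                collisions[h] = new List<List<int>> { earlier, list };
        }

        int distinct = first.Count;
        foreach (var bucket in collisions.Values)
        {
            bucket.Sort(CompareLex);
            int blocks = 1; // runs of equal lists after sorting
            for (int i = 1; i < bucket.Count; i++)
                if (CompareLex(bucket[i - 1], bucket[i]) != 0) blocks++;
            distinct += blocks - 1;
        }
        return distinct;
    }
}
```

For the four lists in the question this gives 3 distinct items, so total minus distinct is 1: the count of redundant copies rather than the count of lists involved in a duplication.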

Answer 10

Check out "C# 3.0: Need to return duplicates from a List<>"; it shows you how to return the duplicates from a list.

Example from that page:

var duplicates = from car in cars
             group car by car.Color into grouped
             from car in grouped.Skip(1)
             select car;

Find duplicates in a list of lists