Two very large lists/sets - how to efficiently detect and/or remove duplicates

Date: 2018-06-25 05:36:32

Tags: c# algorithm lookup

Why

When comparing and de-duplicating across two lists, coders under time pressure often don't arrive at the most runtime-efficient implementation. For many coders, two nested for-loops are the common go-to solution. A CROSS JOIN with LINQ might be tried, but that is clearly inefficient. What's needed is a memorable, code-efficient approach that is also relatively runtime-efficient.

This question was created after seeing a more specific one: Delete duplicates in a single dataset relative to another one in C# - which is specialized around the use of DataSets. The term "dataset" won't help people searching for this in future, and no other general question could be found.

What

I use the term "lists/sets" in stating this more general coding problem.

var setToDeduplicate = new List<int>() { 1,2,3,4,5,6,7,8,9,10,11,.....}; //All integer values 1-1M 

var referenceSet = new List<int>() { 1,3,5,7,9,....}; //All odd integer values 1-1M

var deduplicatedSet = deduplicationFunction(setToDeduplicate, referenceSet);

The input data and desired output should be clear from the deduplicationFunction to be implemented. The output can be an IEnumerable. The expected output for this example input would be the even numbers from 1-1M: {2,4,6,8,...}.

Note: there may be duplicates in referenceSet. The values in both sets are merely indicative, so I'm not looking for a mathematical shortcut - the solution should also work for random-number inputs in both sets.

Processed with simple LINQ functions, this would be far too slow: O(1M * 0.5M). For sets this large, a faster approach is needed.
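For concreteness, the naive LINQ shape being ruled out here looks something like the following (a sketch using the variable names above; List<T>.Contains performs a linear scan of referenceSet for every element, hence the O(n * m) cost):

var deduplicatedSet = setToDeduplicate
    .Where(x => !referenceSet.Contains(x)) //O(m) linear scan per element
    .Distinct();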

Speed is important, but improvements that demand masses of extra code are of diminishing value. Ideally it would also work for other data types, including data-model objects, but answering this specific question should be enough; other data types would just involve more pre-processing or slight changes to the answer.

Solution Summary

Here is the test code, with its results below:

using System;
using System.Collections.Generic;
using System.Data;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace Test
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Preparing...");

            List<int> set1 = new List<int>();
            List<int> set2 = new List<int>();

            Random r = new Random();
            var max = 10000;

            for (int i = 0; i < max; i++)
            {
                set1.Add(r.Next(0, max));
                set2.Add(r.Next(0, max/2) * 2);
            }

            Console.WriteLine("First run...");

            Stopwatch sw = new Stopwatch();
            IEnumerable<int> result;
            int count;

            while (true)
            {
                sw.Start();
                result = deduplicationFunction(set1, set2);
                var results1 = result.ToList();
                count = results1.Count;
                sw.Stop();
                Console.WriteLine("Dictionary and Where - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();


                sw.Start();
                result = deduplicationFunction2(set1, set2);
                var results2 = result.ToList();
                count = results2.Count;
                sw.Stop();
                Console.WriteLine("  HashSet ExceptWith - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();

                sw.Start();
                result = deduplicationFunction3(set1, set2);
                var results3 = result.ToList();
                count = results3.Count;
                sw.Stop();
                Console.WriteLine("     Sort Dual Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();

                sw.Start();
                result = deduplicationFunction4(set1, set2);
                var results4 = result.ToList();
                count = results4.Count;
                sw.Stop();
                Console.WriteLine("Presorted Dual Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();


                set2.RemoveAt(set2.Count - 1); //Remove the last item, because it was added in the 3rd test

                sw.Start();
                result = deduplicationFunction5(set1, set2);
                var results5 = result.ToList();
                count = results5.Count;
                sw.Stop();
                Console.WriteLine("        Nested Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
                sw.Reset();


                Console.ReadLine();

                Console.WriteLine("");
                Console.WriteLine("Next Run");
                Console.WriteLine("");
            }

        }


        //Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
        static IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
        {
            //Create a hashset first, which is much more efficient for searching
            var ReferenceHashSet = Reference
                                .Distinct() //Inserting duplicate keys in a dictionary will cause an exception
                                .ToDictionary(x => x, x => x); //If there was a ToHashSet function, that would be nicer

            int throwAway;
            return Set.Distinct().Where(y => ReferenceHashSet.TryGetValue(y, out throwAway) == false);
        }

        //Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
        static IEnumerable<int> deduplicationFunction2(List<int> Set, List<int> Reference)
        {
            //Create a hashset first, which is much more efficient for searching
            var SetAsHash = new HashSet<int>();

            Set.ForEach(x =>
            {
                if (SetAsHash.Contains(x))
                    return;

                SetAsHash.Add(x);
            }); // .Net 4.7.2 - ToHashSet will reduce this code to a single line.

            SetAsHash.ExceptWith(Reference); // This is ultimately what we're testing

            return SetAsHash.AsEnumerable();
        }

        static IEnumerable<int> deduplicationFunction3(List<int> Set, List<int> Reference)
        {
            Set.Sort();
            Reference.Sort();
            Reference.Add(Set[Set.Count - 1] + 1); //Ensure the last set item is non-duplicate for an In-built stop clause. This is easy for int list items, just + 1 on the last item.

            return deduplicationFunction4(Set, Reference);
        }

        static IEnumerable<int> deduplicationFunction4(List<int> Set, List<int> Reference)
        {
            int i1 = 0;
            int i2 = 0;
            int thisValue = Set[i1];
            int thisReference = Reference[i2];
            while (true)
            {
                var difference = thisReference - thisValue;

                if (difference < 0)
                {
                    i2++; //Compare side is too low, there might be an equal value to be found
                    if (i2 == Reference.Count)
                        break;
                    thisReference = Reference[i2];
                    continue;
                }

                if (difference > 0) //No match possible in Reference - keep this value
                    yield return thisValue;

                GoFurther:
                i1++;
                if (i1 == Set.Count)
                    break;
                if (Set[i1] == thisValue) //Eliminates duplicates
                    goto GoFurther; //I rarely use goto statements, but this is a good situation

                thisValue = Set[i1];
            }
        }

        static IEnumerable<int> deduplicationFunction5(List<int> Set, List<int> Reference)
        {
            var found = false;
            var lastValue = 0;
            var thisValue = 0;
            for (int i = 0; i < Set.Count; i++)
            {
                thisValue = Set[i];

                if (thisValue == lastValue)
                    continue;

                lastValue = thisValue;

                found = false;
                for (int x = 0; x < Reference.Count; x++)
                {
                    if (thisValue != Reference[x])
                        continue;

                    found = true;
                    break;
                }

                if (found)
                    continue;

                yield return thisValue;
            }
        }
    }
}
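As the comments in deduplicationFunction and deduplicationFunction2 note, Enumerable.ToHashSet (available from .NET Framework 4.7.2 and .NET Core 2.0) shortens the set-building steps. A minimal sketch, assuming one of those framework versions:

var setAsHash = Set.ToHashSet();           //Replaces the ForEach/Contains loop in deduplicationFunction2
var referenceHash = Reference.ToHashSet(); //Replaces the Distinct().ToDictionary(x => x, x => x) in deduplicationFunction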

I will use this to compare the performance of several approaches. (At this stage I'm particularly interested in hashing versus the sorted dual-index approach, although ExceptWith does enable a very concise solution.)

Results so far, for 10K items (good runs):

First run

  • Dictionary and Where - Count: 3565, Milliseconds: 16.38.
  • HashSet ExceptWith - Count: 3565, Milliseconds: 5.33.
  • Sort Dual Index - Count: 3565, Milliseconds: 6.34.
  • Presorted Dual Index - Count: 3565, Milliseconds: 1.14.
  • Nested Index - Count: 3565, Milliseconds: 964.16.

A good run

  • Dictionary and Where - Count: 3565, Milliseconds: 1.21.
  • HashSet ExceptWith - Count: 3565, Milliseconds: 0.94.
  • Sort Dual Index - Count: 3565, Milliseconds: 1.09.
  • Presorted Dual Index - Count: 3565, Milliseconds: 0.76.
  • Nested Index - Count: 3565, Milliseconds: 628.60.

Chosen answers:

  • @Backs' HashSet.ExceptWith approach - marginally faster with the least code, making use of the interesting but little-known ExceptWith function; weakened only by its lack of versatility.
  • One of my own answers: HashSet + Where(..Contains..) - only a little slower than @Backs', but uses a LINQ code pattern that generalizes well beyond a plain list of elements. I believe that is the more common situation I meet when coding, and suspect the same is true for many other coders.
  • Special thanks to @TheGeneral for benchmarking several of the answers plus some interesting unsafe versions, and for helping make @Backs' answer more efficient for follow-up testing.

4 answers:

Answer 0 (score: 1)

Use a HashSet for your initial list and the ExceptWith method to get the result set (the code snippet here is reconstructed from the Backs benchmark class below; the original scraped line contained an unrelated JavaScript fragment):

var hashSet = new HashSet<int>(setToDeduplicate);
hashSet.ExceptWith(referenceSet);
var deduplicatedSet = hashSet.ToList();

Answer 1 (score: 1)

Some more benchmarks: basically I wanted to test the various solutions against both distinct and non-distinct inputs. In the non-distinct versions I had to call Distinct on the final output where needed.

Mode             : Release (64Bit)
Test Framework   : .NET Framework 4.7.1

Operating System : Microsoft Windows 10 Pro
Version          : 10.0.17134

CPU Name         : Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
Description      : Intel64 Family 6 Model 58 Stepping 9

Cores (Threads)  : 4 (8)      : Architecture  : x64
Clock Speed      : 3901 MHz   : Bus Speed     : 100 MHz
L2Cache          : 1 MB       : L3Cache       : 8 MB

Benchmarks Runs : Inputs (1) * Scales (5) * Benchmarks (6) * Runs (100) = 3,000

Results with distinct input

--- Random Set 1 ---------------------------------------------------------------------
| Value         |   Average |   Fastest |      Cycles |    Garbage | Test |     Gain |
--- Scale 100 --------------------------------------------------------- Time 0.334 ---
| Backs         |  0.008 ms |  0.007 ms |      31,362 |   8.000 KB | Pass |  68.34 % |
| ListUnsafe    |  0.009 ms |  0.008 ms |      35,487 |   8.000 KB | Pass |  63.45 % |
| HasSet        |  0.012 ms |  0.011 ms |      46,840 |   8.000 KB | Pass |  50.03 % |
| ArrayUnsafe   |  0.013 ms |  0.011 ms |      49,388 |   8.000 KB | Pass |  47.75 % |
| HashSetUnsafe |  0.018 ms |  0.013 ms |      66,866 |  16.000 KB | Pass |  26.62 % |
| Todd          |  0.024 ms |  0.019 ms |      90,763 |  16.000 KB | Base |   0.00 % |
--- Scale 1,000 ------------------------------------------------------- Time 0.377 ---
| Backs         |  0.070 ms |  0.060 ms |     249,374 |  28.977 KB | Pass |  57.56 % |
| ListUnsafe    |  0.078 ms |  0.067 ms |     277,080 |  28.977 KB | Pass |  52.67 % |
| HasSet        |  0.093 ms |  0.083 ms |     329,686 |  28.977 KB | Pass |  43.61 % |
| ArrayUnsafe   |  0.096 ms |  0.082 ms |     340,154 |  36.977 KB | Pass |  41.72 % |
| HashSetUnsafe |  0.103 ms |  0.085 ms |     367,681 |  55.797 KB | Pass |  37.07 % |
| Todd          |  0.164 ms |  0.151 ms |     578,933 | 112.664 KB | Base |   0.00 % |
--- Scale 10,000 ------------------------------------------------------ Time 0.965 ---
| ListUnsafe    |  0.706 ms |  0.611 ms |   2,467,327 | 258.516 KB | Pass |  48.60 % |
| Backs         |  0.758 ms |  0.654 ms |   2,656,610 | 180.297 KB | Pass |  44.81 % |
| ArrayUnsafe   |  0.783 ms |  0.696 ms |   2,739,156 | 276.281 KB | Pass |  43.02 % |
| HasSet        |  0.859 ms |  0.752 ms |   2,999,230 | 198.063 KB | Pass |  37.47 % |
| HashSetUnsafe |  0.864 ms |  0.783 ms |   3,029,086 | 332.273 KB | Pass |  37.07 % |
| Todd          |  1.373 ms |  1.251 ms |   4,795,929 | 604.742 KB | Base |   0.00 % |
--- Scale 100,000 ----------------------------------------------------- Time 5.535 ---
| ListUnsafe    |  5.624 ms |  4.874 ms |  19,658,154 |   2.926 MB | Pass |  40.36 % |
| HasSet        |  7.574 ms |  6.548 ms |  26,446,193 |   2.820 MB | Pass |  19.68 % |
| Backs         |  7.585 ms |  5.634 ms |  26,303,794 |   2.009 MB | Pass |  19.57 % |
| ArrayUnsafe   |  8.287 ms |  6.219 ms |  28,923,797 |   3.583 MB | Pass |  12.12 % |
| Todd          |  9.430 ms |  7.326 ms |  32,880,985 |   2.144 MB | Base |   0.00 % |
| HashSetUnsafe |  9.601 ms |  7.859 ms |  32,845,228 |   5.197 MB | Pass |  -1.81 % |
--- Scale 1,000,000 -------------------------------------------------- Time 47.652 ---
| ListUnsafe    | 57.751 ms | 44.734 ms | 201,477,028 |  29.309 MB | Pass |  22.14 % |
| Backs         | 65.567 ms | 49.023 ms | 228,772,283 |  21.526 MB | Pass |  11.61 % |
| HasSet        | 73.163 ms | 56.799 ms | 254,703,994 |  25.904 MB | Pass |   1.36 % |
| Todd          | 74.175 ms | 53.739 ms | 258,760,390 |   9.144 MB | Base |   0.00 % |
| ArrayUnsafe   | 86.530 ms | 67.803 ms | 300,374,535 |  13.755 MB | Pass | -16.66 % |
| HashSetUnsafe | 97.140 ms | 77.844 ms | 337,639,426 |  39.527 MB | Pass | -30.96 % |
--------------------------------------------------------------------------------------

Results with random (non-distinct) lists, calling Distinct where needed

--- Random Set 1 ---------------------------------------------------------------------
| Value         |    Average |   Fastest |      Cycles |    Garbage | Test |    Gain |
--- Scale 100 --------------------------------------------------------- Time 0.272 ---
| Backs         |   0.007 ms |  0.006 ms |      28,449 |   8.000 KB | Pass | 72.96 % |
| HasSet        |   0.010 ms |  0.009 ms |      38,222 |   8.000 KB | Pass | 62.05 % |
| HashSetUnsafe |   0.014 ms |  0.010 ms |      51,816 |  16.000 KB | Pass | 47.52 % |
| ListUnsafe    |   0.017 ms |  0.014 ms |      64,333 |  16.000 KB | Pass | 33.84 % |
| ArrayUnsafe   |   0.020 ms |  0.015 ms |      72,468 |  16.000 KB | Pass | 24.70 % |
| Todd          |   0.026 ms |  0.021 ms |      95,500 |  24.000 KB | Base |  0.00 % |
--- Scale 1,000 ------------------------------------------------------- Time 0.361 ---
| Backs         |   0.061 ms |  0.053 ms |     219,141 |  28.977 KB | Pass | 70.46 % |
| HasSet        |   0.092 ms |  0.080 ms |     325,353 |  28.977 KB | Pass | 55.78 % |
| HashSetUnsafe |   0.093 ms |  0.079 ms |     331,390 |  55.797 KB | Pass | 55.03 % |
| ListUnsafe    |   0.122 ms |  0.101 ms |     432,029 |  73.016 KB | Pass | 41.19 % |
| ArrayUnsafe   |   0.133 ms |  0.113 ms |     469,560 |  73.016 KB | Pass | 35.88 % |
| Todd          |   0.208 ms |  0.173 ms |     730,661 | 148.703 KB | Base |  0.00 % |
--- Scale 10,000 ------------------------------------------------------ Time 0.870 ---
| Backs         |   0.620 ms |  0.579 ms |   2,174,415 | 180.188 KB | Pass | 55.31 % |
| HasSet        |   0.696 ms |  0.635 ms |   2,440,300 | 198.063 KB | Pass | 49.87 % |
| HashSetUnsafe |   0.731 ms |  0.679 ms |   2,563,125 | 332.164 KB | Pass | 47.32 % |
| ListUnsafe    |   0.804 ms |  0.761 ms |   2,818,293 | 400.492 KB | Pass | 42.11 % |
| ArrayUnsafe   |   0.810 ms |  0.751 ms |   2,838,680 | 400.492 KB | Pass | 41.68 % |
| Todd          |   1.388 ms |  1.271 ms |   4,863,651 | 736.953 KB | Base |  0.00 % |
--- Scale 100,000 ----------------------------------------------------- Time 6.616 ---
| Backs         |   5.604 ms |  4.710 ms |  19,600,934 |   2.009 MB | Pass | 62.92 % |
| HasSet        |   6.607 ms |  5.847 ms |  23,093,963 |   2.820 MB | Pass | 56.29 % |
| HashSetUnsafe |   8.565 ms |  7.465 ms |  29,239,067 |   5.197 MB | Pass | 43.34 % |
| ListUnsafe    |  11.447 ms |  9.543 ms |  39,452,865 |   5.101 MB | Pass | 24.28 % |
| ArrayUnsafe   |  11.517 ms |  9.841 ms |  39,731,502 |   5.483 MB | Pass | 23.81 % |
| Todd          |  15.116 ms | 11.369 ms |  51,963,309 |   3.427 MB | Base |  0.00 % |
--- Scale 1,000,000 -------------------------------------------------- Time 55.310 ---
| Backs         |  53.766 ms | 44.321 ms | 187,905,335 |  21.526 MB | Pass | 51.32 % |
| HasSet        |  60.759 ms | 50.742 ms | 212,409,649 |  25.904 MB | Pass | 44.99 % |
| HashSetUnsafe |  79.248 ms | 67.130 ms | 275,455,545 |  39.527 MB | Pass | 28.25 % |
| ListUnsafe    | 106.527 ms | 90.159 ms | 370,838,650 |  39.153 MB | Pass |  3.55 % |
| Todd          | 110.444 ms | 93.225 ms | 384,636,081 |  22.676 MB | Base |  0.00 % |
| ArrayUnsafe   | 114.548 ms | 98.033 ms | 398,219,513 |  38.974 MB | Pass | -3.72 % |
--------------------------------------------------------------------------------------

Data

private Tuple<List<int>, List<int>> GenerateData(int scale)
{
   return new Tuple<List<int>, List<int>>(
      Enumerable.Range(0, scale)
                .Select(x => x)
                .ToList(),
      Enumerable.Range(0, scale)
                .Select(x => Rand.Next(10000))
                .ToList());
}

Code

public class Backs : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
   protected override List<int> InternalRun()
   {
      var hashSet = new HashSet<int>(Input.Item1); 
      hashSet.ExceptWith(Input.Item2); 
      return hashSet.ToList(); 
   }
}

public class HasSet : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{

   protected override List<int> InternalRun()
   {
      var hashSet = new HashSet<int>(Input.Item2); 

      return Input.Item1.Where(y => !hashSet.Contains(y)).ToList(); 
   }
}

public class Todd : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
   protected override List<int> InternalRun()
   {
      var referenceHashSet = Input.Item2.Distinct()                 
                                      .ToDictionary(x => x, x => x);

      return Input.Item1.Where(y => !referenceHashSet.TryGetValue(y, out _)).ToList();
   }
}

public unsafe class HashSetUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
   protected override List<int> InternalRun()
   {
      var reference = new HashSet<int>(Input.Item2);
      var result = new HashSet<int>();
      fixed (int* pAry = Input.Item1.ToArray())
      {
         var len = pAry+Input.Item1.Count;
         for (var p = pAry; p < len; p++)
         {
            if(!reference.Contains(*p))
               result.Add(*p);
         }
      }
      return result.ToList(); 
   }
}
public unsafe class ListUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
   protected override List<int> InternalRun()
   {
      var reference = new HashSet<int>(Input.Item2);
      var result = new List<int>(Input.Item2.Count);

      fixed (int* pAry = Input.Item1.ToArray())
      {
         var len = pAry+Input.Item1.Count;
         for (var p = pAry; p < len; p++)
         {
            if(!reference.Contains(*p))
               result.Add(*p);
         }
      }
      return result.ToList(); 
   }
}

public unsafe class ArrayUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
   protected override List<int> InternalRun()
   {
      var reference = new HashSet<int>(Input.Item2);
      var result = new int[Input.Item1.Count];

      fixed (int* pAry = Input.Item1.ToArray(), pRes = result)
      {
         var j = 0;
         var len = pAry+Input.Item1.Count;
         for (var p = pAry; p < len; p++)
         {
            if(!reference.Contains(*p))
               *(pRes+j++) = *p;
         }
         return result.Take(j).ToList(); 
      }

   }
}

Summary

No real surprises here: if you already have a distinct list, that suits some of the solutions better; if not, the simplest HashSet version is the best.

Answer 2 (score: 0)

Single Loop, Dual Index

As @PepitoSh suggested in the comments on the question:

  "I think HashSet is a very generic solution to a rather specific problem. If your lists are ordered, scanning them in parallel and comparing the current items is the fastest."

This is very different from having two nested loops. Instead, there is a single shared loop, and the two indexes are incremented in parallel depending on the relative difference of the current values. That difference is essentially the output of an ordinary compare function: {negative, 0, positive}.

static IEnumerable<int> deduplicationFunction4(List<int> Set, List<int> Reference)
{
    int i1 = 0;
    int i2 = 0;
    int thisValue = Set[i1];
    int thisReference = Reference[i2];
    while (true)
    {
        var difference = thisReference - thisValue;

        if (difference < 0)
        {
            i2++; //Compare side is too low, there might be an equal value to be found
            if (i2 == Reference.Count)
                break;
            thisReference = Reference[i2];
            continue;
        }

        if (difference > 0) //No match possible in Reference - keep this value
            yield return thisValue;

        GoFurther:
        i1++;
        if (i1 == Set.Count)
            break;
        if (Set[i1] == thisValue) //Eliminates duplicates
            goto GoFurther; //I rarely use goto statements, but this is a good situation

        thisValue = Set[i1];
    }
}

How to call this function if the lists are not already sorted:

Set.Sort();
Reference.Sort();
Reference.Add(Set[Set.Count - 1] + 1); //Ensure the last set item is non-duplicate for an In-built stop clause. This is easy for int list items, just + 1 on the last item.

return deduplicationFunction4(Set, Reference);

This gave me the best performance in my benchmarks. Unsafe code could probably also be tried with this approach for extra speed in some scenarios. Where the data is already sorted, this is by far the best. A faster sorting algorithm could also be chosen, but that is outside the topic of this question.

Note: this method also eliminates duplicates within the Set itself as it goes.

I have actually coded a single-loop pattern like this before, when finalizing text search results, except there I had N arrays to check for "closeness", so I kept an array of indexes - array[index[i]]. So I'm sure a single loop with controlled index increments is not a new concept, but it is definitely a neat solution here.

Answer 3 (score: -5)

HashSet and Where

You need to use a HashSet (or Dictionary) for the speed:

//Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
    //Create a hashset first, which is much more efficient for searching
    var ReferenceHashSet = Reference
                        .Distinct() //Inserting duplicate keys in a dictionary will cause an exception
                        .ToDictionary(x => x, x => x); //If there was a ToHashSet function, that would be nicer

    int throwAway;
    return Set.Where(y => ReferenceHashSet.TryGetValue(y, out throwAway) == false);
}

That is the lambda-expression version. It uses a Dictionary, which provides adaptability for mapping to changed values if needed. A literal for-loop could be used instead, perhaps squeezing out a little more performance, but compared with two nested loops this is already a dramatic improvement.
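For reference, a minimal sketch of what that literal for-loop version might look like (hypothetical, not one of the benchmarked variants; like the lambda version above, it does not de-duplicate Set itself):

static List<int> deduplicationFunctionLoop(List<int> Set, List<int> Reference)
{
    var referenceHashSet = new HashSet<int>(Reference);
    var result = new List<int>(Set.Count);
    for (int i = 0; i < Set.Count; i++)
    {
        if (!referenceHashSet.Contains(Set[i]))
            result.Add(Set[i]); //Keep only values absent from Reference
    }
    return result;
}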

Having learned a few things from the other answers, here is a faster implementation:

static IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
    //Create a hashset first, which is much more efficient for searching
    var ReferenceHashSet = new HashSet<int>(Reference);
    return Set.Where(y => ReferenceHashSet.Contains(y) == false).Distinct();
}

Importantly, this approach, while a tiny bit slower than @Backs' answer, is still versatile enough to be used for database entities, and other types can easily be used for the duplicate-check field.

Below is an example of how easily the code can be adjusted for use with a list of Person database entities.

static IEnumerable<Person> deduplicatePeople(List<Person> Set, List<Person> Reference)
{
    //Create a hashset first, which is much more efficient for searching
    var ReferenceHashSet = new HashSet<int>(Reference.Select(p => p.ID));
    return Set.Where(y => ReferenceHashSet.Contains(y.ID) == false)
            .GroupBy(p => p.ID).Select(p => p.First()); //The groupby and select should accomplish DistinctBy(..p.ID)
}
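For completeness, a minimal sketch of the Person entity assumed above, plus a call site (the class shape is hypothetical; only the ID property is actually required by the code):

public class Person
{
    public int ID { get; set; } //The duplicate-check field used above
    public string Name { get; set; }
}

//Usage:
//var newPeople = deduplicatePeople(peopleToImport, peopleAlreadyStored).ToList();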