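Why
When comparing two lists and removing the duplicates between them, coders under time pressure often fail to find the most runtime-efficient implementation. For many coders, two nested for loops are the go-to solution. One might attempt a CROSS JOIN with LINQ, but that is clearly inefficient. Coders need an approach that is memorable and compact in code while also being reasonably fast at runtime.
This question was created after seeing a more specific one: "Delete duplicates in a single dataset relative to another one in C#", which is specialized toward DataSets. The term "dataset" will not help people searching in future, and no other general question could be found.
What
I use the term "list/set" to frame this more general coding problem.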
var setToDeduplicate = new List<int>() { 1,2,3,4,5,6,7,8,9,10,11,.....}; //All integer values 1-1M
var referenceSet = new List<int>() { 1,3,5,7,9,....}; //All odd integer values 1-1M
var deduplicatedSet = deduplicationFunction(setToDeduplicate, referenceSet);
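With an implementation of deduplicationFunction, the input data and the output should be clear. The output may be an IEnumerable. For this example input, the expected output is the even numbers from 1 to 1M: {2,4,6,8,...}
Note: referenceSet may contain duplicates. The values in both sets are only illustrative, so I am not looking for a mathematical solution; the approach should also work for random numbers in both sets.
A simple LINQ function would be too slow for this, on the order of O(1M * 0.5M) comparisons. Sets this large need a faster approach.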
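To make that cost concrete, here is a minimal sketch of the kind of simple LINQ approach meant above. The class and method names (NaiveDedup, DeduplicateNaive) are hypothetical and not part of the original question. Because List<int>.Contains is a linear scan, the expression performs roughly Set.Count * Reference.Count comparisons in the worst case.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class NaiveDedup
{
    // Hypothetical baseline for comparison only.
    // For each distinct item in set, List<int>.Contains scans reference
    // from the start, so the total work grows with set.Count * reference.Count.
    public static IEnumerable<int> DeduplicateNaive(List<int> set, List<int> reference)
    {
        return set.Distinct().Where(x => !reference.Contains(x));
    }
}
```

This is fine for a few hundred items, but it is the baseline that the hash-based and sorted approaches in the answers are meant to beat.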
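Speed matters, but the value of an improvement diminishes as the amount of code grows. Ideally the approach would also work for other data types, including data model objects, but answering this specific question should be enough. Other data types would only involve more preprocessing or a slight change to the answer.
Solution summary
Here is the test code, followed by its results: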
using System;
using System.Collections.Generic;
using System.Data;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Test
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("Preparing...");
List<int> set1 = new List<int>();
List<int> set2 = new List<int>();
Random r = new Random();
var max = 10000;
for (int i = 0; i < max; i++)
{
set1.Add(r.Next(0, max));
set2.Add(r.Next(0, max/2) * 2);
}
Console.WriteLine("First run...");
Stopwatch sw = new Stopwatch();
IEnumerable<int> result;
int count;
while (true)
{
sw.Start();
result = deduplicationFunction(set1, set2);
var results1 = result.ToList();
count = results1.Count;
sw.Stop();
Console.WriteLine("Dictionary and Where - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
sw.Reset();
sw.Start();
result = deduplicationFunction2(set1, set2);
var results2 = result.ToList();
count = results2.Count;
sw.Stop();
Console.WriteLine(" HashSet ExceptWith - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
sw.Reset();
sw.Start();
result = deduplicationFunction3(set1, set2);
var results3 = result.ToList();
count = results3.Count;
sw.Stop();
Console.WriteLine(" Sort Dual Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
sw.Reset();
sw.Start();
result = deduplicationFunction4(set1, set2);
var results4 = result.ToList();
count = results4.Count;
sw.Stop();
Console.WriteLine("Presorted Dual Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
sw.Reset();
set2.RemoveAt(set2.Count - 1); //Remove the last item, because it was added in the 3rd test
sw.Start();
result = deduplicationFunction5(set1, set2);
var results5 = result.ToList();
count = results5.Count;
sw.Stop();
Console.WriteLine(" Nested Index - Count: {0}, Milliseconds: {1:0.00}.", count, sw.ElapsedTicks / (decimal)10000);
sw.Reset();
Console.ReadLine();
Console.WriteLine("");
Console.WriteLine("Next Run");
Console.WriteLine("");
}
}
//Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
static IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
//Create a hashset first, which is much more efficient for searching
var ReferenceHashSet = Reference
.Distinct() //Inserting duplicate keys in a dictionary will cause an exception
.ToDictionary(x => x, x => x); //If there was a ToHashSet function, that would be nicer
int throwAway;
return Set.Distinct().Where(y => ReferenceHashSet.TryGetValue(y, out throwAway) == false);
}
//Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
static IEnumerable<int> deduplicationFunction2(List<int> Set, List<int> Reference)
{
//Create a hashset first, which is much more efficient for searching
var SetAsHash = new HashSet<int>();
Set.ForEach(x =>
{
if (SetAsHash.Contains(x))
return;
SetAsHash.Add(x);
}); // .Net 4.7.2 - ToHashSet will reduce this code to a single line.
SetAsHash.ExceptWith(Reference); // This is ultimately what we're testing
return SetAsHash.AsEnumerable();
}
static IEnumerable<int> deduplicationFunction3(List<int> Set, List<int> Reference)
{
Set.Sort();
Reference.Sort();
Reference.Add(Set[Set.Count - 1] + 1); //Ensure the last set item is non-duplicate for an In-built stop clause. This is easy for int list items, just + 1 on the last item.
return deduplicationFunction4(Set, Reference);
}
static IEnumerable<int> deduplicationFunction4(List<int> Set, List<int> Reference)
{
int i1 = 0;
int i2 = 0;
int thisValue = Set[i1];
int thisReference = Reference[i2];
while (true)
{
var difference = thisReference - thisValue;
if (difference < 0)
{
i2++; //Compare side is too low, there might be an equal value to be found
if (i2 == Reference.Count)
break;
thisReference = Reference[i2];
continue;
}
if (difference > 0) //Reference is already past this value, so it is not a duplicate
yield return thisValue;
GoFurther:
i1++;
if (i1 == Set.Count)
break;
if (Set[i1] == thisValue) //Eliminates duplicates
goto GoFurther; //I rarely use goto statements, but this is a good situation
thisValue = Set[i1];
}
}
static IEnumerable<int> deduplicationFunction5(List<int> Set, List<int> Reference)
{
var found = false;
var lastValue = 0;
var thisValue = 0;
for (int i = 0; i < Set.Count; i++)
{
thisValue = Set[i];
if (thisValue == lastValue)
continue;
lastValue = thisValue;
found = false;
for (int x = 0; x < Reference.Count; x++)
{
if (thisValue != Reference[x])
continue;
found = true;
break;
}
if (found)
continue;
yield return thisValue;
}
}
}
}
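I will use it to compare the performance of multiple approaches. (Although ExceptWith enables a concise solution, at this stage I am particularly interested in the hashing approaches versus the sorted dual-index approach.)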
Results so far for 10,000 items, for the first run and for a good run: [result figures not preserved]
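Choosing an answer:
ExceptWith: held back by its lack of generality, and the interesting function is little known.
The unsafe versions were benchmarked, which helped test @Backs' answer more efficiently in follow-up runs.
Answer 0 (score: 1)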
Use a HashSet for your initial list and the ExceptWith method to get the result set:
var hashSet = new HashSet<int>(setToDeduplicate);
hashSet.ExceptWith(referenceSet);
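As a self-contained illustration of the HashSet and ExceptWith approach, the following sketch wraps it in a helper. The class and method names (ExceptWithDedup, Deduplicate) are assumptions made for this example. HashSet<int>.ExceptWith removes every element of the set that also appears in the supplied collection, and duplicates in either input are harmless.

```csharp
using System.Collections.Generic;

public static class ExceptWithDedup
{
    // Copies set into a new HashSet (which drops its own duplicates),
    // then removes every value that also occurs in reference.
    // Neither input collection is modified.
    public static HashSet<int> Deduplicate(IEnumerable<int> set, IEnumerable<int> reference)
    {
        var result = new HashSet<int>(set);
        result.ExceptWith(reference);
        return result;
    }
}
```

Note that a HashSet does not promise any particular enumeration order, so callers that need ordered output should sort the result.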
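Answer 1 (score: 1)
There is more to this; basically I wanted to test different inputs against the various solutions. In the non-distinct versions, I had to call Distinct where it was needed for the final output.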
Mode : Release (64Bit)
Test Framework : .NET Framework 4.7.1
Operating System : Microsoft Windows 10 Pro
Version : 10.0.17134
CPU Name : Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
Description : Intel64 Family 6 Model 58 Stepping 9
Cores (Threads) : 4 (8) : Architecture : x64
Clock Speed : 3901 MHz : Bus Speed : 100 MHz
L2Cache : 1 MB : L3Cache : 8 MB
Benchmarks Runs : Inputs (1) * Scales (5) * Benchmarks (6) * Runs (100) = 3,000
Results: distinct input
--- Random Set 1 ---------------------------------------------------------------------
| Value | Average | Fastest | Cycles | Garbage | Test | Gain |
--- Scale 100 --------------------------------------------------------- Time 0.334 ---
| Backs | 0.008 ms | 0.007 ms | 31,362 | 8.000 KB | Pass | 68.34 % |
| ListUnsafe | 0.009 ms | 0.008 ms | 35,487 | 8.000 KB | Pass | 63.45 % |
| HasSet | 0.012 ms | 0.011 ms | 46,840 | 8.000 KB | Pass | 50.03 % |
| ArrayUnsafe | 0.013 ms | 0.011 ms | 49,388 | 8.000 KB | Pass | 47.75 % |
| HashSetUnsafe | 0.018 ms | 0.013 ms | 66,866 | 16.000 KB | Pass | 26.62 % |
| Todd | 0.024 ms | 0.019 ms | 90,763 | 16.000 KB | Base | 0.00 % |
--- Scale 1,000 ------------------------------------------------------- Time 0.377 ---
| Backs | 0.070 ms | 0.060 ms | 249,374 | 28.977 KB | Pass | 57.56 % |
| ListUnsafe | 0.078 ms | 0.067 ms | 277,080 | 28.977 KB | Pass | 52.67 % |
| HasSet | 0.093 ms | 0.083 ms | 329,686 | 28.977 KB | Pass | 43.61 % |
| ArrayUnsafe | 0.096 ms | 0.082 ms | 340,154 | 36.977 KB | Pass | 41.72 % |
| HashSetUnsafe | 0.103 ms | 0.085 ms | 367,681 | 55.797 KB | Pass | 37.07 % |
| Todd | 0.164 ms | 0.151 ms | 578,933 | 112.664 KB | Base | 0.00 % |
--- Scale 10,000 ------------------------------------------------------ Time 0.965 ---
| ListUnsafe | 0.706 ms | 0.611 ms | 2,467,327 | 258.516 KB | Pass | 48.60 % |
| Backs | 0.758 ms | 0.654 ms | 2,656,610 | 180.297 KB | Pass | 44.81 % |
| ArrayUnsafe | 0.783 ms | 0.696 ms | 2,739,156 | 276.281 KB | Pass | 43.02 % |
| HasSet | 0.859 ms | 0.752 ms | 2,999,230 | 198.063 KB | Pass | 37.47 % |
| HashSetUnsafe | 0.864 ms | 0.783 ms | 3,029,086 | 332.273 KB | Pass | 37.07 % |
| Todd | 1.373 ms | 1.251 ms | 4,795,929 | 604.742 KB | Base | 0.00 % |
--- Scale 100,000 ----------------------------------------------------- Time 5.535 ---
| ListUnsafe | 5.624 ms | 4.874 ms | 19,658,154 | 2.926 MB | Pass | 40.36 % |
| HasSet | 7.574 ms | 6.548 ms | 26,446,193 | 2.820 MB | Pass | 19.68 % |
| Backs | 7.585 ms | 5.634 ms | 26,303,794 | 2.009 MB | Pass | 19.57 % |
| ArrayUnsafe | 8.287 ms | 6.219 ms | 28,923,797 | 3.583 MB | Pass | 12.12 % |
| Todd | 9.430 ms | 7.326 ms | 32,880,985 | 2.144 MB | Base | 0.00 % |
| HashSetUnsafe | 9.601 ms | 7.859 ms | 32,845,228 | 5.197 MB | Pass | -1.81 % |
--- Scale 1,000,000 -------------------------------------------------- Time 47.652 ---
| ListUnsafe | 57.751 ms | 44.734 ms | 201,477,028 | 29.309 MB | Pass | 22.14 % |
| Backs | 65.567 ms | 49.023 ms | 228,772,283 | 21.526 MB | Pass | 11.61 % |
| HasSet | 73.163 ms | 56.799 ms | 254,703,994 | 25.904 MB | Pass | 1.36 % |
| Todd | 74.175 ms | 53.739 ms | 258,760,390 | 9.144 MB | Base | 0.00 % |
| ArrayUnsafe | 86.530 ms | 67.803 ms | 300,374,535 | 13.755 MB | Pass | -16.66 % |
| HashSetUnsafe | 97.140 ms | 77.844 ms | 337,639,426 | 39.527 MB | Pass | -30.96 % |
--------------------------------------------------------------------------------------
Results: random list, using Distinct on the results where needed
--- Random Set 1 ---------------------------------------------------------------------
| Value | Average | Fastest | Cycles | Garbage | Test | Gain |
--- Scale 100 --------------------------------------------------------- Time 0.272 ---
| Backs | 0.007 ms | 0.006 ms | 28,449 | 8.000 KB | Pass | 72.96 % |
| HasSet | 0.010 ms | 0.009 ms | 38,222 | 8.000 KB | Pass | 62.05 % |
| HashSetUnsafe | 0.014 ms | 0.010 ms | 51,816 | 16.000 KB | Pass | 47.52 % |
| ListUnsafe | 0.017 ms | 0.014 ms | 64,333 | 16.000 KB | Pass | 33.84 % |
| ArrayUnsafe | 0.020 ms | 0.015 ms | 72,468 | 16.000 KB | Pass | 24.70 % |
| Todd | 0.026 ms | 0.021 ms | 95,500 | 24.000 KB | Base | 0.00 % |
--- Scale 1,000 ------------------------------------------------------- Time 0.361 ---
| Backs | 0.061 ms | 0.053 ms | 219,141 | 28.977 KB | Pass | 70.46 % |
| HasSet | 0.092 ms | 0.080 ms | 325,353 | 28.977 KB | Pass | 55.78 % |
| HashSetUnsafe | 0.093 ms | 0.079 ms | 331,390 | 55.797 KB | Pass | 55.03 % |
| ListUnsafe | 0.122 ms | 0.101 ms | 432,029 | 73.016 KB | Pass | 41.19 % |
| ArrayUnsafe | 0.133 ms | 0.113 ms | 469,560 | 73.016 KB | Pass | 35.88 % |
| Todd | 0.208 ms | 0.173 ms | 730,661 | 148.703 KB | Base | 0.00 % |
--- Scale 10,000 ------------------------------------------------------ Time 0.870 ---
| Backs | 0.620 ms | 0.579 ms | 2,174,415 | 180.188 KB | Pass | 55.31 % |
| HasSet | 0.696 ms | 0.635 ms | 2,440,300 | 198.063 KB | Pass | 49.87 % |
| HashSetUnsafe | 0.731 ms | 0.679 ms | 2,563,125 | 332.164 KB | Pass | 47.32 % |
| ListUnsafe | 0.804 ms | 0.761 ms | 2,818,293 | 400.492 KB | Pass | 42.11 % |
| ArrayUnsafe | 0.810 ms | 0.751 ms | 2,838,680 | 400.492 KB | Pass | 41.68 % |
| Todd | 1.388 ms | 1.271 ms | 4,863,651 | 736.953 KB | Base | 0.00 % |
--- Scale 100,000 ----------------------------------------------------- Time 6.616 ---
| Backs | 5.604 ms | 4.710 ms | 19,600,934 | 2.009 MB | Pass | 62.92 % |
| HasSet | 6.607 ms | 5.847 ms | 23,093,963 | 2.820 MB | Pass | 56.29 % |
| HashSetUnsafe | 8.565 ms | 7.465 ms | 29,239,067 | 5.197 MB | Pass | 43.34 % |
| ListUnsafe | 11.447 ms | 9.543 ms | 39,452,865 | 5.101 MB | Pass | 24.28 % |
| ArrayUnsafe | 11.517 ms | 9.841 ms | 39,731,502 | 5.483 MB | Pass | 23.81 % |
| Todd | 15.116 ms | 11.369 ms | 51,963,309 | 3.427 MB | Base | 0.00 % |
--- Scale 1,000,000 -------------------------------------------------- Time 55.310 ---
| Backs | 53.766 ms | 44.321 ms | 187,905,335 | 21.526 MB | Pass | 51.32 % |
| HasSet | 60.759 ms | 50.742 ms | 212,409,649 | 25.904 MB | Pass | 44.99 % |
| HashSetUnsafe | 79.248 ms | 67.130 ms | 275,455,545 | 39.527 MB | Pass | 28.25 % |
| ListUnsafe | 106.527 ms | 90.159 ms | 370,838,650 | 39.153 MB | Pass | 3.55 % |
| Todd | 110.444 ms | 93.225 ms | 384,636,081 | 22.676 MB | Base | 0.00 % |
| ArrayUnsafe | 114.548 ms | 98.033 ms | 398,219,513 | 38.974 MB | Pass | -3.72 % |
--------------------------------------------------------------------------------------
Data
private Tuple<List<int>, List<int>> GenerateData(int scale)
{
return new Tuple<List<int>, List<int>>(
Enumerable.Range(0, scale)
.Select(x => x)
.ToList(),
Enumerable.Range(0, scale)
.Select(x => Rand.Next(10000))
.ToList());
}
Code
public class Backs : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
protected override List<int> InternalRun()
{
var hashSet = new HashSet<int>(Input.Item1);
hashSet.ExceptWith(Input.Item2);
return hashSet.ToList();
}
}
public class HasSet : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
protected override List<int> InternalRun()
{
var hashSet = new HashSet<int>(Input.Item2);
return Input.Item1.Where(y => !hashSet.Contains(y)).ToList();
}
}
public class Todd : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
protected override List<int> InternalRun()
{
var referenceHashSet = Input.Item2.Distinct()
.ToDictionary(x => x, x => x);
return Input.Item1.Where(y => !referenceHashSet.TryGetValue(y, out _)).ToList();
}
}
public unsafe class HashSetUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
protected override List<int> InternalRun()
{
var reference = new HashSet<int>(Input.Item2);
var result = new HashSet<int>();
fixed (int* pAry = Input.Item1.ToArray())
{
var len = pAry+Input.Item1.Count;
for (var p = pAry; p < len; p++)
{
if(!reference.Contains(*p))
result.Add(*p);
}
}
return result.ToList();
}
}
public unsafe class ListUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
protected override List<int> InternalRun()
{
var reference = new HashSet<int>(Input.Item2);
var result = new List<int>(Input.Item2.Count);
fixed (int* pAry = Input.Item1.ToArray())
{
var len = pAry+Input.Item1.Count;
for (var p = pAry; p < len; p++)
{
if(!reference.Contains(*p))
result.Add(*p);
}
}
return result.ToList();
}
}
public unsafe class ArrayUnsafe : Benchmark<Tuple<List<int>, List<int>>, List<int>>
{
protected override List<int> InternalRun()
{
var reference = new HashSet<int>(Input.Item2);
var result = new int[Input.Item1.Count];
fixed (int* pAry = Input.Item1.ToArray(), pRes = result)
{
var j = 0;
var len = pAry+Input.Item1.Count;
for (var p = pAry; p < len; p++)
{
if(!reference.Contains(*p))
*(pRes+j++) = *p;
}
return result.Take(j).ToList();
}
}
}
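No real surprises here: if you start with a distinct list, some of the solutions do better; if not, the simplest HashSet version is the best.
Answer 2 (score: 0)
Single loop, dual index
As suggested by @PepitoSh in the comments on the question:
"I think HashSet is a very generic solution to a rather specific problem. If your lists are ordered, scanning them in parallel and comparing the current items is the fastest."
This is very different from having two nested loops. Instead there is a single general loop, and the two indexes are advanced in parallel according to the difference between the current values. That difference is essentially the output of any ordinary comparison function: {negative, 0, positive}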
static IEnumerable<int> deduplicationFunction4(List<int> Set, List<int> Reference)
{
int i1 = 0;
int i2 = 0;
int thisValue = Set[i1];
int thisReference = Reference[i2];
while (true)
{
var difference = thisReference - thisValue;
if (difference < 0)
{
i2++; //Compare side is too low, there might be an equal value to be found
if (i2 == Reference.Count)
break;
thisReference = Reference[i2];
continue;
}
if (difference > 0) //Reference is already past this value, so it is not a duplicate
yield return thisValue;
GoFurther:
i1++;
if (i1 == Set.Count)
break;
if (Set[i1] == thisValue) //Eliminates duplicates
goto GoFurther; //I rarely use goto statements, but this is a good situation
thisValue = Set[i1];
}
}
How to call this function if the lists are not yet sorted:
Set.Sort();
Reference.Sort();
Reference.Add(Set[Set.Count - 1] + 1); //Ensure the last set item is non-duplicate for an In-built stop clause. This is easy for int list items, just + 1 on the last item.
return deduplicationFunction4(Set, Reference);
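This gave me the best performance in my benchmarks. In some cases it could also be tried with unsafe code for more speed. When the data is already sorted, it is by far the best. A faster sorting algorithm could be chosen too, but that is outside the subject of this question.
Note: this method also removes duplicates within Set as it goes.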
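The same single-pass idea can also be written without the goto label and without appending a sentinel value to Reference. The sketch below is an alternative formulation, not the code that was benchmarked above; the names SortedDedup and ExceptSorted are hypothetical. It assumes both lists are already sorted in ascending order, and it compares values directly instead of subtracting them, which sidesteps integer overflow for extreme values.

```csharp
using System.Collections.Generic;

public static class SortedDedup
{
    // Assumption: set and reference are both sorted ascending.
    // Each index only ever moves forward, so the work is about
    // set.Count + reference.Count steps in total.
    public static IEnumerable<int> ExceptSorted(IReadOnlyList<int> set, IReadOnlyList<int> reference)
    {
        int r = 0;
        for (int s = 0; s < set.Count; s++)
        {
            int value = set[s];

            // Skip repeats inside set so each value is considered once.
            if (s > 0 && value == set[s - 1])
                continue;

            // Advance the reference index past everything smaller than value.
            while (r < reference.Count && reference[r] < value)
                r++;

            // Emit value unless the reference holds an equal value at this position.
            if (r == reference.Count || reference[r] != value)
                yield return value;
        }
    }
}
```

Because the end of Reference is checked explicitly, the remaining items of Set are still returned once Reference runs out, which is the case the appended sentinel handles in the version above.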
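I have actually coded this kind of single-loop pattern before, when finalizing text search results, except that I had N arrays to check for "closeness". So I had an array of indexes: array[index[i]]. I am therefore sure that a single loop with controlled index increments is not a new concept, but it is definitely a good solution here.
Answer 3 (score: -5)
HashSet and Where
You must use a HashSet (or Dictionary) for speed: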
//Returns an IEnumerable from which more can be chained or simply terminated with ToList by the caller
IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
//Create a hashset first, which is much more efficient for searching
var ReferenceHashSet = Reference
.Distinct() //Inserting duplicate keys in a dictionary will cause an exception
.ToDictionary(x => x, x => x); //If there was a ToHashSet function, that would be nicer
int throwAway;
return Set.Where(y => ReferenceHashSet.TryGetValue(y, out throwAway) == false); //Keep only the items that are NOT in the reference
}
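That is the lambda-expression version. It uses a Dictionary, which leaves room to store a different value per key if needed. A literal for loop could be used instead, perhaps for a further incremental performance gain, but compared with two nested loops this is already a remarkable improvement.
Having learned a few things from studying the other answers, here is a faster implementation: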
static IEnumerable<int> deduplicationFunction(List<int> Set, List<int> Reference)
{
//Create a hashset first, which is much more efficient for searching
var ReferenceHashSet = new HashSet<int>(Reference);
return Set.Where(y => ReferenceHashSet.Contains(y) == false).Distinct();
}
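Importantly, this approach (while slightly slower than @backs' answer) is still generic enough to use with database entities, and other types can easily be used for the field that is checked for duplicates.
Here is an example of how easily the code can be adapted to work with a list of database entities of type Person.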
static IEnumerable<Person> deduplicatePeople(List<Person> Set, List<Person> Reference)
{
//Create a hashset first, which is much more efficient for searching
var ReferenceHashSet = new HashSet<int>(Reference.Select(p => p.ID));
return Set.Where(y => ReferenceHashSet.Contains(y.ID) == false)
.GroupBy(p => p.ID).Select(p => p.First()); //The groupby and select should accomplish DistinctBy(..p.ID)
}
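The Person version generalizes to any element type once the duplicate-check field is passed in as a key selector. The following is a sketch under that assumption; KeyedDedup and ExceptByKey are hypothetical names introduced for illustration. It relies on HashSet<TKey>.Add returning false when the key is already present, so a single set both excludes the reference keys and suppresses repeats within Set.

```csharp
using System;
using System.Collections.Generic;

public static class KeyedDedup
{
    // Returns the items of set whose key does not occur in reference,
    // keeping only the first item seen for each key.
    public static IEnumerable<T> ExceptByKey<T, TKey>(
        IEnumerable<T> set,
        IEnumerable<T> reference,
        Func<T, TKey> keySelector)
    {
        // Seed the set with every key found in the reference list.
        var seen = new HashSet<TKey>();
        foreach (var item in reference)
            seen.Add(keySelector(item));

        foreach (var item in set)
        {
            // Add returns false if the key came from reference
            // or from an earlier item in set.
            if (seen.Add(keySelector(item)))
                yield return item;
        }
    }
}
```

With the Person type from the answer, a call might look like KeyedDedup.ExceptByKey(Set, Reference, p => p.ID). On .NET 6 and later, the built-in Enumerable.ExceptBy covers a similar need, although it takes the reference side as a sequence of keys rather than of items.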