Performance

Question

When I combine two hash sets, what is the difference between HashSet.Union and HashSet.UnionWith?

I am trying to combine them like this:

HashSet<EngineType> enginesSupportAll = _filePolicyEvaluation.EnginesSupportAll;
enginesSupportAll = enginesSupportAll != null ? new HashSet<EngineType>(engines.Union(enginesSupportAll)) : enginesSupportAll;

What is the best approach for this example, and why?

Answer 1

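Well, it's not HashSet.Union but Enumerable.Union, so you are using a LINQ extension method that works with any kind of IEnumerable<>, whereas HashSet.UnionWith is a real HashSet method that modifies the current instance.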

Union returns an IEnumerable<TSource>
UnionWith is void; it modifies the current HashSet instance
UnionWith may be slightly more efficient because it can be optimized

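If you don't need to support arbitrary sequences in your method (the HashSet type is fixed) and you are allowed to modify the set, use UnionWith; otherwise use the LINQ extension. If you create the HashSet instance just for this purpose it doesn't really matter, and I would prefer LINQ because it is more flexible and lets me chain my query.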
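The difference between the two calls can be seen in a few lines. This is a minimal sketch, assuming int elements in place of the asker's EngineType; the variable names `engines`, `extra` and `lazyUnion` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var engines = new HashSet<int> { 1, 2, 3 };
var extra = new HashSet<int> { 3, 4 };

// Enumerable.Union: returns a lazy IEnumerable<int>; neither set is modified.
IEnumerable<int> lazyUnion = engines.Union(extra);
Console.WriteLine(engines.Count);            // 3

// Deferred execution: the query sees changes made before it is enumerated.
engines.Add(99);
Console.WriteLine(lazyUnion.Contains(99));   // True

// HashSet<T>.UnionWith: returns void and mutates the receiver in place.
engines.UnionWith(extra);
Console.WriteLine(engines.Count);            // 5 (1, 2, 3, 99, 4)
```

Because the LINQ result is lazy, wrap it in new HashSet<T>(...) or call ToList() if a snapshot is needed.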

Answer 2

Given HashSet<T> A and HashSet<T> B, there are four ways to compute A ∪ B:

1. new HashSet<T>(A.Union(B))
(see HashSet<T>(IEnumerable<T>) and
Enumerable.Union<T>(IEnumerable<T>, IEnumerable<T>))

2. A.UnionWith(B)

3. HashSet<T> C = new HashSet<T>(A); C.UnionWith(B);

4. new HashSet<T>(A.Concat(B))
(see Enumerable.Concat<T>(IEnumerable<T>, IEnumerable<T>))
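As a quick sanity check, the four options can be run side by side. This is only an illustrative sketch with int sets; the variable names (`viaUnion`, `viaConcat`, `viaCopy`, `aUntouched`) are not from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var a = new HashSet<int> { 1, 2, 3 };
var b = new HashSet<int> { 3, 4, 5 };

// Option 1: an expression; A and B stay untouched.
var viaUnion = new HashSet<int>(a.Union(b));

// Option 4: an expression; Concat keeps duplicates, the constructor drops them.
var viaConcat = new HashSet<int>(a.Concat(b));

// Option 3: statements; copy A first, then merge B into the copy.
var viaCopy = new HashSet<int>(a);
viaCopy.UnionWith(b);

bool aUntouched = a.Count == 3;   // A has not been changed so far

// Option 2: a statement; merges B into A itself.
a.UnionWith(b);

bool allEqual = viaUnion.SetEquals(viaConcat)
             && viaConcat.SetEquals(viaCopy)
             && viaCopy.SetEquals(a);
Console.WriteLine(allEqual);                    // True
Console.WriteLine($"{aUntouched} {a.Count}");   // True 5
```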

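Each has its advantages and disadvantages:

1 and 4 are expressions that result in a HashSet, whereas 2 and 3 are statements or statement blocks. Expressions 1 and 4 can be used in more places than 2 and 3. For example, using 2 or 3 in a LINQ query-syntax expression is cumbersome:
from x in setofSetsA as IEnumerable<HashSet<T>> from y in setOfSetsB as IEnumerable<HashSet<T>> select x.UnionWith(y)
will not work, since UnionWith returns void.
1, 3 and 4 preserve A and B as they are and return a fresh set, whereas 2 modifies A.
There are situations where modifying one of the original sets is bad, and there are situations where at least one of the original sets can be modified without negative consequences.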
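For comparison, here is a sketch of the same kind of query written with option 1, which works because it is an expression. The collections `setsA` and `setsB` and their contents are hypothetical stand-ins for the setofSetsA and setOfSetsB mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var setsA = new List<HashSet<int>>
{
    new HashSet<int> { 1, 2 },
    new HashSet<int> { 3 },
};
var setsB = new List<HashSet<int>>
{
    new HashSet<int> { 2, 9 },
};

// "select x.UnionWith(y)" would not compile here because UnionWith returns void.
// Option 1 is an expression, so it can be used directly in the select clause.
List<HashSet<int>> unions =
    (from x in setsA
     from y in setsB
     select new HashSet<int>(x.Union(y))).ToList();

foreach (var u in unions)
    Console.WriteLine(string.Join(", ", u.OrderBy(n => n)));
```

The input sets are left unchanged, which is usually what a query is expected to do.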

Computational cost:

A.UnionWith(B)
(≈ O((log(|A∪B|) - log(|A|)) * |A∪B|) + O(|B|))

≤

HashSet<T> C = new HashSet<T>(A); C.UnionWith(B);
(≈ O((log(|A∪B|) - log(|A|)) * |A∪B|) + O(|A| + |B|))

≤

new HashSet<T>(A.Concat(B))
(≈ O(log(|A∪B|) * |A∪B|) + O(|A| + |B|))

≤

new HashSet<T>(A.Union(B))
(≈ 2 * O(log(|A∪B|) * |A∪B|) + O(|A| + |B| + |A∪B|))

The next section delves into the reference source to show the basis for these performance estimates.

Performance

HashSet<T>

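In union options 1, 3 and 4 the constructor HashSet<T>(IEnumerable<T>, IEqualityComparer<T>) is used to create a HashSet<T> from an IEnumerable<T>. If the passed IEnumerable<T> has a Count property (i.e. if it is an ICollection<T>), this property is used to set the size of the new HashSet: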

int suggestedCapacity = 0;
ICollection<T> coll = collection as ICollection<T>;
if (coll != null) {
    suggestedCapacity = coll.Count;
}
Initialize(suggestedCapacity);

- HashSet.cs line 136–141

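The Count() method is never called. So if the IEnumerable's count can be retrieved without much effort, it is used to reserve capacity; otherwise the HashSet grows and reallocates as new elements are added. In option 1, A.Union(B), and in option 4, A.Concat(B), are not ICollection<T>, so the created HashSet will grow and reallocate a few (≈ log(|A∪B|)) times. Option 3 can use the Count of A.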
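The capacity check can be mimicked outside the constructor. The helper `SuggestedCapacity` below is a hypothetical stand-in that copies the logic of the excerpt above; it is not part of the framework. The sketch assumes that the LINQ iterators returned by Union and Concat do not implement ICollection<T>, which is what the answer states.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper mirroring the constructor's check:
// Count is used only when the source is an ICollection<T>.
static int SuggestedCapacity<T>(IEnumerable<T> source) =>
    source is ICollection<T> coll ? coll.Count : 0;

var a = new HashSet<int>(Enumerable.Range(0, 1000));
var b = new HashSet<int>(Enumerable.Range(500, 1000));

int fromSet = SuggestedCapacity(a);               // option 3: size known up front
int fromUnion = SuggestedCapacity(a.Union(b));    // option 1: no size hint
int fromConcat = SuggestedCapacity(a.Concat(b));  // option 4: no size hint

Console.WriteLine($"{fromSet} {fromUnion} {fromConcat}");
```

With a hint of 0 the new set has to start small and grow while it is being filled.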

The constructor calls UnionWith to fill the new, empty HashSet:

this.UnionWith(collection);

- HashSet.cs line 143

UnionWith(IEnumerable<T>) iterates over the elements of the IEnumerable<T> passed as argument and calls AddIfNotPresent(T) for each of them.

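AddIfNotPresent(T) inserts elements and ensures that duplicates are never inserted into the set. HashSet<T> is implemented as an array of slots, m_slots, and an array of buckets, m_buckets. A bucket holds just an int index into the m_slots array. Per bucket, the Slots in m_slots form a linked list, each Slot holding the index of the next Slot in m_slots.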

AddIfNotPresent(T) jumps to the correct bucket and then traverses its linked list to check whether the element is already present:

for (int i = m_buckets[hashCode % m_buckets.Length] - 1; i >= 0; i = m_slots[i].next) {
    if (m_slots[i].hashCode == hashCode && m_comparer.Equals(m_slots[i].value, value)) {
        return false;
    }
}

- HashSet.cs line 968–975

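Next, a free index is found and a slot is reserved. First the list of free slots, m_freeList, is checked. When there are no slots in the free list, the next empty slot in the m_slots array is used. If there are no slots in the free list and no empty slots, more capacity is reserved (via IncreaseCapacity()):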

int index;
if (m_freeList >= 0) {
    index = m_freeList;
    m_freeList = m_slots[index].next;
}
else {
    if (m_lastIndex == m_slots.Length) {
        IncreaseCapacity();
        // this will change during resize
        bucket = hashCode % m_buckets.Length;
    }
    index = m_lastIndex;
    m_lastIndex++;
}

- HashSet.cs line 977–990

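AddIfNotPresent(T) has three operations that demand some computation: the call to object.GetHashCode(), the call to object.Equals(object) when there is a collision, and IncreaseCapacity(). Actually adding an element only incurs the cost of setting some pointers and some ints.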
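The slot and bucket layout described above can be modeled in a few lines. The following is a toy sketch, not the real HashSet<T>: it stores only ints, keeps the slot fields in parallel arrays, has no removal (so no free list), and grows to 2n + 1 instead of a prime. All names are chosen for illustration.

```csharp
using System;

// Toy model: buckets hold 1-based slot indices (0 means "empty");
// the slots of one bucket are chained through 'next'.
int[] buckets = new int[3];
int[] hashCodes = new int[3];
int[] values = new int[3];
int[] next = new int[3];
int lastIndex = 0;
int resizes = 0;

bool AddIfNotPresent(int value)
{
    int hashCode = value.GetHashCode() & 0x7FFFFFFF;   // force non-negative
    int bucket = hashCode % buckets.Length;

    // Walk the chain of this bucket; stop if the value is already stored.
    for (int i = buckets[bucket] - 1; i >= 0; i = next[i])
    {
        if (hashCodes[i] == hashCode && values[i] == value) return false;
    }

    if (lastIndex == values.Length)
    {
        Grow();
        bucket = hashCode % buckets.Length;            // bucket changes after a resize
    }

    int index = lastIndex++;
    hashCodes[index] = hashCode;
    values[index] = value;
    next[index] = buckets[bucket] - 1;                 // link in front of the chain
    buckets[bucket] = index + 1;
    return true;
}

void Grow()
{
    int newSize = values.Length * 2 + 1;               // the real code picks a prime
    Array.Resize(ref hashCodes, newSize);
    Array.Resize(ref values, newSize);
    Array.Resize(ref next, newSize);
    buckets = new int[newSize];
    for (int i = 0; i < lastIndex; i++)                // rebuild every chain
    {
        int b = hashCodes[i] % newSize;
        next[i] = buckets[b] - 1;
        buckets[b] = i + 1;
    }
    resizes++;
}

foreach (int v in new[] { 5, 8, 5, 11, 14, 8 })
    Console.WriteLine($"{v}: {AddIfNotPresent(v)}");
Console.WriteLine($"stored={lastIndex}, resizes={resizes}");   // stored=4, resizes=1
```

In this run 5, 8 and 11 all land in bucket 2 of the initial three buckets, so the second 5 is found by walking the chain; adding 14 triggers the single resize.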

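When the HashSet<T> needs to IncreaseCapacity(), the capacity is at least doubled. We can therefore conclude that, on average, a HashSet<T> is 75% full. If the hashes are evenly distributed, the expected chance of a hash collision is also 75%.
SetCapacity(int, bool), called by IncreaseCapacity(), is the most expensive operation: it allocates new arrays, copies the old slot array to the new array, and recomputes the bucket lists: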

Slot[] newSlots = new Slot[newSize];
if (m_slots != null) {
    Array.Copy(m_slots, 0, newSlots, 0, m_lastIndex);
}
...
int[] newBuckets = new int[newSize];
for (int i = 0; i < m_lastIndex; i++) {
    int bucket = newSlots[i].hashCode % newSize;
    newSlots[i].next = newBuckets[bucket] - 1;
    newBuckets[bucket] = i + 1;
}
m_slots = newSlots;
m_buckets = newBuckets;

- HashSet.cs line 929–949

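Options 1 and 4 (new HashSet<T>(A.Union(B)) and new HashSet<T>(A.Concat(B))) result in calling IncreaseCapacity() slightly more often. The cost, not counting the cost of A.Union(B) or A.Concat(B) itself, is approximately O(log(|A∪B|) * |A∪B|).
In contrast, when using option 2 (A.UnionWith(B)) or option 3 (HashSet<T> C = new HashSet<T>(A); C.UnionWith(B)), we get a 'discount' of log(|A|) on the cost: O((log(|A∪B|) - log(|A|)) * |A∪B|). It pays (slightly) to use the largest set as the target into which the other is merged.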
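The 'discount' can be made concrete by counting resizes in a simplified model. The sketch below assumes plain doubling (the real code rounds up to a prime) and counts only the number of capacity increases, not their cost; `CountResizes` and the chosen sizes are hypothetical.

```csharp
using System;

// Counts how often a doubling table must grow to hold 'target' elements
// when it starts with room for 'initialCapacity' elements.
static int CountResizes(int initialCapacity, int target)
{
    int capacity = Math.Max(initialCapacity, 1);
    int resizes = 0;
    while (capacity < target)
    {
        capacity *= 2;   // simplification: the real code rounds up to a prime
        resizes++;
    }
    return resizes;
}

int sizeA = 1_000_000;
int sizeUnion = 1_500_000;

int fromEmpty = CountResizes(0, sizeUnion);      // options 1 and 4 start empty
int fromA = CountResizes(sizeA, sizeUnion);      // options 2 and 3 start at |A|

Console.WriteLine($"{fromEmpty} {fromA}");       // 21 1
```

The difference between the two counts is roughly log2(|A|), which is the discount mentioned above.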

Enumerable<T>.Union(IEnumerable<T>)

Enumerable<T>.Union(IEnumerable<T>) is implemented via UnionIterator<T>(IEnumerable<T>, IEnumerable<T>, IEqualityComparer<T>). The UnionIterator uses Set<T>, an internal class in Enumerable.cs that is very similar to HashSet<T>. UnionIterator lazily Add(T)s items from A and B to this Set<T> and yields the elements if they could be added. The work is done in Find(T, bool), which is similar to HashSet<T>.AddIfNotPresent(T). Check whether the element is already present:

int hashCode = InternalGetHashCode(value);
for (int i = buckets[hashCode % buckets.Length] - 1; i >= 0; i = slots[i].next) {
    if (slots[i].hashCode == hashCode && comparer.Equals(slots[i].value, value)) return true;
}

- Enumerable.cs line 2423–2426

Find a free index and reserve a slot:

int index;
if (freeList >= 0) {
    index = freeList;
    freeList = slots[index].next;
}
else {
    if (count == slots.Length) Resize();
    index = count;
    count++;
}
int bucket = hashCode % buckets.Length;
slots[index].hashCode = hashCode;
slots[index].value = value;
slots[index].next = buckets[bucket] - 1;
buckets[bucket] = index + 1;

- Enumerable.cs line 2428–2442

Resize() is similar to IncreaseCapacity(). The biggest difference between the two is that Resize() does not use a prime number for the number of buckets, so with a bad GetHashCode() there is a slightly higher chance of collisions. The code of Resize():

int newSize = checked(count * 2 + 1);
int[] newBuckets = new int[newSize];
Slot[] newSlots = new Slot[newSize];
Array.Copy(slots, 0, newSlots, 0, count);
for (int i = 0; i < count; i++) {
    int bucket = newSlots[i].hashCode % newSize;
    newSlots[i].next = newBuckets[bucket] - 1;
    newBuckets[bucket] = i + 1;
}
buckets = newBuckets;
slots = newSlots;

The performance cost of A.Union(B) does not differ significantly from that of HashSet<T> C = new HashSet<T>(); C.UnionWith(A); C.UnionWith(B);. In option 1 (new HashSet<T>(A.Union(B))) the same HashSet is created twice, resulting in a very costly 2 * O(log(|A∪B|) * |A∪B|). Option 4 results from knowing how HashSet<T>(IEnumerable<T>) and Enumerable.Union(IEnumerable<T>, IEnumerable<T>) are implemented. It avoids the redundant A.Union(B), leading to a cost of O(log(|A∪B|) * |A∪B|).
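The behavior described for UnionIterator can be approximated with an ordinary iterator. This is a sketch of the idea only, not the framework's code: `LazyUnion` is a hypothetical name, and a public HashSet<T> stands in for the internal Set<T>.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Yields each element of 'first' and then 'second' the first time it is seen.
static IEnumerable<T> LazyUnion<T>(IEnumerable<T> first, IEnumerable<T> second)
{
    var seen = new HashSet<T>();   // stands in for the internal Set<T>
    foreach (T item in first)
        if (seen.Add(item)) yield return item;
    foreach (T item in second)
        if (seen.Add(item)) yield return item;
}

var a = new HashSet<int> { 1, 2, 3 };
var b = new HashSet<int> { 3, 4 };

// Option 1 builds two hash tables: 'seen' inside the iterator and the result set.
var viaLazyUnion = new HashSet<int>(LazyUnion(a, b));

// Option 4 builds only the result set; duplicates are dropped by the constructor.
var viaConcat = new HashSet<int>(a.Concat(b));

Console.WriteLine(viaLazyUnion.SetEquals(viaConcat));   // True
```

Written this way, it is visible that every element is hashed twice in option 1 and once in option 4.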

Union vs UnionWith in HashSet
