Question

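I have a Dictionary whose values are HashSets. I also have an int[] of keys, and I want to get the Count of the values that are common to all the HashSets selected by those keys.

Here is a piece of code that works, but very inefficiently, because it has to create a HashSet and modify it in memory before the final Count.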

        Dictionary<int, HashSet<int>> d = new Dictionary<int, HashSet<int>>();

        HashSet<int> s1 = new HashSet<int>() { 3, 4, 5, 6, 7, 8, 9 };
        HashSet<int> s2 = new HashSet<int>() { 1, 2, 3, 4, 5, 8 };
        HashSet<int> s3 = new HashSet<int>() { 1, 3, 5, 10, 15, 20 };
        HashSet<int> s4 = new HashSet<int>() { 1, 20 };

        d.Add(10, s1);
        d.Add(15, s2);
        d.Add(20, s3);
        d.Add(25, s4);

        // List of keys from which I need the intersection of the HashSet's
        int[] l = new int[3] { 10, 15, 20 };

        // Get an IEnumerable of the selected Dictionary entries (keys 10, 15, 20 select s1, s2 and s3)
        var hashlist = d.Where(x => l.Contains(x.Key));

        // Create a new HashSet to contain the intersection of all the HashSet's
        HashSet<int> first = new HashSet<int>(hashlist.First().Value);
        foreach (var hash in hashlist.Skip(1))
            first.IntersectWith(hash.Value);

        // Show the number of common int's
        Console.WriteLine("Common elements: {0}", first.Count);

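What I'm looking for is an efficient way (maybe LINQ?) to count the common elements without having to create a new HashSet, since I run similar code hundreds of millions of times.

It is also important to note that I create a new HashSet to hold the intersection because I don't want to modify the original HashSets.

Best regards, George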

Answer 1

This can definitely be improved:

var hashlist = d.Where(x => l.Contains(x.Key));

Rewrite it as:

var hashlist = l.Select(x => d[x]);

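This uses the Dictionary's internal hash lookup to fetch the value for each key efficiently, rather than scanning the int[] again for every dictionary entry. Note that hashlist now yields the HashSets themselves rather than key/value pairs, so .Value is no longer needed.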
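One caveat with the rewrite: the indexer d[x] throws a KeyNotFoundException when a key is missing, whereas the Where(…) version silently skipped such keys. Below is a minimal sketch of a lookup that keeps the skipping behavior, assuming that is what you want. SelectExisting is a hypothetical helper name, not something from the question.

```csharp
using System.Collections.Generic;

public static class KeyLookup
{
    // Hypothetical helper: returns the sets for the keys that exist in d,
    // in the order of the keys array. Missing keys are skipped, not thrown on.
    public static List<HashSet<int>> SelectExisting(
        Dictionary<int, HashSet<int>> d, int[] keys)
    {
        var result = new List<HashSet<int>>(keys.Length);
        foreach (var key in keys)
        {
            // TryGetValue does a single hash lookup per key
            if (d.TryGetValue(key, out var set))
                result.Add(set);
        }
        return result;
    }
}
```

The returned list holds references to the original sets, so copy the first one before calling IntersectWith on it.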

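Your next big problem is that LINQ is lazy, so by calling First() and Skip(1) separately you actually enumerate the collection more than once, re-running the Where(…) filter mentioned above each time.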
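To see the repeated enumeration for yourself, here is a small sketch that counts how often the filter predicate runs. The class and method names are made up for illustration, and the data is a trimmed-down version of the question's dictionary.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LazyEnumerationDemo
{
    // Returns how many times the Where predicate ran while the query was
    // consumed with First() and then Skip(1), as in the question's code.
    public static int CountPredicateCalls(bool materialize)
    {
        var d = new Dictionary<int, HashSet<int>>
        {
            { 10, new HashSet<int> { 3, 4, 5 } },
            { 15, new HashSet<int> { 1, 2, 3 } },
            { 20, new HashSet<int> { 1, 3, 5 } },
            { 25, new HashSet<int> { 1, 20 } },
        };
        int[] l = { 10, 15, 20 };

        int calls = 0;
        IEnumerable<KeyValuePair<int, HashSet<int>>> query =
            d.Where(x => { calls++; return l.Contains(x.Key); });

        if (materialize)
            query = query.ToList(); // runs the filter once, up front

        // First() and Skip(1) each start a fresh enumeration of query
        var first = new HashSet<int>(query.First().Value);
        foreach (var pair in query.Skip(1))
            first.IntersectWith(pair.Value);

        return calls;
    }
}
```

With materialize set to true the predicate runs once per dictionary entry. With false it runs more often, because First() and Skip(1) each restart the filter.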

To avoid the multiple enumeration, you can rewrite:

HashSet<int> first = new HashSet<int>(hashlist.First().Value);
foreach (var hash in hashlist.Skip(1))
     first.IntersectWith(hash.Value);

as:

var intersection = hashlist.Aggregate(
    (HashSet<int>)null, 
    (h, j) => 
    {
        if (h == null)
            h = new HashSet<int>(j);
        else 
            h.IntersectWith(j);
        return h; 
    });

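However, depending on your exact use case, it may be faster (and easier to understand) to simply materialize the results as a List first and then use a plain for loop: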

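var hashlist = l.Select(x => d[x]).ToList();

// Copy the first set so the originals stay untouched, then intersect with the rest
HashSet<int> first = new HashSet<int>(hashlist[0]);
for (var i = 1; i < hashlist.Count; i++)
     first.IntersectWith(hashlist[i]);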

Here is a quick benchmark of these different options (your results may vary):

Original        2.285680 (ms)
SelectHashList  1.912829 
Aggregate       1.815872 
ToListForLoop   1.608565 
OrderEnumerator 1.975067 // Scott Chamberlain's answer
EnumeratorOnly  1.732784 // Scott Chamberlain's answer without the call to OrderBy()
AggIntersect    2.046930 // P. Kouvarakis's answer (with compiler error fixed)
JustCount       1.260448 // Ivan Stoev's updated answer

Answer 2

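"What I'm looking for is an efficient way (maybe LINQ?) to count the common elements"

If you really want the best performance, forget LINQ. Here is an old-school way that applies every optimization I can think of: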

// Collect the non empty matching sets, keeping the set with the min Count at position 0
var sets = new HashSet<int>[l.Length];
int setCount = 0;
foreach (var key in l)
{
    HashSet<int> set;
    if (!d.TryGetValue(key, out set) || set.Count == 0) continue;
    if (setCount == 0 || sets[0].Count <= set.Count)
        sets[setCount++] = set;
    else
    {
        sets[setCount++] = sets[0];
        sets[0] = set;
    }
}
int commonCount = 0;
if (setCount > 0)
{
    if (setCount == 1)
        commonCount = sets[0].Count;
    else
    {
        foreach (var item in sets[0])
        {
            bool isCommon = true;
            for (int i = 1; i < setCount; i++)
                if (!sets[i].Contains(item)) { isCommon = false; break; }
            if (isCommon) commonCount++;
        }
    }
}
Console.WriteLine("Common elements: {0}", commonCount);

Hopefully the code is self-explanatory.

Answer 3

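There are a few tricks that can buy you a lot of speed. The biggest one I see is to start with the smallest set and work your way up to the largest. That gives the initial set the fewest possible items to intersect against, which makes the lookups faster.

Also, if you drive the enumerator manually instead of using foreach, you don't need to enumerate the list twice. (Edit: this also uses the trick p.s.w.g mentioned of selecting against the dictionary instead of using .Contains().)

Important note: this approach only pays off if you are combining a large number of HashSets with widely varying item counts. Calling OrderBy has significant overhead, and on a small data set like the one in your example you are unlikely to see any benefit.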

Dictionary<int, HashSet<int>> d = new Dictionary<int, HashSet<int>>();

HashSet<int> s1 = new HashSet<int>() { 3, 4, 5, 6, 7, 8, 9 };
HashSet<int> s2 = new HashSet<int>() { 1, 2, 3, 4, 5, 8 };
HashSet<int> s3 = new HashSet<int>() { 1, 3, 5, 10, 15, 20 };
HashSet<int> s4 = new HashSet<int>() { 1, 20 };

d.Add(10, s1);
d.Add(15, s2);
d.Add(20, s3);
d.Add(25, s4);

// List of keys from which I need the intersection of the HashSet's
int[] l = new int[3] { 10, 15, 20 };

HashSet<int> combined;
//Sort in increasing order by count
//Also used the trick from p.s.w.g's answer to get a better select.
IEnumerable<HashSet<int>> sortedList = l.Select(x => d[x]).OrderBy(x => x.Count);

using (var enumerator = sortedList.GetEnumerator())
{
    if (enumerator.MoveNext())
    {
        combined = new HashSet<int>(enumerator.Current);
    }
    else
    {
        combined = new HashSet<int>();
    }

    while (enumerator.MoveNext())
    {
        combined.IntersectWith(enumerator.Current);
    }
}


// Show the number of common int's
Console.WriteLine("Common elements: {0}", combined.Count);

Answer 4

`IntersectWith()` is probably about as efficient as you can get.

Using LINQ could make the code cleaner(?):

var result = l.Aggregate((IEnumerable<int>)null, (acc, key) => acc == null ? d[key] : acc.Intersect(d[key]));
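Since the question only asks for the count, here is a hedged sketch of how the LINQ approach could be wrapped to return one. CountCommon is a hypothetical name, and the seed is typed explicitly as IEnumerable<int> so the lambda's type can be inferred. Enumerable.Intersect never modifies its inputs, so the original sets stay as they are, but it does build its own internal set on each enumeration, so don't expect it to beat IntersectWith().

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LinqIntersection
{
    // Hypothetical helper: counts the values common to all sets selected by keys.
    // Returns 0 when keys is empty; throws if a key is missing from d.
    public static int CountCommon(Dictionary<int, HashSet<int>> d, int[] keys)
    {
        IEnumerable<int> common = keys.Aggregate(
            (IEnumerable<int>)null,
            (acc, key) => acc == null ? d[key] : acc.Intersect(d[key]));

        // Intersect is lazy; Count() is what actually runs the chain
        return common == null ? 0 : common.Count();
    }
}
```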
