Question

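I need to store a large number of key/value pairs in which the key is not unique. Both keys and values are strings. The number of items is about 5 million.

My goal is to keep only unique pairs.

I tried List<KeyValuePair<string, string>>, but Contains() is very slow. LINQ Any() seems a bit faster, but it is still too slow.

Are there alternatives that search a generic list faster? Or should I use a different kind of storage?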

Answer 1

I would use a Dictionary<string, HashSet<string>> that maps each key to all of its values.

Here is a full solution. First, write a couple of extension methods: one that adds a (key, value) pair to the Dictionary, and another that returns all (key, value) pairs. Note that I use arbitrary types for keys and values; you can substitute string without any problem. You can also write these methods somewhere else instead of as extensions, or not use methods at all and just put this code somewhere in your program.

using System;
using System.Collections.Generic;
using System.Linq;

public static class Program
{
  public static void Add<TKey, TValue>(
    this Dictionary<TKey, HashSet<TValue>> data, TKey key, TValue value)
  {
    HashSet<TValue> values = null;
    if (!data.TryGetValue(key, out values))
    {
      // first time using this key? create a new HashSet
      values = new HashSet<TValue>();
      data.Add(key, values);
    }
    values.Add(value);
  }

  public static IEnumerable<KeyValuePair<TKey, TValue>> KeyValuePairs<TKey, TValue>(
    this Dictionary<TKey, HashSet<TValue>> data)
  {
    // flatten each key's set of values back into (key, value) pairs
    return data.SelectMany(k => k.Value, (k, v) => new KeyValuePair<TKey, TValue>(k.Key, v));
  }
}

Now you can use it as follows:

public static void Main(string[] args)
{
  Dictionary<string, HashSet<string>> data = new Dictionary<string, HashSet<string>>();
  data.Add("k1", "v1.1");
  data.Add("k1", "v1.2");
  data.Add("k1", "v1.1"); // already in, so nothing happens here
  data.Add("k2", "v2.1");

  foreach (var kv in data.KeyValuePairs())
     Console.WriteLine(kv.Key + " : " + kv.Value);
}

This will print:

k1 : v1.1
k1 : v1.2
k2 : v2.1

If your keys mapped to a List<string>, you would need to take care of duplicates yourself. HashSet<string> already does that for you.

Answer 2

I think a Dictionary<string, List<string>> would do the trick.
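
As a minimal sketch of what this approach involves (the class and method names below are illustrative and not part of the answer): a List<string> does not reject duplicates on its own, so every insert needs an explicit Contains check, which is a linear scan over the values already stored for that key.

```csharp
using System;
using System.Collections.Generic;

public static class ListPerKeyDemo
{
    // Hypothetical helper: stores the pair only if it is not present yet.
    // Returns true when the pair was added, false when it was a duplicate.
    public static bool AddUnique(Dictionary<string, List<string>> data, string key, string value)
    {
        List<string> values;
        if (!data.TryGetValue(key, out values))
        {
            // First time this key is seen: create its list.
            values = new List<string>();
            data.Add(key, values);
        }

        // Linear scan, but only over the values of this one key.
        if (values.Contains(value))
            return false;

        values.Add(value);
        return true;
    }

    public static void Main()
    {
        var data = new Dictionary<string, List<string>>();
        Console.WriteLine(AddUnique(data, "k1", "v1.1")); // True
        Console.WriteLine(AddUnique(data, "k1", "v1.1")); // False (duplicate)
        Console.WriteLine(data["k1"].Count);              // 1
    }
}
```

This is likely fine when each key has few values. If a single key can have many values, the per-key scan becomes the bottleneck again, and a HashSet<string> per key avoids it.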

Answer 3

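I would consider using an in-process NoSQL database such as RavenDB (RavenDB Embedded in this case). From their website:

RavenDB can be used for applications that need to store millions of records and have fast query times.

Using it does not require much boilerplate (example from the RavenDB website):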

var myCompany = new Company
                {
                    Name = "Hibernating Rhinos",
                    Employees = {
                                   new Employee
                                   {
                                       Name = "Ayende Rahien"
                                   }
                                 },
                    Country = "Israel"
                };

// Store the company in our RavenDB server
using (var session = documentStore.OpenSession())
{
    session.Store(myCompany);
    session.SaveChanges();
}

// Create a new session, retrieve an entity, and change it a bit
using (var session = documentStore.OpenSession())
{
    Company entity = session.Query<Company>()
        .Where(x => x.Country == "Israel")
        .FirstOrDefault();

    // We can also load by ID: session.Load<Company>(companyId);
    entity.Name = "Another Company";
    session.SaveChanges(); // will send the change to the database
}

Answer 4

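You will most likely see an improvement if you use a HashSet<KeyValuePair<string, string>>.

The test at the end of this answer finishes in about 10 seconds on my machine. If I change...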

var collection = new HashSet<KeyValuePair<string, string>>();

...to...

var collection = new List<KeyValuePair<string, string>>();

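...I get bored waiting for it to complete (more than a few minutes).

The advantage of using KeyValuePair<string, string> is that equality is determined by the values of Key and Value. KeyValuePair<TKey, TValue> is a struct, so its default equality compares its fields, and string compares by content, so the runtime treats pairs with the same Key and Value as equal.

You can see that equality with this test: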

    var hs = new HashSet<KeyValuePair<string, string>>();
    hs.Add(new KeyValuePair<string, string>("key", "value"));
    var b = hs.Contains(new KeyValuePair<string, string>("key", "value"));
    Console.WriteLine(b);

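However, it is important to remember that the equality of the pairs depends on the content of the strings. string.Equals compares characters rather than references, so strings that are not interned (because they come from a file, for example) still compare equal. The comparison is ordinal and case-sensitive, so "Key" and "key" are different unless you supply a custom comparer.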

using System;
using System.Collections.Generic;
using System.Diagnostics;

namespace ConsoleApplication1 {

    internal class Program {

        static void Main(string[] args) {

            var key = default(string);
            var value = default(string);

            var collection = new HashSet<KeyValuePair<string, string>>();

            for (var i = 0; i < 5000000; i++) {

                if (key == null || i % 2 == 0) {
                    key = "k" + i;
                }
                value = "v" + i;

                collection.Add(new KeyValuePair<string, string>(key, value));
            }

            var found = 0;

            var sw = new Stopwatch();
            sw.Start();
            for (var i = 0; i < 5000000; i++) {

                if (collection.Contains(new KeyValuePair<string, string>("k" + i, "v" + i))) {
                    found++;
                }
            }
            sw.Stop();

            Console.WriteLine("Found " + found);
            Console.WriteLine(sw.Elapsed);
            Console.ReadLine();
        }
    }
}
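
A small sketch that can be used to check how pair equality behaves for strings created at run time rather than taken from literals. The class and method names are illustrative; the sketch relies only on string.Equals and string.GetHashCode working on character content.

```csharp
using System;
using System.Collections.Generic;

public static class PairEqualityDemo
{
    // Returns true if a pair built from run-time strings is found in a set
    // that was filled with a pair built from string literals.
    public static bool FindsRuntimeBuiltPair()
    {
        var set = new HashSet<KeyValuePair<string, string>>();
        set.Add(new KeyValuePair<string, string>("key1", "value1"));

        // Concatenating with a variable produces new string objects at run time;
        // they are not the same references as the literals above.
        int n = 1;
        string key = "key" + n;
        string value = "value" + n;

        return set.Contains(new KeyValuePair<string, string>(key, value));
    }

    public static void Main()
    {
        Console.WriteLine(FindsRuntimeBuiltPair()); // True
    }
}
```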

Answer 5

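Have you tried using a HashSet?

It is much faster than a list when large amounts of data are involved, although I don't know whether it will still be too slow for you.

There is a lot of information in this answer: HashSet vs. List performance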
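
A minimal sketch of deduplicating while loading, assuming the pairs arrive as some IEnumerable (the class and method names are illustrative). HashSet<T>.Add returns false and leaves the set unchanged when the item is already present, so no separate Contains call is needed.

```csharp
using System;
using System.Collections.Generic;

public static class HashSetDedupDemo
{
    // Collects the distinct pairs from the input sequence.
    public static HashSet<KeyValuePair<string, string>> Deduplicate(
        IEnumerable<KeyValuePair<string, string>> pairs)
    {
        var unique = new HashSet<KeyValuePair<string, string>>();
        foreach (var pair in pairs)
        {
            // Add is a no-op (and returns false) for pairs that are already stored.
            unique.Add(pair);
        }
        return unique;
    }

    public static void Main()
    {
        var input = new[]
        {
            new KeyValuePair<string, string>("k1", "v1"),
            new KeyValuePair<string, string>("k1", "v1"), // duplicate
            new KeyValuePair<string, string>("k1", "v2"),
        };
        Console.WriteLine(Deduplicate(input).Count); // 2
    }
}
```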

Answer 6

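To create a unique list, you want to use .Distinct() to generate it, not .Contains(). However, whatever class holds your strings must implement .GetHashCode() and .Equals() correctly to get good performance, or you must pass in a custom comparer.

Here is how you could do it with a custom comparer: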

    private static void Main(string[] args)
    {

        List<KeyValuePair<string, string>> giantList = Populate();
        var uniqueItems = giantList.Distinct(new MyStringEquater()).ToList();
    }

    class MyStringEquater : IEqualityComparer<KeyValuePair<string, string>>
    {
        // Choose the comparer based on whether you want comparisons to be case sensitive or not
        private static readonly StringComparer comparer = StringComparer.OrdinalIgnoreCase;

        public bool Equals(KeyValuePair<string, string> x, KeyValuePair<string, string> y)
        {
            return comparer.Equals(x.Key, y.Key) && comparer.Equals(x.Value, y.Value);
        }

        public int GetHashCode(KeyValuePair<string, string> obj)
        {
            unchecked
            {
                int x = 27;
                x = x*11 + comparer.GetHashCode(obj.Key);
                x = x*11 + comparer.GetHashCode(obj.Value);
                return x;
            }
        }
    }

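Also, per your comment in the other answer, you can use the above comparer in a HashSet and have it store your unique items that way. You just need to pass the comparer to the constructor.

var hashSetWithComparer = new HashSet<KeyValuePair<string, string>>(new MyStringEquater());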
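
A self-contained sketch of that idea. The comparer here is a compact, illustrative stand-in for MyStringEquater (its name and hash combination are assumptions, and it assumes keys and values are never null). It shows that two pairs differing only in case collapse into one entry.

```csharp
using System;
using System.Collections.Generic;

// Illustrative comparer: treats pairs as equal when both key and value
// match case-insensitively (ordinal comparison).
public sealed class CaseInsensitivePairComparer : IEqualityComparer<KeyValuePair<string, string>>
{
    private static readonly StringComparer Comparer = StringComparer.OrdinalIgnoreCase;

    public bool Equals(KeyValuePair<string, string> x, KeyValuePair<string, string> y)
    {
        return Comparer.Equals(x.Key, y.Key) && Comparer.Equals(x.Value, y.Value);
    }

    public int GetHashCode(KeyValuePair<string, string> obj)
    {
        unchecked
        {
            // Assumes non-null strings; StringComparer.GetHashCode throws on null.
            return Comparer.GetHashCode(obj.Key) * 31 + Comparer.GetHashCode(obj.Value);
        }
    }
}

public static class ComparerDemo
{
    public static void Main()
    {
        var set = new HashSet<KeyValuePair<string, string>>(new CaseInsensitivePairComparer());
        Console.WriteLine(set.Add(new KeyValuePair<string, string>("Key", "Value"))); // True
        Console.WriteLine(set.Add(new KeyValuePair<string, string>("KEY", "value"))); // False
        Console.WriteLine(set.Count);                                                 // 1
    }
}
```

Hashing both the key and the value matters when many pairs share a key: a hash based on the key alone would put all of those pairs in the same bucket.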

Generic List Contains() performance and alternatives
