C# dictionary with a large number of items

Time: 2018-08-31 11:42:56

Tags: c# dictionary

I want to understand the cost of storing a large number of items in memory in C#. The data structure I need is a dictionary or something similar. The number of items I expect to have is around 100 million, but the application won't reach that number right away; it will take a long time before we approach that limit.

Amortized operation cost is fine with me, but I cannot afford a single very expensive operation at any particular moment. Dynamic data structures typically reallocate themselves when they become full, and a dictionary, I believe, even re-indexes every item. So say the application is holding 20 million items and that happens to hit the dictionary's capacity: when the new dictionary storage is allocated, those 20 million items all have to be re-indexed.

That's why I think an array of dictionaries might be a good idea. Say I create 256 dictionaries: that immediately caps each inner dictionary at well under a million entries, which should be manageable to build up dynamically, with all the re-indexing work bounded at around a million entries at a time. The cost seems to be one extra indexing step per operation to find the right dictionary to look in.

Is this a reasonable approach? Is my analysis correct, or will the C# dictionary perform better than I expect for some reason? Is there another, better solution? I'm looking for a data structure with the same time complexity as the C# dictionary.

Edit: The dictionary keys are random values, so I can simply take one byte of the key as a very cheap way to find my index into the array of 256 dictionaries.
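For illustration, here is a minimal sketch of what such an array of dictionaries could look like (the ShardedDictionary shape and the long keys are my assumptions, not something specified in the question):

class ShardedDictionary
{
    // 256 inner dictionaries; at 100 million items total, each shard holds
    // on the order of 400,000 entries, so any single rehash stays small.
    private readonly Dictionary<long, byte[]>[] _shards = new Dictionary<long, byte[]>[256];

    public ShardedDictionary()
    {
        for (int i = 0; i < _shards.Length; i++)
            _shards[i] = new Dictionary<long, byte[]>();
    }

    // The keys are random, so the low byte picks a shard uniformly.
    private Dictionary<long, byte[]> ShardFor(long key) => _shards[key & 0xFF];

    public void Add(long key, byte[] value) => ShardFor(key).Add(key, value);
    public bool TryGetValue(long key, out byte[] value) => ShardFor(key).TryGetValue(key, out value);
    public bool Remove(long key) => ShardFor(key).Remove(key);
}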

I'm not currently considering a database, because I want all items immediately available at very little cost. I do need low-overhead, constant-time lookups. I can afford slower inserts as long as they remain constant time, and the same goes for deletes: possibly a bit slower, but constant time is required.

It should be possible to fit all the items in memory. The items are small, around 50 bytes of data each, so the data structure shouldn't add too much overhead per item.
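As a rough sanity check on that (my numbers, assuming typical 64-bit .NET object layouts, and only approximate):

const long itemCount = 100_000_000;
const long payload = 50;         // raw data per item
const long arrayOverhead = 30;   // object header, length field and padding for each byte[]
const long entryOverhead = 28;   // hash code, next index, int key, value reference and bucket slot per entry

long totalBytes = itemCount * (payload + arrayOverhead + entryOverhead);
Console.WriteLine($"roughly {totalBytes / 1_000_000_000.0:F1} GB"); // on the order of 10 GB

So the per-item overhead is roughly comparable to the payload itself: large, but plausible on a 64-bit machine with enough RAM.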

2 answers:

Answer 0 (score: 4)

UPDATE: Since posting this, I have edited it to:

  • Store fixed-size objects (a byte[50]) on each pass
  • Pre-allocate all of them before adding to the dictionary (rather than creating the objects inside the loop)
  • Run GC.Collect() after the pre-allocation
  • Set gcAllowVeryLargeObjects to true (see the config sketch after this list)
  • Definitely build for x64 (it was before, but I switched to 'Release' to build and run outside VS... and that reset it, oops)
  • Try it both with and without pre-allocating the dictionary size
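A sketch of that setting, assuming a classic App.config on 64-bit .NET Framework (as far as I know, .NET Core enables it by default):

<!-- App.config: allow objects larger than 2 GB in total size -->
<configuration>
  <runtime>
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>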

Here's the code:

// Pre-allocate 100 million 50-byte payloads up front, outside the timed inserts
var arrays = new byte[100000000][];
System.Diagnostics.Stopwatch stopwatch = new System.Diagnostics.Stopwatch();
stopwatch.Start();
for (var i = 0; i < 100000000; i++)
{
    arrays[i] = new byte[50];
}
stopwatch.Stop();
Console.WriteLine($"initially allocating arrays took {stopwatch.ElapsedMilliseconds} ms");
stopwatch.Restart();

GC.Collect();
Console.WriteLine($"GC after array allocation took {stopwatch.ElapsedMilliseconds} ms");

Dictionary<int, byte[]> dict = new Dictionary<int, byte[]>(100000000); // pre-allocated run
//Dictionary<int, byte[]> dict = new Dictionary<int, byte[]>();        // swap in for the non-pre-allocated run

for (var c = 0; c < 100; c++)
{
    stopwatch.Restart();
    for (var i = 0; i < 1000000; i++)
    {
        dict.Add((c * 1000000) + i, arrays[(c * 1000000) + i]); // key and array index advance together across all 100 passes
    }
    stopwatch.Stop();
    Console.WriteLine($"pass number {c} took {stopwatch.ElapsedMilliseconds} milliseconds");
}

Console.ReadLine();

Here's the output when I don't pre-allocate the dictionary size:

initially allocating arrays took 14609 ms
GC after array allocation took 3713 ms
pass number 0 took 63 milliseconds
pass number 1 took 51 milliseconds
pass number 2 took 78 milliseconds
pass number 3 took 28 milliseconds
pass number 4 took 32 milliseconds
pass number 5 took 133 milliseconds
pass number 6 took 41 milliseconds
pass number 7 took 31 milliseconds
pass number 8 took 27 milliseconds
pass number 9 took 26 milliseconds
pass number 10 took 45 milliseconds
pass number 11 took 335 milliseconds
pass number 12 took 34 milliseconds
pass number 13 took 35 milliseconds
pass number 14 took 71 milliseconds
pass number 15 took 66 milliseconds
pass number 16 took 64 milliseconds
pass number 17 took 58 milliseconds
pass number 18 took 71 milliseconds
pass number 19 took 65 milliseconds
pass number 20 took 68 milliseconds
pass number 21 took 67 milliseconds
pass number 22 took 83 milliseconds
pass number 23 took 11986 milliseconds
pass number 24 took 7948 milliseconds
pass number 25 took 38 milliseconds
pass number 26 took 36 milliseconds
pass number 27 took 27 milliseconds
pass number 28 took 31 milliseconds
..SNIP lots between 30-40ms...
pass number 44 took 34 milliseconds
pass number 45 took 34 milliseconds
pass number 46 took 33 milliseconds
pass number 47 took 2630 milliseconds
pass number 48 took 12255 milliseconds
pass number 49 took 33 milliseconds
...SNIP a load of lines which are all between 30 to 50ms...
pass number 93 took 39 milliseconds
pass number 94 took 43 milliseconds
pass number 95 took 7056 milliseconds
pass number 96 took 33323 milliseconds
pass number 97 took 228 milliseconds
pass number 98 took 70 milliseconds
pass number 99 took 84 milliseconds

You can clearly see the points where it has to re-allocate. My guess is that it simply doubles the size and copies the current items over, since there's a long stretch at the end where it never needs to do this. Some of those re-allocations are awfully expensive, though (30+ seconds! ouch).
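If spikes like these are unacceptable but the eventual size isn't known at construction time, newer runtimes (.NET Core 2.0 and later) also expose Dictionary&lt;TKey,TValue&gt;.EnsureCapacity, which lets you pay the rehash cost at a moment of your choosing instead of in the middle of an insert; a small sketch:

var dict = new Dictionary<int, byte[]>();
// Before a burst of inserts, grow the table up front so the burst
// itself never triggers a resize (.NET Core 2.0+ / .NET Standard 2.1):
dict.EnsureCapacity(dict.Count + 1_000_000);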

Here's the output when I do pre-allocate the dictionary size:

initially allocating arrays took 15494 ms
GC after array allocation took 2622 ms
pass number 0 took 9585 milliseconds
pass number 1 took 107 milliseconds
pass number 2 took 91 milliseconds
pass number 3 took 145 milliseconds
pass number 4 took 83 milliseconds
pass number 5 took 118 milliseconds
pass number 6 took 133 milliseconds
pass number 7 took 126 milliseconds
pass number 8 took 65 milliseconds
pass number 9 took 52 milliseconds
pass number 10 took 42 milliseconds
pass number 11 took 34 milliseconds
pass number 12 took 45 milliseconds
pass number 13 took 48 milliseconds
pass number 14 took 46 milliseconds
..SNIP lots between 30-80ms...
pass number 45 took 80 milliseconds
pass number 46 took 65 milliseconds
pass number 47 took 64 milliseconds
pass number 48 took 65 milliseconds
pass number 49 took 122 milliseconds
pass number 50 took 103 milliseconds
pass number 51 took 45 milliseconds
pass number 52 took 77 milliseconds
pass number 53 took 64 milliseconds
pass number 54 took 96 milliseconds

..SNIP lots between 30-80ms...
pass number 77 took 44 milliseconds
pass number 78 took 85 milliseconds
pass number 79 took 142 milliseconds
pass number 80 took 138 milliseconds
pass number 81 took 47 milliseconds
pass number 82 took 44 milliseconds
..SNIP lots between 30-80ms...
pass number 93 took 52 milliseconds
pass number 94 took 50 milliseconds
pass number 95 took 63 milliseconds
pass number 96 took 111 milliseconds
pass number 97 took 175 milliseconds
pass number 98 took 96 milliseconds
pass number 99 took 67 milliseconds

Memory usage went above 9GB while initially creating the arrays, dropped to about 6.5GB after GC.Collect, climbed back above 9GB while adding to the dictionary, and then once everything was done (and after waiting a little while) dropped to ~3.7GB and stayed there.

It's clearly much faster in operation to pre-allocate the dictionary size.

*** For reference, the original answer follows ***

I just wrote this little test. I have no idea WHAT you are storing, so I just created a meaningless class holding very little information and used an int as the key. My takeaways from it were:

  1. Adding to the dictionary didn't seem to get slower until around the 40 million mark. Running a 'Release' build targeting x64, every million inserts took around 500ms, and then passes 41 through 46 took more like 700-850ms (a noticeable jump at that point).

  2. It got to just over 46,000,000, had consumed about 4GB of RAM, and crashed with an out-of-memory exception.

  3. Use a database, or the Google dictionary-abuse squad will take you down.

The code:

using System;
using System.Collections.Generic;

class AThing
{
    public string Name { get; set; }
    public int id { get; set; }
}
class Program
{
    static void Main(string[] args)
    {
        Dictionary<int, AThing> dict = new Dictionary<int, AThing>();

        for (var c = 0; c < 100; c++)
        {
            DateTime nowTime = DateTime.Now;
            for (var i = 0; i < 1000000; i++)
            {
                var thing = new AThing { id = (c * 1000000) + i, Name = $"Item {(c * 1000000) + i}" };
                dict.Add(thing.id, thing);
            }
            var timeTaken = DateTime.Now - nowTime;
            Console.WriteLine($"pass number {c} took {timeTaken.Milliseconds} milliseconds");
        }

    }
}

Answer 1 (score: 0)

If the program is expected to run with the dictionary at its maximum size, why not allocate it at maximum size right from the start and avoid the re-indexing and so on entirely? The amount of memory used only differs temporarily from the other solutions, but the time saved is not temporary. Moreover, while the dictionary is still mostly empty, the chance of collisions will be very low.
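A minimal sketch of that suggestion, assuming int keys and the 100-million target from the question:

// Pay the whole table allocation once, up front; the requested capacity is
// rounded up internally, and no rehash occurs until the count exceeds it.
var dict = new Dictionary<int, byte[]>(100_000_000);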