Question

I have a strange problem. I have a ConcurrentDictionary that I load from a text file. The text file is 3.4 GB on disk, but once it is loaded into the ConcurrentDictionary, RAM usage is 14 GB. What am I doing wrong?

protected ConcurrentDictionary<string, int> BaseVocabulary = new ConcurrentDictionary<string, int>();

public async Task<bool> LoadVocabularyFileAsync(string path)
{
    await Task.Run(() =>
    {
        using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (BufferedStream bs = new BufferedStream(fs))
        using (StreamReader sr = new StreamReader(bs))
        {
            string line;

            while ((line = sr.ReadLine()) != null)
            {
                // Each line is "<word> <index>", e.g. "the 2000000".
                string[] split = line.Split(' ');

                // Check the field count as well, so a line without a space raises
                // InvalidDataException rather than IndexOutOfRangeException.
                int index;
                if (split.Length < 2 || !int.TryParse(split[1], out index))
                    throw new InvalidDataException("The data format is invalid!");

                string word = split[0];

                // TryAdd returns false when the key is already present.
                if (!ContainsWord(word))
                    if (!BaseVocabulary.TryAdd(word, index))
                        QueueWord(word);
            }
        }
    });

    return true;
}

public bool ContainsWord(string word)
{
    return BaseVocabulary.ContainsKey(word);
}

private void QueueWord(string word)
{
    Queue.Add(word);
}

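How can I make this more efficient? RAM matters a great deal in my application and I need to free it up. Ideally the in-memory size would be close to the size on disk.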

Edit: as requested, each line has this structure:

the 2000000

So an entry in the ConcurrentDictionary looks like:

BaseVocabulary.Key = the;
BaseVocabulary.Value = 2000000

Hope this helps.

Answer 1

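I researched this quite a bit, and although I couldn't find anything definitive from Microsoft, I did find This Website, which discusses how much memory each entry uses. That test was done with a Dictionary rather than a ConcurrentDictionary. The concurrent version probably adds overhead for thread safety.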

Re-running the test myself with a ConcurrentDictionary, using a one-character string key and an int value, I see 72 bytes added for every entry.
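A measurement like the one described above could be reproduced roughly as follows. This is only a sketch: the class and method names (EntryCostProbe, MeasureBytesPerEntry) are made up for illustration, the keys are short numeric strings rather than single characters, and GC.GetTotalMemory gives an approximation, so the figure will vary with runtime, bitness, and key length.

```csharp
using System;
using System.Collections.Concurrent;

public static class EntryCostProbe
{
    // Approximates the managed bytes consumed per entry by filling a
    // ConcurrentDictionary and comparing GC.GetTotalMemory before and after.
    public static double MeasureBytesPerEntry(int count)
    {
        long before = GC.GetTotalMemory(forceFullCollection: true);

        var map = new ConcurrentDictionary<string, int>();
        for (int i = 0; i < count; i++)
        {
            // Keys are distinct, so every TryAdd inserts a new entry.
            map.TryAdd(i.ToString(), i);
        }

        long after = GC.GetTotalMemory(forceFullCollection: true);
        GC.KeepAlive(map); // keep the dictionary reachable until after the second reading

        return (after - before) / (double)count;
    }
}
```

Calling EntryCostProbe.MeasureBytesPerEntry(1000000) and printing the result gives a per-entry estimate that includes the key string, the value, and the dictionary's own bookkeeping.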

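My guess is that what you are seeing is the overhead of ConcurrentDictionary, and short of choosing another way to store the data, I'm not sure you will have much better luck.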

For your purposes, it may be easier to handle the synchronization yourself.
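Handling synchronization yourself might look something like the sketch below: a plain Dictionary guarded by a single lock. The LockedVocabulary type and its members are hypothetical names, not part of the original code, and whether this actually saves memory in your case would need to be measured.

```csharp
using System;
using System.Collections.Generic;

public sealed class LockedVocabulary
{
    private readonly object _gate = new object();
    private readonly Dictionary<string, int> _words;

    public LockedVocabulary(int expectedCount = 0)
    {
        // Pre-sizing avoids repeated resizing while a large file is loaded.
        _words = new Dictionary<string, int>(expectedCount, StringComparer.Ordinal);
    }

    // Adds the word if it is absent; returns false if it was already present.
    public bool TryAdd(string word, int index)
    {
        lock (_gate)
        {
            if (_words.ContainsKey(word)) return false;
            _words.Add(word, index);
            return true;
        }
    }

    public bool TryGetIndex(string word, out int index)
    {
        lock (_gate) { return _words.TryGetValue(word, out index); }
    }

    public int Count
    {
        get { lock (_gate) { return _words.Count; } }
    }
}
```

A single coarse lock is the simplest option. If the file is loaded by one thread and only read afterwards, the lock could arguably be dropped for the load phase entirely.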

Answer 2

We seem to have a more concise answer.

It comes as part of the thread safety: the concurrent collections do create two copies of the collection: https://referencesource.microsoft.com/#System/sys/system/collections/concurrent/ConcurrentBag.cs,a1bdd7135f94cbdb


volatile ThreadLocalList m_headList, m_tailList;

ThreadLocalList currentList = m_headList;

// Acquire the lock to update the m_tailList pointer
lock (GlobalListsLock)
{
    if (m_headList == null)
    {
        list = new ThreadLocalList(Thread.CurrentThread);
        m_headList = list;
        m_tailList = list;
    }
    else
    {
        list = GetUnownedList();
        if (list == null)
        {
            list = new ThreadLocalList(Thread.CurrentThread);
            m_tailList.m_nextList = list;
            m_tailList = list;
        }
    }
    m_locals.Value = list;
}

So it looks like the concurrent collections do make two copies, m_tailList and a second copy, m_headList, effectively doubling the List size. That would explain memory use of roughly 4 times the size on disk: from about 3.4 GB to about 14 GB.

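So with a char in .NET being 2 bytes rather than 1 (a factor of 2x), and the List then being doubled again inside the concurrent dictionary (another factor of 2x), plus a little overhead from the data structure, the number makes more sense.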
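The character-size part of that arithmetic can be checked with a small sketch. It assumes the file is UTF-8 with a one-byte newline per line, and SizeEstimate is a made-up helper name. It counts only raw character data and deliberately ignores string object headers and dictionary bookkeeping, which come on top.

```csharp
using System.Text;

public static class SizeEstimate
{
    // Bytes the line occupies on disk, assuming UTF-8 text and a one-byte '\n' terminator.
    public static int BytesOnDisk(string line)
    {
        return Encoding.UTF8.GetByteCount(line) + 1;
    }

    // Bytes of character data alone once the key is a .NET string (UTF-16, 2 bytes per char).
    public static int KeyCharBytes(string word)
    {
        return word.Length * sizeof(char);
    }
}
```

For the sample line "the 2000000", BytesOnDisk gives 12 and KeyCharBytes("the") gives 6. The value is stored as a 4-byte int rather than as text, so for short words most of the in-memory cost comes from per-object and per-entry overhead rather than from the characters themselves.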

Using a custom Dictionary class, I got it down to 11 GB. With more work it could probably go lower, and loading is also much faster.
