Question

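I ran into a problem with concurrent data processing: my computer quickly runs out of RAM. Any suggestions on how to fix my concurrent implementation?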

The common class:

public class CalculationResult
{
    public int Count { get; set; }
    public decimal[] RunningTotals { get; set; }

    public CalculationResult(decimal[] profits)
    {
        this.Count = 1;
        this.RunningTotals = new decimal[12];
        profits.CopyTo(this.RunningTotals, 0);
    }

    public void Update(decimal[] newData)
    {
        this.Count++;

        // sum the arrays element by element
        for (int i = 0; i < 12; i++)
            this.RunningTotals[i] = this.RunningTotals[i] + newData[i];
    }

    public void Update(CalculationResult otherResult)
    {
        this.Count += otherResult.Count;

        // sum the arrays element by element
        for (int i = 0; i < 12; i++)
            this.RunningTotals[i] = this.RunningTotals[i] + otherResult.RunningTotals[i];
    }
}

The single-core implementation of the code is as follows:

Dictionary<string, CalculationResult> combinations = new Dictionary<string, CalculationResult>();
foreach (var i in itterations)
{
    // do the processing
    // ..
    string combination = "1,2,3,4,42345,52,523"; // this is determined during the processing

    if (combinations.ContainsKey(combination))
        combinations[combination].Update(newData);
    else
        combinations.Add(combination, new CalculationResult(newData));
}

The multi-core implementation:

ConcurrentBag<Dictionary<string, CalculationResult>> results = new ConcurrentBag<Dictionary<string, CalculationResult>>();
Parallel.ForEach(itterations, (i, state) => 
{
    Dictionary<string, CalculationResult> combinations = new Dictionary<string, CalculationResult>();
    // do the processing
    // ..
    // add combination to combinations -> same logic as in single core implementation
    results.Add(combinations);
});
Dictionary<string, CalculationResult> combinationsReal = new Dictionary<string, CalculationResult>();
foreach (var item in results)
{
    foreach (var pair in item)
    {
        if (combinationsReal.ContainsKey(pair.Key))
            combinationsReal[pair.Key].Update(pair.Value);
        else
            combinationsReal.Add(pair.Key, pair.Value);
    }
}

The problem I have is that almost every combinations dictionary ends up with 930k records, which consume about 400 MB of RAM on average.

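Now, in the single-core implementation there is only one such dictionary, and all the checks are performed against it. But that approach is slow, and I want to use multi-core optimizations.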

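In the multi-core implementation, a ConcurrentBag instance is created that holds all the combinations dictionaries. Once the multithreaded job has finished, all the dictionaries are aggregated into one. This approach works well for a small number of concurrent iterations. For example, with 4 iterations my RAM usage was ~1.5 GB. The problem arises when I set the full number of parallel iterations, which is 200! No amount of PC RAM is enough to hold all the dictionaries, each with hundreds of thousands of records!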

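I was thinking of using ConcurrentDictionary, until I found out that the "TryAdd" method does not guarantee the integrity of the added data in my case, since I also need to run updates on the running totals.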

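The only real multithreading option would be to save all the combinations to some database instead of adding them to a dictionary. Data aggregation would then be a matter of a single SQL select statement with a group by clause... but I don't like the idea of creating a temporary table and running a database instance just for that.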

Is there a way to process the data concurrently without running out of RAM?

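EDIT: Maybe the real question should have been: how do I update RunningTotals in a thread-safe way when using ConcurrentDictionary? I just came across this thread, which describes a similar problem with ConcurrentDictionary, but my situation seems more complicated because I have an array that needs to be updated. I am still investigating this.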

EDIT2: Here is a working solution with ConcurrentDictionary. All I needed to do was add a lock for the dictionary key.

ConcurrentDictionary<string, CalculationResult> combinations = new ConcurrentDictionary<string, CalculationResult>();
Parallel.ForEach(itterations, (i, state) => 
{
    // do the processing
    // ..
    string combination = "1,2,3,4,42345,52,523"; // this is determined during the processing

    if (combinations.ContainsKey(combination)) {
        lock(combinations[combination])
            combinations[combination].Update(newData);
    }
    else
        combinations.TryAdd(combination, new CalculationResult(newData));
});
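
One caveat about the snippet above: if two threads both see the key as missing, only one TryAdd succeeds and the other thread's newData is silently dropped. A minimal sketch of one way to close that gap follows, assuming it is acceptable to build a candidate CalculationResult up front. The helper name AddOrAccumulate and the trimmed-down CalculationResult are illustrative, not part of the original code.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Trimmed-down stand-in for the CalculationResult class from the question.
public class CalculationResult
{
    public int Count { get; private set; }
    public decimal[] RunningTotals { get; } = new decimal[12];

    public CalculationResult(decimal[] profits)
    {
        Count = 1;
        profits.CopyTo(RunningTotals, 0);
    }

    public void Update(decimal[] newData)
    {
        Count++;
        for (int i = 0; i < 12; i++)
            RunningTotals[i] += newData[i];
    }
}

public static class SafeUpdateSketch
{
    // Hypothetical helper: either publishes a fresh result or folds newData
    // into the result that another thread has already published.
    public static void AddOrAccumulate(
        ConcurrentDictionary<string, CalculationResult> combinations,
        string combination,
        decimal[] newData)
    {
        var fresh = new CalculationResult(newData);

        // GetOrAdd returns the stored instance, whichever thread added it.
        var stored = combinations.GetOrAdd(combination, fresh);

        // If another thread won the race, our data is not in 'stored' yet,
        // so merge it under a lock on that instance.
        if (!ReferenceEquals(stored, fresh))
        {
            lock (stored)
                stored.Update(newData);
        }
    }
}
```

Inside the Parallel.ForEach body, the if/else block would then shrink to a single call such as SafeUpdateSketch.AddOrAccumulate(combinations, combination, newData). The price is one extra allocation per call whenever the key already exists.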

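The single-threaded code takes 1m 48s to execute, while this solution takes 1m 7s for 4 iterations (a 37% performance increase). I still wonder whether the SQL approach would be faster with millions of records. I will probably test it tomorrow and post an update.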

EDIT3: For those wondering what goes wrong when updating values in a ConcurrentDictionary: run the following code with and without the lock.

public class Result
{
    public int Count { get; set; }
}

class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine("Start");

        List<int> keys = new List<int>();
        for (int i = 0; i < 100; i++)
            keys.Add(i);

        ConcurrentDictionary<int, Result> dict = new ConcurrentDictionary<int, Result>();
        Parallel.For(0, 8, i =>
        {
            foreach(var key in keys)
            {
                if (dict.ContainsKey(key))
                {
                    //lock (dict[key]) // uncomment this
                        dict[key].Count++;
                }
                else
                    dict.TryAdd(key, new Result());
            }
        });

        // any output here is incorrect behavior. best result = no lines
        foreach (var item in dict)
            if (item.Value.Count != 7) { Console.WriteLine($"{item.Key}; {item.Value.Count}"); }

        Console.WriteLine($"Finish");
        Console.ReadKey();
    }
}

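EDIT4: After some trial and error, I was not able to optimize the SQL approach. It turned out to be the worst idea :) I used an SQLite database, both in-memory and file-based, with transactions and reusable SQL command parameters. Because of the huge number of records that have to be inserted, the performance was insufficient. Data aggregation is the easy part, but inserting 4 million rows takes an enormous amount of time, and I cannot even imagine how 240 million rows could be handled efficiently. So far (and oddly enough), the ConcurrentBag approach seems to be the fastest on my PC, followed by the ConcurrentDictionary approach. ConcurrentBag is a bit heavy on memory, though. Thanks to @Alisson's work, it is now feasible to use it for larger sets of iterations!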

Answer 1

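So you just need to make sure that you never run more than 4 concurrent iterations. That is the limit of your computer's resources, and as long as you use only this computer, there is no magic.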
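
As an aside, one possible way to enforce such a cap without a custom class is the Parallel.ForEach overload that takes ParallelOptions plus localInit and localFinally delegates. The sketch below is only an illustration under simplifying assumptions: it counts string keys instead of building CalculationResult objects, and the names BoundedParallelSketch and CountKeys are made up for the example. Each worker task gets its own dictionary, so the number of partial dictionaries alive at once is bounded by the degree of parallelism rather than by the number of iterations.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BoundedParallelSketch
{
    // Hypothetical helper: counts how often each key occurs, standing in
    // for the real per-iteration processing from the question.
    public static Dictionary<string, int> CountKeys(IEnumerable<string> items, int maxWorkers)
    {
        var merged = new Dictionary<string, int>();
        var mergeLock = new object();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxWorkers };

        Parallel.ForEach<string, Dictionary<string, int>>(
            items,
            options,
            // localInit: one private dictionary per worker task
            () => new Dictionary<string, int>(),
            // body: no locking needed, 'local' is only seen by this task
            (item, state, local) =>
            {
                local.TryGetValue(item, out int count);
                local[item] = count + 1;
                return local;
            },
            // localFinally: fold the private dictionary into the shared one
            local =>
            {
                lock (mergeLock)
                {
                    foreach (var pair in local)
                    {
                        merged.TryGetValue(pair.Key, out int count);
                        merged[pair.Key] = count + pair.Value;
                    }
                }
            });

        return merged;
    }
}
```

Calling BoundedParallelSketch.CountKeys(items, 4) would keep at most 4 worker tasks running at a time. Whether this beats the task-based class described below depends on the actual workload.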

I created a class that controls the concurrent execution and the number of concurrent tasks it will run.

The class holds the following fields:

public class ConcurrentCalculationProcessor
{
    private const int MAX_CONCURRENT_TASKS = 4;
    private readonly IEnumerable<int> _codes;
    private readonly List<Task<Dictionary<string, CalculationResult>>> _tasks;
    private readonly Dictionary<string, CalculationResult> _combinationsReal;

    public ConcurrentCalculationProcessor(IEnumerable<int> codes)
    {
        this._codes = codes;
        this._tasks = new List<Task<Dictionary<string, CalculationResult>>>();
        this._combinationsReal = new Dictionary<string, CalculationResult>();
    }
}

I made the number of concurrent tasks a const, but it could just as well be a constructor parameter.

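I created a method that handles the processing. For testing purposes, I simulated a loop over roughly 900k items, adding them to a dictionary and finally returning them: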

private async Task<Dictionary<string, CalculationResult>> ProcessCombinations()
{
    Dictionary<string, CalculationResult> combinations = new Dictionary<string, CalculationResult>();
    // do the processing
    // here we should do something that worth using concurrency
    // like querying databases, consuming APIs/WebServices, and other I/O stuff
    for (int i = 0; i < 950000; i++)
        combinations[i.ToString()] = new CalculationResult(new decimal[] { 1, 10, 15 });
    return await Task.FromResult(combinations);
}
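
Note that an async method runs synchronously until it awaits something that has not completed yet, and awaiting Task.FromResult never yields. As written, the simulation loop therefore runs on the caller's thread. If the real work is CPU-bound rather than I/O-bound, one option is to hand it to the thread pool with Task.Run. The following is a sketch only: the class name CpuBoundSketch, the int payload, and the itemCount parameter are placeholders for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class CpuBoundSketch
{
    // Hypothetical stand-in for ProcessCombinations: the loop runs on a
    // thread-pool thread, so the caller gets the Task back immediately.
    public static Task<Dictionary<string, int>> ProcessCombinationsAsync(int itemCount)
    {
        return Task.Run(() =>
        {
            var combinations = new Dictionary<string, int>();
            for (int i = 0; i < itemCount; i++)
                combinations[i.ToString()] = i;
            return combinations;
        });
    }
}
```

With this shape, several calls started back to back can actually overlap, which is what the task list in the Execute method below assumes.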

The main method starts the tasks in parallel and adds them to the task list so that we can keep track of them later.

Every time the list reaches the maximum number of concurrent tasks, we await a method called ProcessCompletedTasks.

public async Task<Dictionary<string, CalculationResult>> Execute()
{
    ConcurrentBag<Dictionary<string, CalculationResult>> results = new ConcurrentBag<Dictionary<string, CalculationResult>>();

    for (int i = 0; i < this._codes.Count(); i++)
    {
        // start the task immediately
        var task = ProcessCombinations();
        this._tasks.Add(task);
        if (this._tasks.Count() >= MAX_CONCURRENT_TASKS)
        {
            // once MAX_CONCURRENT_TASKS tasks are in progress, we start processing some of them
            // this awaits any of the current tasks to complete, then processes it (and any other task that has completed as well)
            await ProcessCompletedTasks().ConfigureAwait(false);
        }
    }

    // keep processing until all the pending tasks have been completed...it should be no more than MAX_CONCURRENT_TASKS
    while(this._tasks.Any())
        await ProcessCompletedTasks().ConfigureAwait(false);

    return this._combinationsReal;
}

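The next method, ProcessCompletedTasks, waits for at least one of the existing tasks to complete. After that, it takes all the completed tasks from the list (the one that finished and any others that may have finished in the meantime) and gets their results (the combinations).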

Each processedCombinations is merged into this._combinationsReal (using the same logic you provided in your question).

private async Task ProcessCompletedTasks()
{
    await Task.WhenAny(this._tasks).ConfigureAwait(false);
    var completedTasks = this._tasks.Where(t => t.IsCompleted).ToArray();
    // completedTasks will have at least one task, but it may have more ;)
    foreach (var completedTask in completedTasks)
    {
        var processedCombinations = await completedTask.ConfigureAwait(false);
        foreach (var pair in processedCombinations)
        {
            if (this._combinationsReal.ContainsKey(pair.Key))
                this._combinationsReal[pair.Key].Update(pair.Value);
            else
                this._combinationsReal.Add(pair.Key, pair.Value);
        }
        this._tasks.Remove(completedTask);
    }
}

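For each processedCombinations merged into _combinationsReal, the respective task is removed from the list, and execution then moves on (starting to add more tasks again). This keeps happening until tasks have been created for all the iterations.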

Finally, we keep processing until there are no more tasks in the list.

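If you monitor the RAM consumption, you will notice that it rises to about 1.5 GB (while 4 tasks are being processed at the same time) and then drops to about 0.8 GB (when the tasks are removed from the list). At least that is what happened on my computer.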

Here is a fiddle, but I had to reduce the number of items from 900k to 100, because the fiddle limits memory usage to prevent abuse.

I hope this helps you somehow.

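One thing to note about all of this is that you will benefit from using concurrent tasks mostly if your ProcessCombinations (the method executed concurrently while processing those 900k items) calls external resources, such as reading files from disk, running queries against a database, or calling API/web service methods. I suppose that code probably reads the 900k items from an external resource, in which case this would reduce the time needed to process them.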

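If the items were loaded beforehand and ProcessCombinations only reads data that is already in memory, then concurrency will not help at all (in fact, I believe it would make your code run slower). If that is the case, we are applying concurrency in the wrong place.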

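Using async calls in parallel may help more when those calls access external resources (to fetch or store data), and depending on how many concurrent calls the external resource can support, it may still not make much of a difference.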

How to avoid RAM exhaustion during concurrent data processing?
