Question

I have two large lists of objects. The first (about 1,000,000 objects):

public class BaseItem
{
    public BaseItem()
    {

    }

    public double Fee { get; set; } = 0;

    public string Market { get; set; } = string.Empty;

    public string Traider { get; set; } = string.Empty;

    public DateTime DateUtc { get; set; } = new DateTime();
}

The second (about 20,000 objects):

public class TraiderItem
{
    public TraiderItem()
    {

    }

    public DateTime DateUtc { get; set; } = new DateTime();

    public string Market { get; set; } = string.Empty;

    public string Type { get; set; } = string.Empty;

    public double Price { get; set; } = 0;

    public double Amount { get; set; } = 0;

    public double Total { get; set; } = 0;

    public double Fee { get; set; } = 0;

    public string FeeCoin { get; set; } = string.Empty;
}

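I need to find all TraiderItem objects that have a matching BaseItem, where DateUtc is equal and Fee is equal (after scaling and rounding, as in the query below). Right now I am using the Any method: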

traiderItemsInBase = traiderItems.Where(a => baseItems.Any(x => x.DateUtc == a.DateUtc && Math.Round(x.Fee, 8) == Math.Round((double)a.Fee * 0.4, 8))).ToList();

But this is very slow. Is there a way to do this more efficiently? Is it possible to use a HashSet in this case?

Answer 1

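At first I thought of a solution using HashSet<> or Dictionary<>, but that does not really fit this use case. How about speeding it up by using more of your cores/threads with PLINQ's AsParallel()?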

traiderItemsInBase = traiderItems.AsParallel()
    .Where(a => baseItems.Any(x => x.DateUtc == a.DateUtc &&
                              Math.Round(x.Fee, 8) == Math.Round((double)a.Fee * 0.4, 8)))
    .ToList();

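This should scale well, because the work happens entirely in memory and is not bottlenecked by a database query or other I/O. So 4 cores should finish almost 4x faster.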
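
One detail worth keeping in mind: a plain AsParallel() query does not guarantee that results come back in source order. If the order of traiderItems matters, AsOrdered() can be added, at some cost. The following is only a minimal sketch; the two classes are trimmed-down copies of the ones in the question, and ParallelFilter / FindMatchesOrdered are hypothetical names used for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed-down copies of the question's classes (only the members used here).
public class BaseItem
{
    public double Fee { get; set; }
    public DateTime DateUtc { get; set; }
}

public class TraiderItem
{
    public double Fee { get; set; }
    public DateTime DateUtc { get; set; }
}

// Hypothetical helper, not part of the original answer.
public static class ParallelFilter
{
    public static List<TraiderItem> FindMatchesOrdered(
        List<TraiderItem> traiderItems, List<BaseItem> baseItems)
    {
        return traiderItems
            .AsParallel()
            .AsOrdered() // keep the source order of traiderItems in the result
            .Where(a => baseItems.Any(x =>
                x.DateUtc == a.DateUtc &&
                Math.Round(x.Fee, 8) == Math.Round(a.Fee * 0.4, 8)))
            .ToList();
    }
}
```

The total amount of work is unchanged: every trader item may still be compared against every base item, only spread across cores.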

Answer 2

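IMHO the main cost, Math.Round, can be reduced:

1. For x.Fee: create a facade object for the item and store the rounded value in it once, FeeRound = Math.Round(x.Fee, 8) (or add a FeeRound property to the item class itself). Since x comes from baseItems, this applies to BaseItem. As written, this Math.Round is called up to 1M * 20k times, and rounding is probably not a strong point of the compiler/CPU pair.
2. Turn the first lambda into a statement body, compute the rounded a.Fee once there, and pass it to baseItems.Any(...) as a captured variable, like this: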

traiderItems.Where(a => { var aFeeRound = Math.Round((double)a.Fee * 0.4, 8);
                      return baseItems
                      .Any(x =>
                         x.DateUtc == a.DateUtc && 
                         x.FeeRound == aFeeRound);})
        .ToList();

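This way Math.Round runs only once per item instead of once per comparison. Sorry for any mistakes, I had no time to test this. And sure, TPL is a good idea. Good luck!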
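
To make the two points above concrete, here is a rough sketch of the facade idea. BaseItemView, PrecomputedFilter and FindMatches are hypothetical names introduced only for this illustration, and the two item classes are trimmed-down copies of the question's classes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed-down copies of the question's classes (only the members used here).
public class BaseItem
{
    public double Fee { get; set; }
    public DateTime DateUtc { get; set; }
}

public class TraiderItem
{
    public double Fee { get; set; }
    public DateTime DateUtc { get; set; }
}

// Hypothetical facade: rounds the fee once, when the wrapper is created.
public sealed class BaseItemView
{
    public BaseItemView(BaseItem item)
    {
        DateUtc = item.DateUtc;
        FeeRound = Math.Round(item.Fee, 8);
    }

    public DateTime DateUtc { get; }
    public double FeeRound { get; }
}

public static class PrecomputedFilter
{
    public static List<TraiderItem> FindMatches(
        List<TraiderItem> traiderItems, List<BaseItem> baseItems)
    {
        // One Math.Round call per base item, done up front.
        var views = baseItems.Select(b => new BaseItemView(b)).ToList();

        return traiderItems.Where(a =>
        {
            // One Math.Round call per trader item, outside the inner Any().
            var aFeeRound = Math.Round(a.Fee * 0.4, 8);
            return views.Any(x => x.DateUtc == a.DateUtc && x.FeeRound == aFeeRound);
        }).ToList();
    }
}
```

This removes the repeated rounding but keeps the nested scan, so the number of comparisons is the same as in the original query.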

Answer 3

I have tested a few of the suggestions, and this is by far the fastest I could get:

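// Requires: using System; using System.Collections.Concurrent; using System.Collections.Generic;
//           using System.Diagnostics; using System.Linq; using System.Threading.Tasks;
private static void TestForPreCountingParallel(List<TraiderItem> traiderItems, List<BaseItem> baseItems)
{
    var watch = new Stopwatch();
    watch.Start();
    ConcurrentBag<TraiderItem> traiderItemsInBase = null;

    // Three runs; the stopwatch is never reset, so the printed times are cumulative.
    for (int i = 0; i < 3; i++)
    {
        traiderItemsInBase = new ConcurrentBag<TraiderItem>();

        // Round every base fee once up front instead of once per comparison.
        var baseFeesRounds = baseItems.Select(bi => Math.Round(bi.Fee, 8)).ToArray();

        // Outer loop over the trader items runs in parallel.
        Parallel.ForEach(traiderItems, traiderItem =>
        {
            // The 0.4 factor goes on the trader fee, as in the question's query.
            double traiderFeeRound = Math.Round(traiderItem.Fee * 0.4, 8);
            for (var index = 0; index < baseItems.Count; index++)
            {
                var baseItem = baseItems[index];
                if (traiderItem.DateUtc == baseItem.DateUtc && traiderFeeRound == baseFeesRounds[index])
                {
                    traiderItemsInBase.Add(traiderItem);
                    break; // the first match is enough, like Any()
                }
            }
        });

        Console.WriteLine(i + "," + watch.ElapsedMilliseconds);
    }

    watch.Stop();
    Console.WriteLine("base:{0},traid:{1},res:{2},time:{3}", baseItems.Count, traiderItems.Count,
        traiderItemsInBase.Count, watch.ElapsedMilliseconds);
}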

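Does anyone have another improvement?

For the things I have tried, the results look like this (times in milliseconds):

Original LINQ: base:100000,traid:20000,res:40,time:102544
Converted to foreach loops: base:100000,traid:20000,res:40,time:43890
Precomputed fees: base:100000,traid:20000,res:40,time:22661
Parallel outer loop: base:100000,traid:20000,res:40,time:6823

The absolute times are not important; the trend is what to look at. The benchmark is not perfect, and I have not experimented much with the ratio of TraiderItems found in BaseItems. As you can see, mine is quite low: 40 out of 100000.

So, just to see some different ratios:

base:100000,traid:20000,res:400,time:102417
base:100000,traid:20000,res:400,time:50842
base:100000,traid:20000,res:400,time:21754
base:100000,traid:20000,res:400,time:8296

And another:

base:100000,traid:20000,res:2000,time:118150
base:100000,traid:20000,res:2000,time:57832
base:100000,traid:20000,res:2000,time:21659
base:100000,traid:20000,res:2000,time:7350

I am not an expert, so I have to refer to other sources, such as: http://mattwarren.org/2016/09/29/Optimising-LINQ/

What's the problem with LINQ?

As outlined by Joe Duffy, LINQ introduces inefficiencies in the form of hidden allocations

So the conclusion is:

Do your own benchmark, and try some code changes first if you really care about performance. Just throwing brute force at inefficient code will cost someone money.

That said, I really like LINQ and use it frequently.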

Answer 4

You could precompute the rounded fee for both collections. If dates repeat a lot in the larger collection, you could also group the items by date.
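
A possible sketch of that idea, using ToLookup to group the precomputed rounded base fees by date. GroupedFilter and FindMatches are hypothetical names, and the item classes are trimmed-down copies of the question's classes; treat this as an illustration under those assumptions rather than a tuned implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed-down copies of the question's classes (only the members used here).
public class BaseItem
{
    public double Fee { get; set; }
    public DateTime DateUtc { get; set; }
}

public class TraiderItem
{
    public double Fee { get; set; }
    public DateTime DateUtc { get; set; }
}

// Hypothetical helper, not part of the original answer.
public static class GroupedFilter
{
    public static List<TraiderItem> FindMatches(
        IEnumerable<TraiderItem> traiderItems, IEnumerable<BaseItem> baseItems)
    {
        // Key: DateUtc. Values: the rounded fees of all base items with that date.
        ILookup<DateTime, double> feesByDate =
            baseItems.ToLookup(b => b.DateUtc, b => Math.Round(b.Fee, 8));

        return traiderItems.Where(t =>
        {
            double target = Math.Round(t.Fee * 0.4, 8);
            // A lookup returns an empty sequence for a missing key,
            // so no explicit existence check is needed.
            return feesByDate[t.DateUtc].Contains(target);
        }).ToList();
    }
}
```

Each trader item now scans only the base fees that share its date, so the benefit depends on how many base items fall on the same DateUtc.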

Answer 5

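Using LINQ this way, i.e. Any inside Where, is almost O(N^2) (more precisely O(N * K) for N base items and K trader items).

A better approach is to first create a HashSet whose key looks something like this: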

DateUtc.ToString("<format based on matching depth (e.g. date only, or up to minutes/seconds)>") + "_" + FeeRounded.ToString()

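Fill it from the whole BaseItem list (worst case, the HashSet will hold about 1 million entries). This is equivalent to one for loop.

Next, loop through all items in the TraiderItem collection (the smaller one), build the lookup key in the same way, and check whether it is in the HashSet. This is another for loop.

Net time complexity is roughly O(N) + O(K). This can be improved further by building the HashSet in advance or in parallel.

Space complexity is higher, but you have plenty of RAM these days :)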
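
The two loops described above might look roughly like the sketch below. Note one deliberate substitution: instead of the formatted string key, it uses a value tuple (DateUtc, rounded fee) as the key, which assumes the dates must match exactly, as in the question's query. HashFilter and FindMatches are hypothetical names, and the item classes are trimmed-down copies of the question's classes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed-down copies of the question's classes (only the members used here).
public class BaseItem
{
    public double Fee { get; set; }
    public DateTime DateUtc { get; set; }
}

public class TraiderItem
{
    public double Fee { get; set; }
    public DateTime DateUtc { get; set; }
}

// Hypothetical helper, not part of the original answer.
public static class HashFilter
{
    public static List<TraiderItem> FindMatches(
        IEnumerable<TraiderItem> traiderItems, IEnumerable<BaseItem> baseItems)
    {
        // Loop 1: one pass over the base items builds the key set.
        // Assumption: a tuple key replaces the string key from the answer.
        var keys = new HashSet<(DateTime, double)>(
            baseItems.Select(b => (b.DateUtc, Math.Round(b.Fee, 8))));

        // Loop 2: one pass over the trader items, one hash lookup each.
        return traiderItems
            .Where(t => keys.Contains((t.DateUtc, Math.Round(t.Fee * 0.4, 8))))
            .ToList();
    }
}
```

A call such as HashFilter.FindMatches(traiderItems, baseItems) would replace the Where/Any query from the question. Matching still relies on exact equality of the rounded doubles, exactly as the original == comparison does.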

Answer 6

If there are only a few BaseItem objects per date, you can group them by date in a dictionary:

    var baseItemsDic = new Dictionary<DateTime, List<BaseItem>>();
    foreach(var item in baseItems)
    {
        if (!baseItemsDic.ContainsKey(item.DateUtc))
            baseItemsDic.Add(item.DateUtc, new List<BaseItem>());
        baseItemsDic[item.DateUtc].Add(item);
    }


    var traiderItemsInBase = traiderItems.Where(a => baseItemsDic.ContainsKey(a.DateUtc) && baseItemsDic[a.DateUtc].Any(x => Math.Round(x.Fee, 8) == Math.Round((double)a.Fee * 0.4, 8))).ToList();

Filtering a large list of objects against data from another large list: slow performance
