Question

I have a DataTable containing 2.5M rows, and I want to filter out some of its rows.

The DataTable's columns:

[IntCode] long
[BDIntCode] long
[TxnDT] DateTime
[TxnQuantity] decimal
[RecordUser] long
[RecordDT] DateTime

My code is as follows:

            foreach (var down in breakDowns)
            {
                sw.Start();
                var relatedBreakDowns = firstGroup.Where(x => x.RelatedBDIntCode == down.ProcessingRowIntCode).ToList();
                if (relatedBreakDowns.Count == 0) continue;

                var filters = string.Format("BDIntCode IN ({0})", string.Join(",", relatedBreakDowns.Select(x => x.BDIntCode)));
                var filteredDatatable = datatable.Select(filters, "BDIntCode");
                foreach (var dataRow in filteredDatatable)
                {
                    var r = dataTableSchema.NewRow();
                    r["RecordUser"] = recordUser;
                    r["RecordDT"] = DateTime.Now;
                    r["TxnQuantity"] = dataRow["TxnQuantity"];
                    r["TxnDT"] = dataRow["TxnDT"];
                    r["BDIntCode"] = down.ProcessingRowIntCode;
                    dataTableSchema.Rows.Add(r);
                }
                sw.Stop();
                count++;
                Console.WriteLine("Group: " + unrelatedBreakDownGroup.RelatedBDGroupIntCode + ", Count : " + count + ", ElapsedTime : ms = " + sw.ElapsedMilliseconds + ", sec = " + sw.ElapsedMilliseconds / 1000f );
                sw.Reset();
            }

The breakDowns list has 1805 items and the firstGroup list has 9880 items.

Answer 1

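Personally, I would start with a List<SomeType> rather than a DataTable. Then I would index the data: in your case you are searching by RelatedBDIntCode and expecting multiple matches, so: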

var index = firstGroup.ToLookup(x => x.RelatedBDIntCode);
foreach (var down in breakDowns) {
    var matches = index[down.ProcessingRowIntCode].ToList();
    //...
}

This avoids a full scan of firstGroup for every item in breakDowns.

Next, the IN filter can be moved to a similar indexed lookup, this time presumably keyed on BDIntCode.
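A minimal sketch of what that second index might look like, assuming the BDIntCode column is a non-null long as in the question's schema. The names BdIntCodeIndexSketch, BuildIndex and RowsFor are made up for illustration, and the sample rows are invented. AsEnumerable and Field<T> come from the DataTable LINQ extensions (System.Data.DataSetExtensions on .NET Framework).

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

static class BdIntCodeIndexSketch
{
    // Build once: BDIntCode -> every row of the table carrying that code.
    public static ILookup<long, DataRow> BuildIndex(DataTable table) =>
        table.AsEnumerable().ToLookup(row => row.Field<long>("BDIntCode"));

    // Stand-in for Select("BDIntCode IN (...)", "BDIntCode"):
    // one hashed lookup per distinct code, visited in ascending code order.
    public static DataRow[] RowsFor(ILookup<long, DataRow> index, IEnumerable<long> codes) =>
        codes.Distinct().OrderBy(code => code).SelectMany(code => index[code]).ToArray();

    static void Main()
    {
        // Tiny invented table using two of the question's columns.
        var datatable = new DataTable();
        datatable.Columns.Add("BDIntCode", typeof(long));
        datatable.Columns.Add("TxnQuantity", typeof(decimal));
        datatable.Rows.Add(2L, 5m);
        datatable.Rows.Add(1L, 10m);
        datatable.Rows.Add(3L, 7m);
        datatable.Rows.Add(1L, 20m);

        var index = BuildIndex(datatable);
        foreach (var row in RowsFor(index, new long[] { 3, 1 }))
            Console.WriteLine(row["BDIntCode"] + " " + row["TxnQuantity"]);
    }
}
```

Building the lookup costs one pass over the 2.5M rows; after that, each code is a hash lookup rather than a table scan. A code with no rows simply yields an empty sequence, so no existence check is needed.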

Answer 2

Just to elaborate on Marc's answer: you should try to reduce the number of iterations your code performs.

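The way your code is currently written, you loop over the breakDowns collection 1805 times, and on each pass you iterate over the firstGroup collection 9880 times, for a total of 17,833,400 iterations, not counting the DataTable filter.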
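To make that figure concrete, here is a small sketch that counts predicate evaluations for a nested scan of the same shape. The data is synthetic and IterationCountSketch is a made-up name; only the two list sizes are taken from the question.

```csharp
using System;
using System.Linq;

static class IterationCountSketch
{
    // Counts how often the Where-style predicate runs when every
    // breakdown triggers a full scan of firstGroup.
    public static long CountPredicateEvaluations(int breakDownCount, int firstGroupCount)
    {
        // Synthetic stand-in for firstGroup: just the RelatedBDIntCode values.
        long[] firstGroup = Enumerable.Range(0, firstGroupCount)
            .Select(i => (long)(i % breakDownCount))
            .ToArray();

        long evaluations = 0;
        for (long code = 0; code < breakDownCount; code++)
        {
            // Count(predicate) visits every element, like Where(...).ToList().
            firstGroup.Count(x => { evaluations++; return x == code; });
        }
        return evaluations;
    }

    static void Main()
    {
        Console.WriteLine(CountPredicateEvaluations(1805, 9880)); // prints 17833400
    }
}
```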

So your approach should be to index the data up front, which cuts down the number of iterations performed.

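As a first step, you can build a dictionary that maps each RelatedBDIntCode to the matching rows of datatable. Then you can loop over breakDowns and pull out the mapped rows for each down, like this: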

var dtIndexed = 
    firstGroup
    .GroupBy(x => x.RelatedBDIntCode)
    .ToDictionary
    (
        x => x.Key, //the RelatedBDIntCode you'll be selecting with 
        x =>        //the mapped rows. This is the same method of filtering, but you could try others
        {
            var filters = string.Format("BDIntCode IN ({0})", string.Join(",", x.Select(y => y.BDIntCode)));
            return datatable.Select(filters, "BDIntCode");
        }    
    );

foreach (var down in breakDowns)
{
    // Skip breakdowns with no related rows; TryGetValue needs only one dictionary lookup.
    if (!dtIndexed.TryGetValue(down.ProcessingRowIntCode, out var rows)) continue;

    foreach (var row in rows)
    {
        var r = dataTableSchema.NewRow();
        r["RecordUser"] = recordUser;
        r["RecordDT"] = DateTime.Now;
        r["TxnQuantity"] = row["TxnQuantity"];
        r["TxnDT"] = row["TxnDT"];
        r["BDIntCode"] = down.ProcessingRowIntCode;
        dataTableSchema.Rows.Add(r);
    }
}

This approach should reduce the number of iterations your code performs, and so improve performance.

Note that in the code above I used exactly the same method of filtering the DataTable, i.e. datatable.Select(filter, order). You could also experiment with datatable.AsEnumerable().Where(row => ...).
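One possible shape for that alternative, as a sketch only: it assumes BDIntCode is a non-null long, and AsEnumerableFilterSketch and Filter are illustrative names. Whether it beats datatable.Select on 2.5M rows would need measuring, since it still scans the whole table on every call.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

static class AsEnumerableFilterSketch
{
    // LINQ counterpart of Select("BDIntCode IN (...)", "BDIntCode").
    public static DataRow[] Filter(DataTable table, IEnumerable<long> codes)
    {
        // A HashSet makes each membership test O(1) on average.
        var wanted = new HashSet<long>(codes);
        return table.AsEnumerable()
            .Where(row => wanted.Contains(row.Field<long>("BDIntCode")))
            .OrderBy(row => row.Field<long>("BDIntCode")) // mirrors the sort argument
            .ToArray();
    }

    static void Main()
    {
        // Tiny invented table with only the column being filtered on.
        var datatable = new DataTable();
        datatable.Columns.Add("BDIntCode", typeof(long));
        datatable.Rows.Add(4L);
        datatable.Rows.Add(2L);
        datatable.Rows.Add(7L);

        foreach (var row in Filter(datatable, new long[] { 7, 4 }))
            Console.WriteLine(row["BDIntCode"]);
    }
}
```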

Is there a faster way to filter a DataTable?