Question

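This is all about performance. I have two master lists of objects (here I'll use PEOPLE/PERSON as a stand-in). First, I need to filter one list by the First_Name property - then I need to create two filtered lists, one from each master list, based on shared dates - one list containing only one name, the other containing every name, but with both lists containing only matching date entries (no date in one list that doesn't exist in the other). I've written some pseudo-code below to reduce the issue to its core. Please bear in mind while reading that "birthday" wasn't the best choice of field, because each person has multiple date entries. So please pretend that each person has about 5,000 "birthdays" when reading the following code: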

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public class Person
{
    public string first_Name;
    public string last_Name;
    public DateTime birthday;
}
public class filter_People
{
    List<Person> Group_1 = new List<Person>();// filled from DB Table "1982 Graduates" Group_1 contains all names and all dates
    List<Person> Group_2 = new List<Person>();// filled from DB Table "1983 Graduates" Group_2 contains all names and all dates
    public void filter(List<Person> group_One, List<Person> group_Two)
    {
        Group_1 = group_One;
        Group_2 = group_Two;
        //create a list of distinct first names from Group_1
        List<string> distinct_Group_1_Name = Group_1.Select(p => p.first_Name).Distinct().ToList();

        //Compare each first name in Group_1 to EVERY first name in Group 2, using only records with matching birthdays
        Parallel.For(0, distinct_Group_1_Name.Count, dI => {
            //Step 1 - create a list of person out of group_1 that match the first name being iterated
            List<Person> first_Name_List_1 = Group_1.Where(m => m.first_Name == distinct_Group_1_Name[dI]).ToList();
            //first_Name_List_1 now contains a list of everyone named X (Tom). We need to find people from group 2 who match Tom's birthday - regardless of name

            //step 2 - find matching birthdays by JOINing the filtered name list against Group_2  
            DateTime[] merged_Dates = first_Name_List_1.Join(Group_2, d => d.birthday, b => b.birthday, (d, b) => b.birthday).ToArray();
            //Step 3 - create filtered lists where Filtered_Group_1 contains ONLY people named Tom, and Filtered_Group_2 contains people with ANY name sharing Tom's birthday. No duplicates, no missing dates.
            List<Person> Filtered_Group_1 = first_Name_List_1.Where(p => p.birthday.In(merged_Dates)).ToList();
            List<Person> Filtered_Group_2 = Group_2.Where(p => p.birthday.In(merged_Dates)).ToList();
            //Step 4 -- move on and process the two filtered lists (outside scope of question)
            //each name in Group_1 will then be compared to EVERY name in Group_2 sharing the same birthday
            //compare_Groups(Filtered_Group_1,Filtered_Group_2)

        });
    }
}
public static class Extension
{
    public static bool In<T>(this T source, params T[] list)
    {
        return list.Contains(source);
    }
}

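The idea here is to take two different master name lists from the DB and create sub-lists where the dates match (one with only one name, the other with all names), allowing a one-to-many comparison based on datasets of the same length with matching date indices. The original idea was simply to load the lists from the DB, but the lists are long, and loading all the name data and then using SELECT/WHERE/JOIN is much faster. I say "much faster", but that's relative.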

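I've tried converting Group_1 and Group_2 to Dictionaries and matching dates by key. Not much improvement. Group_1 has about 12 million records (about 4,800 distinct names, each with multiple dates), and Group_2 has about the same, so the input here is 12 million records and the output is enormous. Even though I run this method as a separate Task and queue the results for another thread to process, it still takes forever to split these lists and keep up.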

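Also, I realize this code doesn't make much sense using class Person, but it is only representative of the problem and is essentially pseudocode. In reality, this method sorts multiple datasets by date and compares one dataset against many.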

Any help on how to do the filtering for this one-to-many comparison more efficiently would be greatly appreciated.

Thanks!

Answer 1

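With the code in its current form, I see too many issues for it to be performance-oriented with the kind of data you describe. Parallelism is not a magic pill for a poor choice of algorithm and data structure.

Currently every comparison is a linear search, O(N), so M operations cost M*O(N). If these lookups were O(log N), or better still O(1), execution time would improve drastically.

Instead of taking Distinct and then searching with a Where clause inside the Parallel loop, use GroupBy to aggregate / group the records and create a Dictionary in the same operation, which makes it easy to look up the records with a given name: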
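To put rough numbers on that, here is a back-of-the-envelope sketch using the approximate figures from the question (about 4,800 distinct names and about 12 million records in Group_1). The class and method names are made up for illustration, and it counts only the Where scan in Step 1, nothing else:

```csharp
public static class ScanCostSketch
{
    // Step 1 scans all of Group_1 once per distinct name (the Where inside the loop).
    public static long RepeatedScanComparisons(long distinctNames, long recordCount)
    {
        return distinctNames * recordCount;
    }

    // A single grouping pass visits each record once, however many names there are.
    public static long SinglePassVisits(long recordCount)
    {
        return recordCount;
    }
}
```

With those figures, RepeatedScanComparisons(4800, 12000000) comes to 57,600,000,000 name comparisons, against 12,000,000 record visits for one grouping pass.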

var nameGroupList = Group_1.GroupBy(p => p.first_Name).ToDictionary(p => p.Key, p => p);

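This gets rid of the following two operations in the original code (the one inside the Parallel loop is repeated for every name, which hurts performance big time):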

List<string> distinct_Group_1_Name = Group_1.Select(p => p.first_Name).Distinct().ToList();

List<Person> first_Name_List_1 = Group_1.Where(m => m.first_Name == distinct_Group_1_Name[dI]).ToList();

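The Dictionary is effectively of type Dictionary<string,IEnumerable<Person>> (strictly, each value is an IGrouping<string,Person>, which implements IEnumerable<Person>), so you get the people with a given name in O(1) time with no repeated search. It also takes care of another issue in the code: re-creating the list each time by searching through the original list / data.

The next part that needs to be handled, because it hurts performance, is code like this: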
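A minimal sketch of that name index, assuming the Person class from the question and no null first names (a null key would make ToDictionary throw). BuildNameIndex is a hypothetical helper name, and materializing each group with ToList is my choice rather than something the one-liner above does:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Person
{
    public string first_Name;
    public string last_Name;
    public DateTime birthday;
}

public static class NameIndexSketch
{
    // Groups the records once; each name then maps straight to its own list.
    public static Dictionary<string, List<Person>> BuildNameIndex(IEnumerable<Person> group1)
    {
        return group1
            .GroupBy(p => p.first_Name)
            .ToDictionary(g => g.Key, g => g.ToList());
    }
}
```

After `var byName = NameIndexSketch.BuildNameIndex(Group_1);`, `byName["Tom"]` returns every Tom without rescanning Group_1, and the parallel loop can iterate over `byName` directly instead of over a list of distinct names.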

p.birthday.In(merged_Dates)

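In the extension method, list.Contains runs as an O(N) operation on every call, which kills performance. The following are possible options:

Take the following operation out of the Parallel loop as well: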

DateTime[] merged_Dates = first_Name_List_1.Join(Group_2, d => d.birthday, b => b.birthday, (d, b) => b.birthday).ToArray();

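Instead, create another Dictionary, of type Dictionary<string, HashSet<DateTime>>, by intersecting the data from the Dictionary<string,IEnumerable<Person>> created earlier with the data from Group2. You can supply an appropriate IEqualityComparer for DateTime, so a ready reckoner of the date list / array is available and need not be created every time. In pseudocode:

personDictionary["PersonCode"].Intersect(Group2,IEqualityComparer(using Date))

For the final result, note that you should store the result as a HashSet instead of a List. The benefit is that Contains on a HashSet is a hashed lookup, O(1) on average, instead of the O(N) scan of a List, which makes it much faster. Other hashed structures work as well and give the same constant-time lookup.

Try these points and let me know whether the code performs any better.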
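Putting the pieces together, here is one possible sketch, not a drop-in implementation. It assumes the Person class from the question, non-null first names, and exact DateTime equality (key on p.birthday.Date instead if only the calendar date should match). It also swaps in an ILookup over Group_2 keyed by date, a variation on the HashSet idea, so the second filtered list is fetched by key rather than by scanning Group_2 once per name. DateFilterSketch, FilterForName and the compareGroups callback are illustrative names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public class Person
{
    public string first_Name;
    public string last_Name;
    public DateTime birthday;
}

public static class DateFilterSketch
{
    // For one name: keep that name's records whose date occurs in Group_2,
    // and collect every Group_2 record that falls on those shared dates.
    public static (List<Person> Filtered1, List<Person> Filtered2) FilterForName(
        IEnumerable<Person> namedPeople,
        ILookup<DateTime, Person> group2ByDate)
    {
        // Hashed membership test per record instead of scanning an array.
        List<Person> filtered1 = namedPeople
            .Where(p => group2ByDate.Contains(p.birthday))
            .ToList();

        // Distinct shared dates, so each Group_2 record is emitted only once.
        var sharedDates = new HashSet<DateTime>(filtered1.Select(p => p.birthday));

        // Fetch Group_2 records by date key. The result is grouped by date,
        // not in Group_2's original order, and the date order is not guaranteed.
        List<Person> filtered2 = sharedDates
            .SelectMany(d => group2ByDate[d])
            .ToList();

        return (filtered1, filtered2);
    }

    // compareGroups stands in for the question's compare_Groups step.
    public static void Filter(
        List<Person> group1,
        List<Person> group2,
        Action<List<Person>, List<Person>> compareGroups)
    {
        // Both indexes are built once, before the loop, and only read inside it.
        Dictionary<string, List<Person>> byName = group1
            .GroupBy(p => p.first_Name)
            .ToDictionary(g => g.Key, g => g.ToList());
        ILookup<DateTime, Person> group2ByDate = group2.ToLookup(p => p.birthday);

        Parallel.ForEach(byName, pair =>
        {
            var (filtered1, filtered2) = FilterForName(pair.Value, group2ByDate);
            compareGroups(filtered1, filtered2);
        });
    }
}
```

The callback runs on multiple threads at once, so whatever it does with the two lists (for example, queuing them for another thread) needs to be thread-safe.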

Fastest way to filter two lists based on common field data points in a 1-many configuration
