C# custom grouping is really slow on large datasets

Asked: 2011-03-02 21:42:36

Tags: c# performance optimization

When I run the following code with campaigns.Count() at 200,000, it is very slow.

List<Campaign> listCampaigns = new List<Campaign>();
        foreach (var item in campaigns)
        {
            if (listCampaigns.Where(a => a.CampaignName == item.CampaignName && a.Term == item.Term).Count() == 0)
            {
                //this doesn't exist
                listCampaigns.Add(item);
            }
            else
            {
                //this exists already
                var campaign = listCampaigns.Where(a => a.CampaignName == item.CampaignName && a.Term == item.Term).First();
                campaign.TotalVisits += item.TotalVisits;
                List<Conversion> listConversions = item.Conversions.ToList();
                listConversions.AddRange(campaign.Conversions.ToList());
                campaign.Conversions = listConversions.ToArray();
            }
        }

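Is there a way to optimize parts of this code, or a different approach that would speed it up?

Any suggestions are appreciated. Thanks.

6 answers:

Answer 0 (score: 9)

This should be noticeably faster: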

List<Campaign> listCampaigns = new List<Campaign>();
foreach (var g in campaigns.GroupBy(c => new { c.CampaignName, c.Term }))
{
    var campaign = g.First();
    campaign.TotalVisits = g.Sum(x => x.TotalVisits);
    campaign.Conversions = g.SelectMany(c => c.Conversions).ToArray();
    listCampaigns.Add(campaign);
}
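
Note that the loop above writes the totals back into the first campaign of each group, so objects from the input sequence are modified. If that matters, a variant that builds new objects could look like the sketch below. The Campaign and Conversion definitions are assumptions (the question does not show them; Term and TotalVisits are taken to be int), and CampaignGrouping.Merge is a made-up name.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal types; the real classes are not shown in the question.
public class Conversion { }

public class Campaign
{
    public string CampaignName { get; set; }
    public int Term { get; set; }          // assumed to be an int
    public int TotalVisits { get; set; }   // assumed to be an int
    public Conversion[] Conversions { get; set; }
}

public static class CampaignGrouping
{
    // Merges campaigns that share (CampaignName, Term) into new Campaign
    // objects and leaves the input objects untouched.
    public static List<Campaign> Merge(IEnumerable<Campaign> campaigns)
    {
        return campaigns
            .GroupBy(c => new { c.CampaignName, c.Term })
            .Select(g => new Campaign
            {
                CampaignName = g.Key.CampaignName,
                Term = g.Key.Term,
                TotalVisits = g.Sum(c => c.TotalVisits),
                Conversions = g.SelectMany(c => c.Conversions).ToArray()
            })
            .ToList();
    }
}
```

The anonymous type used as the grouping key compares by value, which is what lets GroupBy match campaigns on both properties at once.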

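Answer 1 (score: 1)

Use a Dictionary<Tuple<string,Term>,Campaign>. You can put CampaignName and Term into a tuple and use it to look up an existing Campaign in O(1). That makes the whole code O(n).

Your current code is O(n^2), because it has to walk the entire list to check whether the current entry already exists.

The code should look something like this: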

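var dict = new Dictionary<Tuple<string, Term>, Campaign>();
foreach (var item in campaigns)
{
    // The key combines both properties that identify a campaign
    var currentKey = new Tuple<string, Term>(item.CampaignName, item.Term);
    Campaign existingCampaign;
    if (dict.TryGetValue(currentKey, out existingCampaign))
    {
        // already exists: merge item into existingCampaign
    }
    else
    {
        // new: first campaign seen for this key
        dict.Add(currentKey, item);
    }
}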
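
A fuller, self-contained version of this dictionary approach might look like the sketch below. It is only an illustration: the Campaign and Conversion classes are assumed (the question does not show them), Term is taken to be an int here, and CampaignMerger.Merge is a made-up name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal types; the real classes are not shown in the question.
public class Conversion { }

public class Campaign
{
    public string CampaignName { get; set; }
    public int Term { get; set; }          // assumed to be an int
    public int TotalVisits { get; set; }   // assumed to be an int
    public Conversion[] Conversions { get; set; }
}

public static class CampaignMerger
{
    // One pass over the input with a hash lookup per item.
    // Like the original loop, this accumulates into the first
    // campaign seen for each key, so that object is modified.
    public static List<Campaign> Merge(IEnumerable<Campaign> campaigns)
    {
        var dict = new Dictionary<Tuple<string, int>, Campaign>();
        foreach (var item in campaigns)
        {
            var key = Tuple.Create(item.CampaignName, item.Term);
            Campaign existing;
            if (dict.TryGetValue(key, out existing))
            {
                existing.TotalVisits += item.TotalVisits;
                existing.Conversions = existing.Conversions.Concat(item.Conversions).ToArray();
            }
            else
            {
                dict.Add(key, item);
            }
        }
        return dict.Values.ToList();
    }
}
```

Tuple compares its components by value, so two tuples built from the same name and term hit the same dictionary entry.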

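Answer 2 (score: 1)

Is there any way you can avoid converting the 200,000 campaign items into a concrete list before adding them to the main list?

I would:

  • Replace Where().Count() with Any(), which gives the correct answer faster in the general case.
  • Refactor away the ToList() calls; they take the source collection and clone it into a new collection instance, which is very expensive in time and memory, especially in a loop like this one. You are creating two Lists and an Array on every iteration; stop!

Here is the new code: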

List<Campaign> listCampaigns = new List<Campaign>();
    foreach (var item in campaigns)
    {
        if (!listCampaigns.Any(a => a.CampaignName == item.CampaignName && a.Term == item.Term))
        {
            //this doesn't exist
            listCampaigns.Add(item);
        }
        else
        {
            //this exists already
            var campaign = listCampaigns.First(a => a.CampaignName == item.CampaignName && a.Term == item.Term);
            campaign.TotalVisits += item.TotalVisits;
            //Reduces the number of collection copies created per iteration from 3 to 1
            campaign.Conversions = campaign.Conversions.Concat(item.Conversions).ToArray();
        }
    }

Answer 3 (score: 1)

In that piece of code:

    foreach (var item in campaigns)
    {
        var campaign = listCampaigns.FirstOrDefault(a => a.CampaignName == item.CampaignName && a.Term == item.Term);

        if (campaign == null)
        {
            //this doesn't exist
            listCampaigns.Add(item);
        }
        else
        {
            //this exists already
            campaign.TotalVisits += item.TotalVisits;
            List<Conversion> listConversions = item.Conversions.ToList();
            listConversions.AddRange(campaign.Conversions.ToList());
            campaign.Conversions = listConversions.ToArray();
        }
    }

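Using FirstOrDefault avoids walking the list more than once per item. Also, you will most likely not evaluate the whole list every time, which saves additional time.

Answer 4 (score: 0)

At the very least, use Any() instead of Count(); that way you do not have to look through the entire list: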

if (listCampaigns.Where(a => a.CampaignName == item.CampaignName 
                        && a.Term == item.Term).Any())

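Also, as others have pointed out, a Dictionary is much faster for quick lookups; you have to define a unique key for each Campaign, and then you can look campaigns up by that key.

Answer 5 (score: 0)

Use a Dictionary<TKey,Campaign>. That way you use a hash table to check whether a value exists and to find the corresponding value in O(1).

Code example: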

var dictCampaigns = new Dictionary<Key, Campaign>();
foreach (var item in campaigns)
{
    Campaign found;
    var key = new Key(item);
    if(!dictCampaigns.TryGetValue(key,out found))
    {
        dictCampaigns.Add(key, item);
    }
    else
    {
        found.TotalVisits += item.TotalVisits;
        found.Conversions = (item.Conversions.Concat(found.Conversions)).ToArray();
    }
}

I used a Key struct on the assumption that you may not be able to use Tuple:

struct Key
{
    public readonly string Name;
    public readonly int Term;

    public Key(Campaign camp)
    {
        Name = camp.CampaignName;
        Term = camp.Term;    
    }
}

I measured it roughly with Stopwatch, and it is about twice as fast as your code, but I think there is still room for optimization.
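
One candidate for further optimization, not measured here: a struct used as a dictionary key without its own Equals and GetHashCode falls back to the default ValueType implementations, which are generally slower than hand-written ones. A sketch of a key that implements IEquatable<Key> follows. The constructor taking a name and a term directly (instead of a Campaign) and the particular hash combination are illustrative choices, and Term is again assumed to be an int.

```csharp
using System;

public struct Key : IEquatable<Key>
{
    public readonly string Name;
    public readonly int Term;

    public Key(string name, int term)
    {
        Name = name;
        Term = term;
    }

    // Strongly typed comparison; the dictionary's default comparer
    // calls this directly, without boxing the struct.
    public bool Equals(Key other)
    {
        return Term == other.Term && string.Equals(Name, other.Name);
    }

    public override bool Equals(object obj)
    {
        return obj is Key && Equals((Key)obj);
    }

    // Combines both fields so that keys differing in either one
    // usually land in different buckets.
    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + (Name == null ? 0 : Name.GetHashCode());
            hash = hash * 31 + Term;
            return hash;
        }
    }
}
```

With this version the loop would build keys as new Key(item.CampaignName, item.Term); whether it helps noticeably would need to be timed on the real data.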