Question

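I'm working on a solution that takes an input and produces a list of bigrams along with a count for each bigram. Performance was poor for large inputs: an input of 460,000 characters and 84,000 words took about 42 seconds to process. I changed the code and it now performs fine, but I'm not sure what caused the performance problem.

The commented-out code is the problem. I thought it would be better to build the bigrams and count the occurrences of each one in a single loop (instead of two), but I was wrong. Getting the index of an item in a list by passing the result of List.Where() as the item argument appears to be inefficient. Why? Is the predicate evaluated for every item in the list, even with FirstOrDefault()?

My only thought: even if the predicate is not evaluated for every item in the list, I can see why List.IndexOf(List.Where()) would be slower. With 84,000 items in the list, FirstOrDefault() has to loop (I assume) until it finds the first match, which could be at index 0 or at index 83,999, and that is repeated for every item in the list.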

public class Bigram
{
    public string Phrase { get; set; }
    public int Count { get; set; }
}

public List<Bigram> GetSequence(string[] words)
{
    List<Bigram> bigrams = new List<Bigram>();
    List<string> bigramsTemp = new List<string>();

    for (int i = 0; i < words.Length - 1; i++)
    {
        if (string.IsNullOrWhiteSpace(words[i]) == false)
        {
            bigramsTemp.Add(words[i] + " " + words[i + 1]);

            //Bigram bigram = new Bigram()
            //{
            //    Phrase = words[i] + " " + words[i + 1]
            //};

            //bigrams.Add(bigram);

            //var matches = bigrams.Where(p => p.Phrase == bigram.Phrase).Count();

            //if (matches == 0)
            //{
            //    bigram.Count = 1;
            //    bigrams.Add(bigram);
            //}
            //else
            //{
            //    int bigramToEdit =
            //        bigrams.IndexOf(
            //            bigrams.Where(b => b.Phrase == bigram.Phrase).FirstOrDefault());
            //    bigrams[bigramToEdit].Count += 1;
            //}
        }
    }

    var sequences = bigramsTemp.GroupBy(i => i);

    foreach (var s in sequences)
    {
        bigrams.Add(
            new Bigram()
            {
                Phrase = s.Key,
                Count = s.Count()
            });
    }

    return bigrams;
}

Answer 1

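Start from your initial code, which makes up to three passes over the bigrams list for every word (one full pass for Count(), plus partial passes for Where().FirstOrDefault() and IndexOf()):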

var matches = bigrams.Where(p => p.Phrase == bigram.Phrase).Count();

if (matches == 0)
{
    bigram.Count = 1;
    bigrams.Add(bigram);
}
else
{
    int bigramToEdit = 
     bigrams.IndexOf(
       bigrams.Where(b => b.Phrase == bigram.Phrase).FirstOrDefault());
    bigrams[bigramToEdit].Count += 1;
}

Change it to the following, which makes at most one pass over the bigrams list per word while keeping the logic the same:

var match = bigrams.FirstOrDefault(b => b.Phrase == bigram.Phrase);
if (match == null)
{
    //match == null means that it does not exist in the array, which is equivalent with Count == 0
    bigram.Count = 1;
    bigrams.Add(bigram);
}
else
{
    //changing the value of match.Count is essentially the same as querying the match again using IndexOf and Where
    match.Count += 1;
}

Let me know how it performs after the change.

Answer 2

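bigrams.Where().FirstOrDefault() loops through the bigrams list until it finds the first match.

Then bigrams.IndexOf() loops through the list again to find the index.

That happens after bigrams.Where().Count() has already looped through the entire list.

And all of this is repeated for every word.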
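
To make the cost of each call concrete, here is a minimal sketch that counts how often the predicate runs. The class LazyWhereDemo, the method CountPredicateCalls, and the sample data are invented for illustration and are not part of the original code. It relies on Where being lazily evaluated: FirstOrDefault stops pulling items at the first match, while Count has to consume the whole sequence.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class LazyWhereDemo
{
    // Returns how many times the predicate ran for each query shape.
    public static (int firstOrDefaultCalls, int countCalls) CountPredicateCalls(
        List<string> items, string target)
    {
        int firstCalls = 0;
        // FirstOrDefault pulls items one at a time and stops at the first match.
        items.Where(x => { firstCalls++; return x == target; }).FirstOrDefault();

        int countCalls = 0;
        // Count must enumerate everything, so the predicate runs for every item.
        items.Where(x => { countCalls++; return x == target; }).Count();

        return (firstCalls, countCalls);
    }
}
```

With the list a, b, c, d and the target b, the FirstOrDefault query runs the predicate twice and the Count query runs it four times. So the predicate is not evaluated for every item when FirstOrDefault is used, but the Count() call in the original code still pays for a full pass on every word.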

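Some ways to speed this up:

You can use the fact that FirstOrDefault returns null when there is no match, so you can skip the Count.
There is an overload of Where that uses the index, so you could also skip the extra IndexOf step. But you don't need it (as mylee saw), because you already have the bigram you want to update.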
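
If you did want the index without a separate IndexOf pass, one option is sketched below. Note that this swaps in List<T>.FindIndex rather than the Where overload mentioned above; FindIndex scans once and returns -1 when nothing matches. The names IndexSketch and AddOrIncrement are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public class Bigram
{
    public string Phrase { get; set; }
    public int Count { get; set; }
}

// Hypothetical helper, for illustration only.
public static class IndexSketch
{
    public static void AddOrIncrement(List<Bigram> bigrams, string phrase)
    {
        // Single scan: yields the position of the first match, or -1 if none.
        int index = bigrams.FindIndex(b => b.Phrase == phrase);

        if (index < 0)
        {
            bigrams.Add(new Bigram { Phrase = phrase, Count = 1 });
        }
        else
        {
            bigrams[index].Count += 1;
        }
    }
}
```

This is still a linear scan of the list for every word, so it only removes the redundant passes; it does not change the overall growth of the work as the list gets longer.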

Answer 3

As an addition to the answers from @hans-kesting and @mylee, moving to a dictionary will help your code further:

IDictionary<string, int> bigramsDict = new Dictionary<string, int>();

for (int i = 0; i < words.Length - 1; i++)
{
    if (string.IsNullOrWhiteSpace(words[i]))
    {
        continue;
    }

    string key = words[i] + " " + words[i + 1];
    if (!bigramsDict.ContainsKey(key))
        bigramsDict.Add(key, 1);
    else
        bigramsDict[key]++;
}

If you don't want to change the public signature, you need to convert to a list using:

foreach (var item in bigramsDict)
{
    bigrams.Add(new Bigram { Phrase = item.Key, Count = item.Value });
}

return bigrams;

Performance test

Results in milliseconds:

Original code: 163835.0242
Dictionary code: 23.76
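
For reference, the dictionary approach can be assembled into one self-contained method. This is only a sketch under a few assumptions: the static class BigramSketch is a made-up container, and it uses TryGetValue instead of ContainsKey so each word costs one lookup plus one write. The output order follows dictionary enumeration, which should not be relied on.

```csharp
using System;
using System.Collections.Generic;

public class Bigram
{
    public string Phrase { get; set; }
    public int Count { get; set; }
}

// Hypothetical container class, for illustration only.
public static class BigramSketch
{
    public static List<Bigram> GetSequence(string[] words)
    {
        var counts = new Dictionary<string, int>();

        for (int i = 0; i < words.Length - 1; i++)
        {
            if (string.IsNullOrWhiteSpace(words[i]))
            {
                continue;
            }

            string key = words[i] + " " + words[i + 1];

            // TryGetValue leaves current at 0 when the key is not present yet.
            counts.TryGetValue(key, out int current);
            counts[key] = current + 1;
        }

        // Convert back to a list so the public signature stays the same.
        var bigrams = new List<Bigram>(counts.Count);
        foreach (var pair in counts)
        {
            bigrams.Add(new Bigram { Phrase = pair.Key, Count = pair.Value });
        }

        return bigrams;
    }
}
```

For example, the words "the cat the cat sat" give three distinct bigrams, with "the cat" counted twice.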

Why is there a performance problem when using List.IndexOf(List.Where()) inside a loop?
