Question

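I am trying to optimize searching for a string in a large text file (300-600 MB). With my current method it takes too long.

Currently I use IndexOf to search for the string, but building an index of every line that contains the string takes far too long (20 seconds).

How can I speed up the search? I have tried Contains(), but that is slow as well. Any suggestions? I was considering regex matching, but I don't see it giving a significant speed boost. Maybe my search logic is flawed.

Example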

while ((line = myStream.ReadLine()) != null)
{
    if (line.IndexOf(CompareString, StringComparison.OrdinalIgnoreCase) >= 0)
    {
        LineIndex.Add(CurrentPosition);
        LinesCounted += 1;
    }
}

Answer 1

The brute-force algorithm you are using runs in O(nm) time in the worst case, where n is the length of the text being searched and m is the length of the substring/pattern you are looking for. You need to use a string search algorithm:

Boyer-Moore is "the standard", I think: http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
But there are more: http://www-igm.univ-mlv.fr/~lecroq/string/
Including Morris-Pratt: http://www.stoimen.com/blog/2012/04/09/computer-algorithms-morris-pratt-string-searching/
And Knuth-Morris-Pratt: http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
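To give a feel for how these algorithms skip ahead instead of checking every position, here is a minimal sketch of Boyer-Moore-Horspool, a simplified variant of Boyer-Moore (not the full algorithm). The class name HorspoolSearch is made up for illustration, and the comparison is ordinal and case-sensitive, unlike the OrdinalIgnoreCase call in the question.

```csharp
using System;
using System.Collections.Generic;

static class HorspoolSearch
{
    // Returns the index of the first occurrence of pattern in text
    // (ordinal, case-sensitive), or -1 if there is none.
    public static int IndexOf(string text, string pattern)
    {
        int n = text.Length, m = pattern.Length;
        if (m == 0) return 0;
        if (m > n) return -1;

        // Shift table: for each character in the pattern except the last one,
        // the distance from its last occurrence to the end of the pattern.
        var shift = new Dictionary<char, int>();
        for (int i = 0; i < m - 1; i++)
            shift[pattern[i]] = m - 1 - i;

        int pos = 0;
        while (pos <= n - m)
        {
            // Compare the pattern against the current window, right to left.
            int j = m - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) j--;
            if (j < 0) return pos;

            // Skip ahead based on the text character under the pattern's last position.
            char last = text[pos + m - 1];
            pos += shift.TryGetValue(last, out int s) ? s : m;
        }
        return -1;
    }
}
```

Calling HorspoolSearch.IndexOf(line, CompareString) >= 0 would stand in for the IndexOf test in the question's loop. Whether it actually beats the built-in IndexOf has to be measured on real data.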

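However, a well-crafted regular expression may be good enough, depending on what you are trying to find. See Jeffrey Friedl's book, Mastering Regular Expressions, for help on building efficient regular expressions (e.g., no backtracking).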
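As a rough illustration, a plain substring search can be expressed as a regex with no alternation or quantifiers, so there is nothing to backtrack over. The sketch below assumes the search string should be matched literally and case-insensitively; the class name LiteralRegex is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class LiteralRegex
{
    // Builds a regex that matches the search string literally, ignoring case.
    // Regex.Escape neutralizes metacharacters such as '.' or '*'.
    public static Regex Build(string literal) =>
        new Regex(Regex.Escape(literal),
                  RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.Compiled);
}
```

Build the Regex once, outside the read loop, and call IsMatch(line) for each line. RegexOptions.Compiled trades a slower construction for faster matching, which only pays off when the same pattern is reused many times.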

You might also want to consult a good algorithms text. I'm partial to Robert Sedgewick's Algorithms in its various incarnations (Algorithms in [C|C++|Java]).

Answer 2

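Unfortunately, I don't think there is a whole lot you can do in straight C#.

I have found the Boyer-Moore algorithm to be extremely fast for this task. But I found there was no way to make even that as fast as IndexOf. My assumption is that this is because IndexOf is implemented in hand-optimized assembler, while my code ran in C#.

You can see my code and performance test results in the article Fast Text Search with Boyer-Moore.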

Answer 3

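Have you seen these questions (and answers)?

If all you want to do is read through the text file, doing it the way you are now seems to be the way to go. Other ideas:

If it is possible to pre-sort the data, for example when it gets inserted into the text file, that could help.
You could insert the data into a database and query it as needed.
You could use a hash table.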
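The hash table idea could look roughly like the sketch below: build the index once, then answer each lookup without rescanning the file. It assumes that searches are for whole whitespace-separated words rather than arbitrary substrings, which may not match the original requirement; the name WordIndex is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class WordIndex
{
    // Maps each whitespace-separated word (case-insensitive) to the
    // zero-based numbers of the lines that contain it.
    public static Dictionary<string, List<int>> Build(TextReader reader)
    {
        var index = new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);
        string line;
        int lineNumber = 0;
        while ((line = reader.ReadLine()) != null)
        {
            // A null separator array splits on any whitespace.
            foreach (string word in line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
            {
                if (!index.TryGetValue(word, out List<int> lines))
                    index[word] = lines = new List<int>();
                // Avoid recording the same line twice for a repeated word.
                if (lines.Count == 0 || lines[lines.Count - 1] != lineNumber)
                    lines.Add(lineNumber);
            }
            lineNumber++;
        }
        return index;
    }
}
```

For a file on disk, pass a StreamReader to Build. The index costs memory roughly proportional to the number of distinct words and their occurrences, so it only pays off when the same file is searched many times.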

Answer 4

You can use Regex.Match(String). Regex matching may be faster, though that is worth measuring against IndexOf on your own data.

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string text = "One car red car blue car";
        string pat = @"(\w+)\s+(car)";

        // Instantiate the regular expression object.
        Regex r = new Regex(pat, RegexOptions.IgnoreCase);

        // Match the regular expression pattern against a text string.
        Match m = r.Match(text);
        int matchCount = 0;
        while (m.Success)
        {
            Console.WriteLine("Match" + (++matchCount));
            // Group 1 is the word before "car"; group 2 is "car" itself.
            for (int i = 1; i <= 2; i++)
            {
                Group g = m.Groups[i];
                Console.WriteLine("Group" + i + "='" + g + "'");
                CaptureCollection cc = g.Captures;
                for (int j = 0; j < cc.Count; j++)
                {
                    Capture c = cc[j];
                    Console.WriteLine("Capture" + j + "='" + c + "', Position=" + c.Index);
                }
            }
            // Advance to the next match in the same input.
            m = m.NextMatch();
        }
    }
}

C# searching a large text file
