Question

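What is going on behind the scenes when you mark a regular expression as one to be compiled? How does this differ from a cached regular expression?

Using this information, how do you determine when the cost of compilation is negligible compared to the performance gain?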

Answer 1

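RegexOptions.Compiled instructs the regular expression engine to compile the regular expression into IL using lightweight code generation (LCG). This compilation happens while the object is being constructed and slows construction down heavily. In return, matches that use the regular expression are faster.

If you do not specify this flag, your regular expression is considered "interpreted".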
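As a minimal sketch of the two modes, the snippet below builds the same pattern once without the flag and once with RegexOptions.Compiled. The class and method names are made up for illustration; only Regex, RegexOptions.Compiled and IsMatch come from the framework. Both objects should give identical match results; only the construction cost and the per-match cost differ.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CompiledVsInterpreted
{
    // Builds the same pattern in both modes and checks that they agree on the input.
    public static bool BothAgree(string input)
    {
        var interpreted = new Regex(@"^\d+$");                      // cheap to construct
        var compiled = new Regex(@"^\d+$", RegexOptions.Compiled);  // emits IL during construction
        return interpreted.IsMatch(input) == compiled.IsMatch(input);
    }

    public static void Main()
    {
        Console.WriteLine(BothAgree("8378373"));
        Console.WriteLine(BothAgree("3873783z"));
    }
}
```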

Take this example:

using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class Program
{
    // Runs func once as a warm-up, then times `times` further invocations.
    public static void TimeAction(string description, int times, Action func)
    {
        // warmup
        func();

        var watch = new Stopwatch();
        watch.Start();
        for (int i = 0; i < times; i++)
        {
            func();
        }
        watch.Stop();
        Console.Write(description);
        Console.WriteLine(" Time Elapsed {0} ms", watch.ElapsedMilliseconds);
    }

    static void Main(string[] args)
    {
        var simple = "^\\d+$";
        // Note: this pattern parses question URLs; it is labelled and timed as the "simple email" case below.
        var medium = @"^((to|from)\W)?(?<url>http://[\w\.:]+)/questions/(?<questionId>\d+)(/(\w|-)*)?(/(?<answerId>\d+))?";
        var complex = @"^(([^<>()[\]\\.,;:\s@""]+"
          + @"(\.[^<>()[\]\\.,;:\s@""]+)*)|("".+""))@"
          + @"((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"
          + @"\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+"
          + @"[a-zA-Z]{2,}))$";

        string[] numbers = new string[] { "1", "two", "8378373", "38737", "3873783z" };
        string[] emails = new string[] { "sam@sam.com", "sss@s", "sjg@ddd.com.au.au", "onelongemail@oneverylongemail.com" };

        foreach (var item in new[] {
            new {Pattern = simple, Matches = numbers, Name = "Simple number match"},
            new {Pattern = medium, Matches = emails, Name = "Simple email match"},
            new {Pattern = complex, Matches = emails, Name = "Complex email match"}
        })
        {
            int i = 0;
            Regex regex;

            // One-off use: construct an interpreted Regex for every match.
            TimeAction(item.Name + " interpreted uncached single match (x1000)", 1000, () =>
            {
                regex = new Regex(item.Pattern);
                regex.Match(item.Matches[i++ % item.Matches.Length]);
            });

            // One-off use: construct (and compile) a Regex for every match.
            i = 0;
            TimeAction(item.Name + " compiled uncached single match (x1000)", 1000, () =>
            {
                regex = new Regex(item.Pattern, RegexOptions.Compiled);
                regex.Match(item.Matches[i++ % item.Matches.Length]);
            });

            // Reuse: construct the interpreted Regex once, then match repeatedly.
            regex = new Regex(item.Pattern);
            i = 0;
            TimeAction(item.Name + " prepared interpreted match (x1000000)", 1000000, () =>
            {
                regex.Match(item.Matches[i++ % item.Matches.Length]);
            });

            // Reuse: construct the compiled Regex once, then match repeatedly.
            regex = new Regex(item.Pattern, RegexOptions.Compiled);
            i = 0;
            TimeAction(item.Name + " prepared compiled match (x1000000)", 1000000, () =>
            {
                regex.Match(item.Matches[i++ % item.Matches.Length]);
            });
        }
    }
}

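It performs 4 tests on 3 different regular expressions. First, it tests a single one-off match (compiled vs. non-compiled). Second, it tests repeated matches that reuse the same regular expression.

The results on my machine (compiled in release mode, no debugger attached):

1000 single matches (construct Regex, Match and dispose)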

Type        | Platform | Trivial Number | Simple Email Check | Ext Email Check
------------------------------------------------------------------------------
Interpreted | x86      |    4 ms        |    26 ms           |    31 ms
Interpreted | x64      |    5 ms        |    29 ms           |    35 ms
Compiled    | x86      |  913 ms        |  3775 ms           |  4487 ms
Compiled    | x64      | 3300 ms        | 21985 ms           | 22793 ms

1,000,000 matches - reusing the Regex object

Type        | Platform | Trivial Number | Simple Email Check | Ext Email Check
------------------------------------------------------------------------------
Interpreted | x86      |  422 ms        |   461 ms           |  2122 ms
Interpreted | x64      |  436 ms        |   463 ms           |  2167 ms
Compiled    | x86      |  279 ms        |   166 ms           |  1268 ms
Compiled    | x64      |  281 ms        |   176 ms           |  1180 ms

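These results show that compiled regular expressions can be up to about 60% faster when you reuse the Regex object. However, they can be two to three orders of magnitude slower to construct (up to roughly 750 times slower in the x64 runs above).

They also show that the x64 version of .NET can be 5 to 6 times slower than x86 when it comes to compiling regular expressions.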

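The recommendation would be to use the compiled version in cases where either:

1) You do not care about object initialization cost and need the extra performance boost (note that we are talking about fractions of a millisecond here).
2) You care a little about initialization cost, but you reuse the Regex object so many times that it pays for itself over your application's life cycle.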
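To put a rough number on "so many times", one can divide the extra construction cost by the per-match saving. The sketch below does that with the x86 figures from the two tables above; the class and method names are hypothetical, and the result is only as good as those measurements (the single-match runs also include one Match call each, so the construction cost is slightly overstated).

```csharp
using System;

public static class BreakEven
{
    // All *Ms arguments are total elapsed milliseconds taken from the tables above.
    // Returns the number of matches after which compiling has paid for itself.
    public static double Matches(double compiledBuildMs, double interpretedBuildMs, int builds,
                                 double interpretedMatchMs, double compiledMatchMs, int matches)
    {
        double extraBuildCost = (compiledBuildMs - interpretedBuildMs) / builds;   // ms per construction
        double savingPerMatch = (interpretedMatchMs - compiledMatchMs) / matches;  // ms per match
        return extraBuildCost / savingPerMatch;
    }

    public static void Main()
    {
        // x86 rows: trivial number, simple email check, extended email check.
        Console.WriteLine(Matches(913, 4, 1000, 422, 279, 1000000));
        Console.WriteLine(Matches(3775, 26, 1000, 461, 166, 1000000));
        Console.WriteLine(Matches(4487, 31, 1000, 2122, 1268, 1000000));
    }
}
```

On these figures the break-even point lands somewhere between roughly 5,000 and 13,000 matches per Regex object, which is why compilation only makes sense for long-lived, heavily reused instances.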

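Spanner in the works, the Regex cache

The regular expression engine contains an LRU cache that holds the last 15 regular expressions that were tested using the static methods on the Regex class.

For example, Regex.Replace, Regex.Match, etc. all use the Regex cache.

The size of the cache can be increased by setting Regex.CacheSize. It accepts changes in size at any time during your application's life cycle.

New regular expressions are only cached by the static helpers on the Regex class. If you construct your own objects, the cache is checked (for reuse, and the entry is bumped), but the regular expression you construct is not added to the cache.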
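A small sketch of the knobs described above: Regex.CacheSize and the static Regex.IsMatch helper are real framework members, while the class and method names here are made up for illustration. The sketch only grows the cache, never shrinks it, which is an arbitrary choice for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CacheDemo
{
    // Raises Regex.CacheSize to at least newSize and returns the resulting size.
    public static int GrowCache(int newSize)
    {
        if (newSize > Regex.CacheSize)
        {
            Regex.CacheSize = newSize;
        }
        return Regex.CacheSize;
    }

    public static void Main()
    {
        Console.WriteLine("Cache size at startup: " + Regex.CacheSize);

        // Static helper: the pattern is looked up in (and added to) the shared cache.
        bool viaStatic = Regex.IsMatch("38737", @"^\d+$");

        // Instance: you own the object and its lifetime yourself.
        var own = new Regex(@"^\d+$");
        bool viaInstance = own.IsMatch("38737");

        Console.WriteLine(viaStatic == viaInstance);
        Console.WriteLine("Cache size now: " + GrowCache(50));
    }
}
```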

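This cache is a trivial LRU cache, implemented using a simple doubly linked list. If you happen to increase it to 5000 and make 5000 different calls on the static helpers, every regular expression construction will crawl the 5000 entries to see whether it has previously been cached. There is a lock around the check, so the check can reduce parallelism and introduce thread blocking.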
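To make the cost model concrete, here is a toy cache with the same shape: a doubly linked list, a linear scan, and a single lock. This is not the framework's actual code, and TinyRegexCache and its members are hypothetical names; it only illustrates why a very large cache size makes every lookup slower and serializes callers.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public sealed class TinyRegexCache
{
    private readonly LinkedList<KeyValuePair<string, Regex>> _entries =
        new LinkedList<KeyValuePair<string, Regex>>();
    private readonly object _gate = new object();
    private readonly int _capacity;

    public TinyRegexCache(int capacity) { _capacity = capacity; }

    public int Count
    {
        get { lock (_gate) { return _entries.Count; } }
    }

    public Regex GetOrAdd(string pattern)
    {
        lock (_gate) // every caller queues up here
        {
            // Linear scan: cost grows with the number of cached entries.
            for (var node = _entries.First; node != null; node = node.Next)
            {
                if (node.Value.Key == pattern)
                {
                    _entries.Remove(node);
                    _entries.AddFirst(node); // bump to most recently used
                    return node.Value.Value;
                }
            }

            var regex = new Regex(pattern);
            _entries.AddFirst(new KeyValuePair<string, Regex>(pattern, regex));
            if (_entries.Count > _capacity)
            {
                _entries.RemoveLast(); // evict the least recently used entry
            }
            return regex;
        }
    }

    public static void Main()
    {
        var cache = new TinyRegexCache(2);
        Regex first = cache.GetOrAdd(@"^\d+$");
        Console.WriteLine(ReferenceEquals(first, cache.GetOrAdd(@"^\d+$")));
    }
}
```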

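The number is set quite low to protect you from cases like this, though in some cases you may have no choice but to increase it.

My strong recommendation would be to never pass the RegexOptions.Compiled option to a static helper.

For example: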

// WARNING: bad code
Regex.IsMatch("10000", @"\d+", RegexOptions.Compiled);

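The reason is that you heavily risk a miss on the LRU cache, which will trigger a super expensive compile. Additionally, you have no idea what the libraries you depend on are doing, so you have little ability to control or predict the best possible size of the cache.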
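One way to follow that advice, sketched under the assumption that the pattern is known up front: keep the compiled Regex in a static readonly field so the compile happens once and never depends on the shared cache. NumberCheck and ContainsDigits are hypothetical names used only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NumberCheck
{
    // Compiled once, when the type is first used; never evicted by other code's patterns.
    private static readonly Regex Digits = new Regex(@"\d+", RegexOptions.Compiled);

    public static bool ContainsDigits(string input)
    {
        return Digits.IsMatch(input);
    }

    public static void Main()
    {
        Console.WriteLine(ContainsDigits("10000"));
        Console.WriteLine(ContainsDigits("two"));
    }
}
```

For patterns that are used only occasionally, the static helper without RegexOptions.Compiled remains a reasonable choice, since a cache miss then only costs an interpreted construction.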

See also: BCL team blog

Note: this is relevant for .NET 2.0 and .NET 4.0. There are some expected changes in 4.5 that may cause this to be revised.

Answer 2

This entry in the BCL Team Blog gives a nice overview: "Regular Expression performance".

In short, there are three types of regex (each executing faster than the previous one):

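1) Interpreted
   Fast to create on the fly, slow to execute.

2) Compiled (the one you seem to be asking about)
   Slower to create on the fly, fast to execute (good for execution in loops).

3) Pre-compiled
   Created at compile time of your app (no run-time creation penalty), fast to execute.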

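So, if you intend to execute the regex only once, or in a non-performance-critical section of your app (e.g. user input validation), you are fine with option 1.

If you intend to run the regex in a loop (e.g. line-by-line parsing of a file), you should go with option 2.

If you have many regexes that will never change for your app and are used intensively, you could go with option 3.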
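For option 3, the .NET Framework exposes Regex.CompileToAssembly, which writes the compiled regexes into a separate DLL at build time. The sketch below assumes the classic .NET Framework; on .NET Core and later runtimes this call is not supported and throws PlatformNotSupportedException, which the example catches. The names DigitsRegex, Precompiled and MyRegexes are placeholders chosen for illustration.

```csharp
using System;
using System.Reflection;
using System.Text.RegularExpressions;

public static class PrecompileDemo
{
    // Describes one regex that should end up as a public class Precompiled.DigitsRegex.
    public static RegexCompilationInfo DescribeDigitsRegex()
    {
        return new RegexCompilationInfo(@"^\d+$", RegexOptions.None, "DigitsRegex", "Precompiled", true);
    }

    public static void Main()
    {
        RegexCompilationInfo info = DescribeDigitsRegex();
        try
        {
            // Writes MyRegexes.dll next to the executable (on .NET Framework only).
            Regex.CompileToAssembly(new[] { info }, new AssemblyName("MyRegexes"));
            Console.WriteLine("Wrote MyRegexes.dll");
        }
        catch (PlatformNotSupportedException)
        {
            Console.WriteLine("CompileToAssembly is not supported on this runtime.");
        }
    }
}
```

An application would then reference the generated assembly and instantiate the generated class like any other Regex subclass, without paying the compile cost at run time.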

Answer 3

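It should be noted that the performance of regular expressions since .NET 2.0 has been improved with an MRU cache of uncompiled regular expressions. The Regex library code no longer reinterprets the same uncompiled regular expression every time.

So there is potentially a bigger performance penalty with a regular expression that is compiled on the fly. In addition to slower loading times, the system also uses more memory to compile the regular expression to opcodes.

Essentially, the current advice is either to not compile a regular expression, or to compile it in advance into a separate assembly.

Reference: BCL Team Blog, "Regular Expression performance" [David Gutierrez]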

Answer 4

1）Base Class Library Team on compiled regex

2）Coding Horror, referencing #1 with some good points on the tradeoffs

Answer 5

Hopefully the code below helps you understand the concept of Python's re.compile function:

import re

x="""101 COM    Computers
205 MAT   Mathematics
189 ENG   English
222 SCI Science
333 TA  Tamil
5555 KA  Kannada
6666  TL  Telugu
777777 FR French
"""

# compile the regular expression; a compiled regex can be passed to any re function
find_subject_code = re.compile(r"\d+", re.M)
# using the compiled regex, way 1: call the method on the compiled object
out = find_subject_code.findall(x)
print(out)
# using the compiled regex, way 2: pass the compiled object to the module-level function
out = re.findall(find_subject_code, x)
print(out)

#few more eg:
#find subject name
find_subjectnames=re.compile(r"(\w+$)",re.M)
out=find_subjectnames.findall(x)
print(out)


#find subject SHORT name
find_subject_short_names=re.compile("[A-Z]{2,3}",re.M) 
out=find_subject_short_names.findall(x)
print(out)

How does RegexOptions.Compiled work?
