Question

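I came across this article:

Performance: Compiled vs. Interpreted Regular Expressions. I modified the sample code to compile 1000 Regex and then run each one 500 times to take advantage of precompilation, but even in that case the interpreted RegExes run 4 times faster!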

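~~This means the RegexOptions.Compiled option is completely useless, and actually even worse, it's slower!~~ The big difference was due to JIT. After accounting for the JIT cost of the compiled regexes, the code below still runs a little slower, which doesn't make sense to me, but @Jim in the answers provided a much cleaner version which works as expected.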

Can anyone explain why this is the case?

Code, taken and modified from the blog post:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

namespace RegExTester
{
    class Program
    {
        static void Main(string[] args)
        {
            DateTime startTime = DateTime.Now;

            for (int i = 0; i < 1000; i++)
            {
                CheckForMatches("some random text with email address, address@domain200.com" + i.ToString());    
            }


            double msTaken = DateTime.Now.Subtract(startTime).TotalMilliseconds;
            Console.WriteLine("Full Run: " + msTaken);


            startTime = DateTime.Now;

            for (int i = 0; i < 1000; i++)
            {
                CheckForMatches("some random text with email address, address@domain200.com" + i.ToString());
            }


            msTaken = DateTime.Now.Subtract(startTime).TotalMilliseconds;
            Console.WriteLine("Full Run: " + msTaken);

            Console.ReadLine();

        }


        private static List<Regex> _expressions;
        private static object _SyncRoot = new object();

        private static List<Regex> GetExpressions()
        {
            if (_expressions != null)
                return _expressions;

            lock (_SyncRoot)
            {
                if (_expressions == null)
                {
                    DateTime startTime = DateTime.Now;

                    List<Regex> tempExpressions = new List<Regex>();
                    string regExPattern =
                        @"^[a-zA-Z0-9]+[a-zA-Z0-9._%-]*@{0}$";

                    for (int i = 0; i < 2000; i++)
                    {
                        tempExpressions.Add(new Regex(
                            string.Format(regExPattern,
                            Regex.Escape("domain" + i.ToString() + "." +
                            (i % 3 == 0 ? ".com" : ".net"))),
                            RegexOptions.IgnoreCase));//  | RegexOptions.Compiled
                    }

                    _expressions = new List<Regex>(tempExpressions);
                    DateTime endTime = DateTime.Now;
                    double msTaken = endTime.Subtract(startTime).TotalMilliseconds;
                    Console.WriteLine("Init:" + msTaken);
                }
            }

            return _expressions;
        }

        static  List<Regex> expressions = GetExpressions();

        private static void CheckForMatches(string text)
        {

            DateTime startTime = DateTime.Now;


                foreach (Regex e in expressions)
                {
                    bool isMatch = e.IsMatch(text);
                }


            DateTime endTime = DateTime.Now;
            //double msTaken = endTime.Subtract(startTime).TotalMilliseconds;
            //Console.WriteLine("Run: " + msTaken);

        }
    }
}

Answer 1

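Compiled regular expressions match faster when used as intended. As others have pointed out, the idea is to compile them once and use them many times. The construction and initialization time is amortized over those many runs.

I created a much simpler test that shows that compiled regular expressions are unquestionably faster than non-compiled ones.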

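    // Members of a Program class; needs: using System; using System.Diagnostics; using System.Text.RegularExpressions;
    const int NumIterations = 1000;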
    const string TestString = "some random text with email address, address@domain200.com";
    const string Pattern = "^[a-zA-Z0-9]+[a-zA-Z0-9._%-]*@domain0\\.\\.com$";
    private static Regex NormalRegex = new Regex(Pattern, RegexOptions.IgnoreCase);
    private static Regex CompiledRegex = new Regex(Pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);
    private static Regex DummyRegex = new Regex("^.$");

    static void Main(string[] args)
    {
        var DoTest = new Action<string, Regex, int>((s, r, count) =>
            {
                Console.Write("Testing {0} ... ", s);
                Stopwatch sw = Stopwatch.StartNew();
                for (int i = 0; i < count; ++i)
                {
                    bool isMatch = r.IsMatch(TestString + i.ToString());
                }
                sw.Stop();
                Console.WriteLine("{0:N0} ms", sw.ElapsedMilliseconds);
            });

        // Make sure that DoTest is JITed
        DoTest("Dummy", DummyRegex, 1);
        DoTest("Normal first time", NormalRegex, 1);
        DoTest("Normal Regex", NormalRegex, NumIterations);
        DoTest("Compiled first time", CompiledRegex, 1);
        DoTest("Compiled", CompiledRegex, NumIterations);

        Console.WriteLine();
        Console.Write("Done. Press Enter:");
        Console.ReadLine();
    }

Setting NumIterations to 500 gives me this:

Testing Dummy ... 0 ms
Testing Normal first time ... 0 ms
Testing Normal Regex ... 1 ms
Testing Compiled first time ... 13 ms
Testing Compiled ... 1 ms

With 5 million iterations, I get:

Testing Dummy ... 0 ms
Testing Normal first time ... 0 ms
Testing Normal Regex ... 17,232 ms
Testing Compiled first time ... 17 ms
Testing Compiled ... 15,299 ms

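Here you can see that the compiled regular expression is at least 10% faster than the non-compiled version.

Interestingly, if you remove RegexOptions.IgnoreCase from the regular expression, the results for 5 million iterations are even more striking: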

Testing Dummy ... 0 ms
Testing Normal first time ... 0 ms
Testing Normal Regex ... 12,869 ms
Testing Compiled first time ... 14 ms
Testing Compiled ... 8,332 ms

Here, the compiled regular expression is 35% faster than the non-compiled one.

In my opinion, the blog post you cite simply has a flawed test.
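To put the amortization argument in numbers, the sketch below estimates how many matches it takes before the one-time compilation cost pays for itself, using only the timings printed above. It treats the "Compiled first time" figure as the entire one-time cost and assumes the per-match saving is constant, both of which are simplifications. `BreakEven` is a hypothetical helper written for this illustration, not a library call.

```csharp
using System;

static class BreakEvenEstimate
{
    // Hypothetical helper: number of matches after which the compiled regex
    // has repaid its one-time cost, assuming a constant per-match saving.
    public static double BreakEven(double oneTimeCostMs, double interpretedMs,
                                   double compiledMs, long iterations)
    {
        double savingPerMatchMs = (interpretedMs - compiledMs) / iterations;
        return oneTimeCostMs / savingPerMatchMs;
    }

    static void Main()
    {
        // Figures copied from the 5-million-iteration runs above.
        Console.WriteLine("With IgnoreCase:    ~{0:N0} matches",
            BreakEven(17, 17232, 15299, 5000000));
        Console.WriteLine("Without IgnoreCase: ~{0:N0} matches",
            BreakEven(14, 12869, 8332, 5000000));
    }
}
```

With these timings the estimate comes out at roughly 44,000 matches with IgnoreCase and roughly 15,000 without, which fits the 500-iteration run showing no measurable benefit. The numbers come from a single machine, so treat them as an order of magnitude only.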

Answer 2

http://www.codinghorror.com/blog/2005/03/to-compile-or-not-to-compile.html

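Compiling only helps when you instantiate the regex once and reuse it many times. If you create a compiled regex inside the for loop, it will obviously perform worse. Can you show us your sample code?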
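As a minimal sketch of the difference between the two usages, assuming a simplified single-domain version of the pattern from the question (the pattern, class name, and method names here are illustrative, not taken from the blog post):

```csharp
using System;
using System.Text.RegularExpressions;

static class ReuseVersusRebuild
{
    // Assumed pattern: a single-domain simplification of the one in the question.
    const string Pattern = @"^[a-zA-Z0-9]+[a-zA-Z0-9._%-]*@domain200\.com$";

    // Constructed (and compiled) once, then shared by every call.
    static readonly Regex Shared =
        new Regex(Pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Intended usage: the compilation cost is paid a single time.
    public static int CountWithSharedInstance(string[] inputs)
    {
        int matches = 0;
        foreach (string input in inputs)
            if (Shared.IsMatch(input)) matches++;
        return matches;
    }

    // Usage to avoid: a new compiled Regex is constructed on every iteration,
    // so the construction cost is paid again for each input.
    public static int CountWithNewInstanceEachTime(string[] inputs)
    {
        int matches = 0;
        foreach (string input in inputs)
        {
            var regex = new Regex(Pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);
            if (regex.IsMatch(input)) matches++;
        }
        return matches;
    }

    static void Main()
    {
        string[] inputs = { "address@domain200.com", "some random text", "Other.Name@DOMAIN200.COM" };
        Console.WriteLine(CountWithSharedInstance(inputs));
        Console.WriteLine(CountWithNewInstanceEachTime(inputs));
    }
}
```

Both methods return the same count; only the amount of construction work differs.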

Answer 3

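The problem with this benchmark is that compiled regexes carry the overhead of creating a whole new assembly and loading it into the AppDomain.

The scenario compiled Regex was designed for (I believe; I didn't design them) is hundreds of regexes executed millions of times, not thousands of regexes executed thousands of times. If you're not going to execute a regex somewhere in the region of a million times, you probably won't even make up for the time it takes to JIT-compile it.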
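One way to see the construction overhead in isolation is to time nothing but the constructors. The following is a rough sketch under assumed names (`BuildMany` and the generated patterns are hypothetical); actual timings depend heavily on the runtime version and machine.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

static class ConstructionCost
{
    // Hypothetical helper: builds `count` distinct regexes with the given
    // options and returns the elapsed time in milliseconds.
    public static double BuildMany(int count, RegexOptions options)
    {
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < count; i++)
        {
            string pattern = "^[a-zA-Z0-9]+@" + Regex.Escape("domain" + i + ".com") + "$";
            var regex = new Regex(pattern, options);
            GC.KeepAlive(regex); // keep the instance from being treated as unused
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }

    static void Main()
    {
        const int Count = 200;

        // Warm up so the one-time JIT cost of the Regex machinery itself
        // is not charged to either measurement.
        BuildMany(1, RegexOptions.None);
        BuildMany(1, RegexOptions.Compiled);

        Console.WriteLine("Interpreted: {0:N1} ms", BuildMany(Count, RegexOptions.None));
        Console.WriteLine("Compiled:    {0:N1} ms", BuildMany(Count, RegexOptions.Compiled));
    }
}
```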

Answer 4

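This almost certainly indicates that your benchmark code is written incorrectly, rather than that compiled regexes are slower than interpreted ones. A lot of work went into making compiled regexes perform well.

Now that we have the code, here are some specific things that need to be updated:

- This code doesn't account for the JIT cost of the method. It should run the code once to get the JIT cost out of the way, then run it again and measure.
- Why use lock? It is completely unnecessary.
- Benchmarks should use Stopwatch, not DateTime.
- To get a good comparison between compiled and non-compiled, you should test the performance of a single compiled Regex and a single non-compiled Regex matching N times, not N regexes that each match at most once.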
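Taken together, these points suggest a harness along the following lines. This is only a sketch: `Measure`, the single-domain pattern, and the iteration count are assumptions made for illustration, and a serious benchmark would also repeat each measurement several times.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

static class BenchmarkSketch
{
    // Assumed pattern: a single-domain simplification of the one in the question.
    const string Pattern = @"^[a-zA-Z0-9]+[a-zA-Z0-9._%-]*@domain200\.com$";

    // Static readonly fields are initialized once by the runtime's type
    // initializer, so no explicit lock is needed.
    static readonly Regex Interpreted = new Regex(Pattern, RegexOptions.IgnoreCase);
    static readonly Regex Compiled =
        new Regex(Pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Hypothetical helper: performs one untimed warm-up match, then times
    // `iterations` matches of a single regex with Stopwatch.
    // Returns the match count so the loop has an observable result.
    public static int Measure(string label, Regex regex, string input, int iterations)
    {
        regex.IsMatch(input); // warm-up call, excluded from the timing

        int matches = 0;
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            if (regex.IsMatch(input)) matches++;
        sw.Stop();

        Console.WriteLine("{0}: {1:N0} ms", label, sw.ElapsedMilliseconds);
        return matches;
    }

    static void Main()
    {
        const string Input = "address@domain200.com";
        Measure("Interpreted", Interpreted, Input, 1000000);
        Measure("Compiled", Compiled, Input, 1000000);
    }
}
```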

Why is compiled RegEx performance slower than interpreted RegEx?
