Question

我开始觉得使用正则表达式会降低代码的可维护性。正则表达式的简洁性和强大功能有些恶意。 Perl将其与副作用（如默认运算符）相结合。

我习惯记录正则表达式，其中至少有一个句子给出基本意图，至少有一个匹配的例子。

因为构建了正则表达式，所以我觉得对表达式中每个元素的最大组件进行注释是绝对必要的。尽管如此，即便是我自己的正则表达式让我摸不着头脑，好像我在读克林贡一样。

你是否故意愚弄你的正则表达式？你是否将可能更短，更强大的那些分解成更简单的步骤？我放弃了嵌套正则表达式。是否存在由于可维护性问题而避免使用的正则表达式构造？

不要让这个例子让问题浮现。

如果Michael Ash下面的某些内容存在某种错误，那么你有什么可以做任何事情而不是把它全部抛弃吗？

^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$

根据请求，可以使用Ash先生的链接找到确切的目的。

匹配 01.1.02 | 11-30-2001 | 2/29/2000

非匹配 02/29/01 | 13/01/2002 | 11/00/02

Answer 1

使用Expresso给出正则表达式的分层，英语细分。

或

来自Darren Neimke的tip：

.NET允许正则表达式   用嵌入式编写的模式   通过评论   RegExOptions.IgnorePatternWhitespace   编译器选项和（？＃...）语法   嵌入在每一行内   模式字符串。

这允许伪代码   要嵌入每行的注释   并具有以下影响   可读性：

Dim re As New Regex ( _
    "(?<=       (?# Start a positive lookBEHIND assertion ) " & _
    "(#|@)      (?# Find a # or a @ symbol ) " & _
    ")          (?# End the lookBEHIND assertion ) " & _
    "(?=        (?# Start a positive lookAHEAD assertion ) " & _
    "   \w+     (?# Find at least one word character ) " & _
    ")          (?# End the lookAHEAD assertion ) " & _
    "\w+\b      (?# Match multiple word characters leading up to a word boundary)", _
    RegexOptions.Multiline Or RegexOptions.IgnoreCase Or RegexOptions.IgnoreWhitespace _
)

这是另一个.NET示例（需要RegexOptions.Multiline和RegexOptions.IgnorePatternWhitespace选项）：

static string validEmail = @"\b    # Find a word boundary
                (?<Username>       # Begin group: Username
                [a-zA-Z0-9._%+-]+  #   Characters allowed in username, 1 or more
                )                  # End group: Username
                @                  # The e-mail '@' character
                (?<Domainname>     # Begin group: Domain name
                [a-zA-Z0-9.-]+     #   Domain name(s), we include a dot so that
                                   #   mail.somewhere is also possible
                .[a-zA-Z]{2,4}     #   The top level domain can only be 4 characters
                                   #   So .info works, .telephone doesn't.
                )                  # End group: Domain name
                \b                 # Ending on a word boundary
                ";

如果您的RegEx适用于常见问题，另一种选择是将其记录下来并提交给RegExLib，在那里对其进行评级和评论。没有什么能比很多双眼......

另一个RegEx工具是The Regulator

Answer 2

我通常只是尝试将所有正则表达式调用包含在自己的函数中，并使用有意义的名称和一些基本注释。我喜欢将正则表达式视为只写语言，只能由编写它的人阅读（除非它非常简单）。我完全希望有人可能需要完全重写表达式，如果他们必须改变其意图，这可能是为了让正则表达式训练保持活力。

Answer 3

嗯，PCRE / x修饰符的整个生命目的是让你更可读地编写正则表达式，就像这个简单的例子一样：

my $expr = qr/
    [a-z]    # match a lower-case letter
    \d{3,5}  # followed by 3-5 digits
/x;

Answer 4

有些人使用RE来处理错误的事情（我正在等待关于如何使用单个RE检测有效C ++程序的第一个SO问题。）

我经常发现，如果我的RE不能超过60个字符，最好不要成为一段代码，因为这几乎总是更具可读性。

在任何情况下，我始终文档，在代码中，RE应该实现的内容，非常详细。这是因为我从痛苦的经历中知道，对于其他人（或者甚至是我，六个月后）进入并试图理解它有多难。

我不相信他们是邪恶的，虽然我相信一些使用它们的人是邪恶的（不是看着你，Michael Ash :-)。它们是一个很好的工具，但是，就像电锯一样，如果你不知道如何正确使用它们，你会把腿剪掉。

更新：实际上，我刚刚跟踪了那个怪物的链接，这是为了验证1600年到9999年之间的m / d / y格式日期。这是经典的情况完整的代码将更具可读性和可维护性。

您只需将其拆分为三个字段并检查各个值。如果我的一个仆从买了这个，我几乎认为这是一个值得终止的罪行。我当然会把它们送回去写好。

Answer 5

这是同样的正则表达式分解成易消化的部分。除了更具可读性之外，一些子正则表达式本身也很有用。更改允许的分隔符也非常容易。

#!/usr/local/ActivePerl-5.10/bin/perl

use 5.010; #only 5.10 and above
use strict;
use warnings;

my $sep         = qr{ [/.-] }x;               #allowed separators    
my $any_century = qr/ 1[6-9] | [2-9][0-9] /x; #match the century 
my $any_decade  = qr/ [0-9]{2} /x;            #match any decade or 2 digit year
my $any_year    = qr/ $any_century? $any_decade /x; #match a 2 or 4 digit year

#match the 1st through 28th for any month of any year
my $start_of_month = qr/
    (?:                         #match
        0?[1-9] |               #Jan - Sep or
        1[0-2]                  #Oct - Dec
    )
    ($sep)                      #the separator
    (?: 
        0?[1-9] |               # 1st -  9th or
        1[0-9]  |               #10th - 19th or
        2[0-8]                  #20th - 28th
    )
    \g{-1}                      #and the separator again
/x;

#match 28th - 31st for any month but Feb for any year
my $end_of_month = qr/
    (?:
        (?: 0?[13578] | 1[02] ) #match Jan, Mar, May, Jul, Aug, Oct, Dec
        ($sep)                  #the separator
        31                      #the 31st
        \g{-1}                  #and the separator again
        |                       #or
        (?: 0?[13-9] | 1[0-2] ) #match all months but Feb
        ($sep)                  #the separator
        (?:29|30)               #the 29th or the 30th
        \g{-1}                  #and the separator again
    )
/x;

#match any non-leap year date and the first part of Feb in leap years
my $non_leap_year = qr/ (?: $start_of_month | $end_of_month ) $any_year/x;

#match 29th of Feb in leap years
#BUG: 00 is treated as a non leap year
#even though 2000, 2400, etc are leap years
my $feb_in_leap = qr/
    0?2                         #match Feb
    ($sep)                      #the separtor
    29                          #the 29th
    \g{-1}                      #the separator again
    (?:
        $any_century?           #any century
        (?:                     #and decades divisible by 4 but not 100
            0[48]       | 
            [2468][048] |
            [13579][26]
        )
        |
        (?:                     #or match centuries that are divisible by 4
            16          | 
            [2468][048] |
            [3579][26]
        )
        00                      
    )
/x;

my $any_date  = qr/$non_leap_year|$feb_in_leap/;
my $only_date = qr/^$any_date$/;

say "test against garbage";
for my $date (qw(022900 foo 1/1/1)) {
    say "\t$date ", $date ~~ $only_date ? "matched" : "didn't match";
}
say '';

#comprehensive test

my @code = qw/good unmatch month day year leap/;
for my $sep (qw( / - . )) {
    say "testing $sep";
    my $i  = 0;
    for my $y ("00" .. "99", 1600 .. 9999) {
        say "\t", int $i/8500*100, "% done" if $i++ and not $i % 850;
        for my $m ("00" .. "09", 0 .. 13) {
            for my $d ("00" .. "09", 1 .. 31) {
                my $date = join $sep, $m, $d, $y;
                my $re   = $date ~~ $only_date || 0;
                my $code = not_valid($date);
                unless ($re == !$code) {
                    die "error $date re $re code $code[$code]\n"
                }
            }
        }
    }
}

sub not_valid {
    state $end = [undef, 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31];
    my $date      = shift;
    my ($m,$d,$y) = $date =~ m{([0-9]+)[-./]([0-9]+)[-./]([0-9]+)};
    return 1 unless defined $m; #if $m is set, the rest will be too
    #components are in roughly the right ranges
    return 2 unless $m >= 1 and $m <= 12;
    return 3 unless $d >= 1 and $d <= $end->[$m];
    return 4 unless ($y >= 0 and $y <= 99) or ($y >= 1600 and $y <= 9999);
    #handle the non leap year case
    return 5 if $m == 2 and $d == 29 and not leap_year($y);

    return 0;
}

sub leap_year {
    my $y    = shift;
    $y = "19$y" if $y < 1600;
    return 1 if 0 == $y % 4 and 0 != $y % 100 or 0 == $y % 400;
    return 0;
}

Answer 6

我学会了避免除最简单的正则表达式之外的所有内容。我更喜欢其他模型，如Icon的字符串扫描或Haskell的解析组合器。在这两种模型中，您都可以编写与内置字符串ops具有相同特权和状态的用户定义代码。如果我在Perl中编程，我可能会在Perl中安装一些解析组合器---我已经为其他语言完成了它。

一个非常好的选择是使用解析表达式语法，因为Roberto Ierusalimschy使用了他的LPEG包，但与解析器组合器不同，这是你不能在下午鞭打的东西。但如果有人已经为你的平台做过PEG，那么它就是正则表达式的一个很好的选择。

Answer 7

我发现一个很好的方法是简单地将匹配过程分解为几个阶段。它可能没有那么快的执行，但你还有额外的好处，也能够在更精细的谷物水平告诉为什么没有发生匹配。

另一种途径是使用LL或LR解析。有些语言甚至可能用perl的非fsm扩展名表达为正则表达式。

Answer 8

哇，这太丑了。看起来它应该有效，模拟一个不可避免的错误处理00作为两位数的年份（它应该是闰年的四分之一时间，但没有世纪，你无法知道它应该是什么）。有很多冗余应该被分解为子正则表达式，我会为三个主要情况创建三个子正则表达式（这是我今晚的下一个项目）。我还使用了一个不同的字符作为分隔符，以避免必须转义正斜杠，将单个字符的变换更改为字符类（很高兴让我们避免逃避句点），并将\d更改为[0-9]因为前者匹配Perl 5.8和5.10中的任何数字字符（包括U+1815 MONGOLIAN DIGIT FIVE：᠕）。

警告，未经测试的代码：

#!/usr/bin/perl

use strict;
use warnings;

my $match_date = qr{
    #match 29th - 31st of all months but 2 for the years 1600 - 9999
    #with optionally leaving off the first two digits of the year
    ^
    (?: 
        #match the 31st of 1, 3, 5, 7, 8, 10, and 12
        (?: (?: 0? [13578] | 1[02] ) ([/-.]) 31) \1
        |
        #or match the 29th and 30th of all months but 2
        (?: (?: 0? [13-9] | 1[0-2] ) ([/-.]) (?:29|30) \2)
    )
    (?:
        (?:                      #optionally match the century
            1[6-9] |         #16 - 19
            [2-9][0-9]       #20 - 99
        )?
        [0-9]{2}                 #match the decade
    )
    $
    |
    #or match 29 for 2 for leap years
    ^
    (?:
    #FIXME: 00 is treated as a non leap year 
    #even though 2000, 2400, etc are leap years
        0?2                      #month 2
        ([/-.])                  #separtor
        29                       #29th
        \3                       #separator from before
        (?:                      #leap years
            (?:
                #match rule 1 (div 4) minus rule 2 (div 100)
                (?: #match any century
                    1[6-9] |
                    [2-9][0-9]
                )?
                (?: #match decades divisible by 4 but not 100
                    0[48]       | 
                    [2468][048] |
                    [13579][26]
                )
                |
                #or match rule 3 (div 400)
                (?:
                    (?: #match centuries that are divisible by 4
                        16          | 
                        [2468][048] |
                        [3579][26]
                    )
                    00
                )
            )
        )
    )
    $
    |
    #or match 1st through 28th for all months between 1600 and 9999
    ^
    (?: (?: 0?[1-9]) | (?:1[0-2] ) ) #all months
    ([/-.])                          #separator
    (?: 
        0?[1-9] |                #1st -  9th  or
        1[0-9]  |                #10th - 19th or
        2[0-8]                   #20th - 28th
    )
    \4                               #seprator from before
    (?:                              
        (?:                      #optionally match the century
            1[6-9] |         #16 - 19
            [2-9][0-9]       #20 - 99
        )?
        [0-9]{2}                 #match the decade
    )
    $
}x;

Answer 9

有些人在面对的时候问题，想想“我知道，我会用正则表达式。“现在他们有两个问题。 - Jamie Zawinski in comp.lang.emacs。

保持正则表达式尽可能简单（KISS）。在你的日期示例中，我可能会为每个日期类型使用一个正则表达式。

甚至更好，将其替换为库（即日期解析库）。

我还会采取措施确保输入源有一些限制（即只有一种类型的日期字符串，最好是ISO-8601）。

此外，

当时有一件事（除了提取值之外）
如果使用正确（如简化表达式，从而减少维护），高级构造就可以了。

编辑：

“先进的结构导致维护问题“

我原来的观点是，如果正确使用，它应该导致更简单的表达式，而不是更难的表达式。更简单的表达式应该减少维护。

我已经更新了上面的文字并说了多少。

我想指出，正则表达式几乎不符合高级构造本身的要求。不熟悉某个构造并不会使它成为一个先进的构造，而只是一个不熟悉的构造。这并没有改变正则表达式强大，紧凑和 - 如果使用得当 - 优雅的事实。就像手术刀一样，它完全掌握在使用手术刀的人手中。

Answer 10

我仍然可以使用它。我只使用Regulator。它允许你做的一件事是保存正则表达式以及它的测试数据。

当然，我也可以添加评论。

这就是Expresso制作的内容。我以前从未使用它，但现在，Regulator失业了：

//  using System.Text.RegularExpressions;

/// 
///  Regular expression built for C# on: Thu, Apr 2, 2009, 12:51:56 AM
///  Using Expresso Version: 3.0.3276, http://www.ultrapico.com
///  
///  A description of the regular expression:
///  
///  Select from 3 alternatives
///      ^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$
///          Beginning of line or string
///          Match expression but don't capture it. [(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2)]
///              Select from 2 alternatives
///                  (?:(?:0?[13578]|1[02])(\/|-|\.)31)\1
///                      Match expression but don't capture it. [(?:0?[13578]|1[02])(\/|-|\.)31]
///                          (?:0?[13578]|1[02])(\/|-|\.)31
///                              Match expression but don't capture it. [0?[13578]|1[02]]
///                                  Select from 2 alternatives
///                                      0?[13578]
///                                          0, zero or one repetitions
///                                          Any character in this class: [13578]
///                                      1[02]
///                                          1
///                                          Any character in this class: [02]
///                              [1]: A numbered capture group. [\/|-|\.]
///                                  Select from 3 alternatives
///                                      Literal /
///                                      -
///                                      Literal .
///                              31
///                      Backreference to capture number: 1
///                  (?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2)
///                      Return
///                      New line
///                      Match expression but don't capture it. [(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2]
///                          (?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2
///                              Match expression but don't capture it. [0?[13-9]|1[0-2]]
///                                  Select from 2 alternatives
///                                      0?[13-9]
///                                          0, zero or one repetitions
///                                          Any character in this class: [13-9]
///                                      1[0-2]
///                                          1
///                                          Any character in this class: [0-2]
///                              [2]: A numbered capture group. [\/|-|\.]
///                                  Select from 3 alternatives
///                                      Literal /
///                                      -
///                                      Literal .
///                              Match expression but don't capture it. [29|30]
///                                  Select from 2 alternatives
///                                      29
///                                          29
///                                      30
///                                          30
///                              Backreference to capture number: 2
///          Return
///          New line
///          Match expression but don't capture it. [(?:1[6-9]|[2-9]\d)?\d{2}]
///              (?:1[6-9]|[2-9]\d)?\d{2}
///                  Match expression but don't capture it. [1[6-9]|[2-9]\d], zero or one repetitions
///                      Select from 2 alternatives
///                          1[6-9]
///                              1
///                              Any character in this class: [6-9]
///                          [2-9]\d
///                              Any character in this class: [2-9]
///                              Any digit
///                  Any digit, exactly 2 repetitions
///          End of line or string
///      ^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$
///          Beginning of line or string
///          Match expression but don't capture it. [0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00)))]
///              0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00)))
///                  0, zero or one repetitions2
///                  [3]: A numbered capture group. [\/|-|\.]
///                      Select from 3 alternatives
///                          Literal /
///                          -
///                          Literal .
///                  29
///                  Backreference to capture number: 3
///                  Match expression but don't capture it. [(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))]
///                      Match expression but don't capture it. [(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00)]
///                          Select from 2 alternatives
///                              (?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])
///                                  Match expression but don't capture it. [1[6-9]|[2-9]\d], zero or one repetitions
///                                      Select from 2 alternatives
///                                          1[6-9]
///                                              1
///                                              Any character in this class: [6-9]
///                                          [2-9]\d
///                                              Any character in this class: [2-9]
///                                              Any digit
///                                  Match expression but don't capture it. [0[48]|[2468][048]|[13579][26]]
///                                      Select from 3 alternatives
///                                          0[48]
///                                              0
///                                              Any character in this class: [48]
///                                          [2468][048]
///                                              Any character in this class: [2468]
///                                              Any character in this class: [048]
///                                          [13579][26]
///                                              Any character in this class: [13579]
///                                              Any character in this class: [26]
///                              (?:(?:16|[2468][048]|[3579][26])00)
///                                  Return
///                                  New line
///                                  Match expression but don't capture it. [(?:16|[2468][048]|[3579][26])00]
///                                      (?:16|[2468][048]|[3579][26])00
///                                          Match expression but don't capture it. [16|[2468][048]|[3579][26]]
///                                              Select from 3 alternatives
///                                                  16
///                                                      16
///                                                  [2468][048]
///                                                      Any character in this class: [2468]
///                                                      Any character in this class: [048]
///                                                  [3579][26]
///                                                      Any character in this class: [3579]
///                                                      Any character in this class: [26]
///                                          00
///          End of line or string
///      ^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
///          Beginning of line or string
///          Match expression but don't capture it. [(?:0?[1-9])|(?:1[0-2])]
///              Select from 2 alternatives
///                  Match expression but don't capture it. [0?[1-9]]
///                      0?[1-9]
///                          0, zero or one repetitions
///                          Any character in this class: [1-9]
///                  Match expression but don't capture it. [1[0-2]]
///                      1[0-2]
///                          1
///                          Any character in this class: [0-2]
///          Return
///          New line
///          [4]: A numbered capture group. [\/|-|\.]
///              Select from 3 alternatives
///                  Literal /
///                  -
///                  Literal .
///          Match expression but don't capture it. [0?[1-9]|1\d|2[0-8]]
///              Select from 3 alternatives
///                  0?[1-9]
///                      0, zero or one repetitions
///                      Any character in this class: [1-9]
///                  1\d
///                      1
///                      Any digit
///                  2[0-8]
///                      2
///                      Any character in this class: [0-8]
///          Backreference to capture number: 4
///          Match expression but don't capture it. [(?:1[6-9]|[2-9]\d)?\d{2}]
///              (?:1[6-9]|[2-9]\d)?\d{2}
///                  Match expression but don't capture it. [1[6-9]|[2-9]\d], zero or one repetitions
///                      Select from 2 alternatives
///                          1[6-9]
///                              1
///                              Any character in this class: [6-9]
///                          [2-9]\d
///                              Any character in this class: [2-9]
///                              Any digit
///                  Any digit, exactly 2 repetitions
///          End of line or string
///  
///
/// 
public static Regex regex = new Regex(
      "^(?:(?:(?:0?[13578]|1[02])(\\/|-|\\.)31)\\1|\r\n(?:(?:0?[13-9]"+
      "|1[0-2])(\\/|-|\\.)(?:29|30)\\2))\r\n(?:(?:1[6-9]|[2-9]\\d)?\\d"+
      "{2})$|^(?:0?2(\\/|-|\\.)29\\3(?:(?:(?:1[6-9]|[2-9]\\d)?(?:0["+
      "48]|[2468][048]|[13579][26])|\r\n(?:(?:16|[2468][048]|[3579][2"+
      "6])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))\r\n(\\/|-|\\.)(?:0?[1-9"+
      "]|1\\d|2[0-8])\\4(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$",
    RegexOptions.CultureInvariant
    | RegexOptions.Compiled
    );

Answer 11

我认为维持正则表达式的答案与注释或正则表达式构造不同。

如果我的任务是调试您提供的示例，我会坐在正面的调试工具（如Regex Coach）的前面，并逐步处理它必须处理的数据的正则表达式。

Answer 12

我发布了question recently about commenting regexes with embedded comments有一些有用的答案，特别是来自@mikej的答案

请参阅Martin Fowler的帖子 ComposedRegex提供了更多的想法提高正则表达式的可读性。在总结一下，他提倡打破一个复杂的regexp成较小的部分这可以给出有意义的变量名。 e.g。

Answer 13

我不希望正则表达式是可读的，所以我只是将它们保留原样，并在需要时重写。

如何编写更易于维护的正则表达式？

13 个答案: