How do I write more maintainable regular expressions?

Date: 2009-04-02 04:05:52

Tags: regex maintenance readability

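I have started to feel that using regular expressions decreases code maintainability. There is something evil about the terseness and power of regular expressions. Perl compounds this with side effects such as default operators.

I do have a habit of documenting regular expressions with at least one sentence giving the basic intent and at least one example of what would match.

Because regular expressions are built up from parts, I feel it is an absolute necessity to comment on the largest components of each element in the expression. Despite this, even my own regular expressions have me scratching my head as though I were reading Klingon.

Do you intentionally dumb down your regular expressions? Do you decompose possibly shorter and more powerful ones into simpler steps? I have given up on nesting regular expressions. Are there regular expression constructs that you avoid because of maintainability issues?

Do not let this example cloud the question.

If the following regex by Michael Ash had some sort of bug in it, would you have any prospect of doing anything other than throwing it away entirely?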

^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$

Per request, the exact purpose can be found via Mr. Ash's link.

Matches: 01.1.02 | 11-30-2001 | 2/29/2000

Non-matches: 02/29/01 | 13/01/2002 | 11/00/02

13 answers:

Answer 0 (score: 32)

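Use Expresso, which gives a hierarchical, English breakdown of a regex.

A tip from Darren Neimke:

  .NET allows regular expression patterns to be authored with embedded comments via the RegexOptions.IgnorePatternWhitespace compiler option and the (?#...) syntax embedded within each line of the pattern string.

  This allows for pseudo-code-like comments to be embedded in each line and has the following effect on readability: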

Dim re As New Regex ( _
    "(?<=       (?# Start a positive lookBEHIND assertion ) " & _
    "(\#|@)     (?# Find a # or a @ symbol ) " & _
    ")          (?# End the lookBEHIND assertion ) " & _
    "(?=        (?# Start a positive lookAHEAD assertion ) " & _
    "   \w+     (?# Find at least one word character ) " & _
    ")          (?# End the lookAHEAD assertion ) " & _
    "\w+\b      (?# Match multiple word characters leading up to a word boundary)", _
    RegexOptions.Multiline Or RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace _
)

Here is another .NET example (it requires the RegexOptions.Multiline and RegexOptions.IgnorePatternWhitespace options):

static string validEmail = @"\b    # Find a word boundary
                (?<Username>       # Begin group: Username
                [a-zA-Z0-9._%+-]+  #   Characters allowed in username, 1 or more
                )                  # End group: Username
                @                  # The e-mail '@' character
                (?<Domainname>     # Begin group: Domain name
                [a-zA-Z0-9.-]+     #   Domain name(s), we include a dot so that
                                   #   mail.somewhere is also possible
                \.[a-zA-Z]{2,4}    #   The top level domain can only be 4 characters
                                   #   So .info works, .telephone doesn't.
                )                  # End group: Domain name
                \b                 # Ending on a word boundary
                ";

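If your regex applies to a common problem, another option is to document it and submit it to RegExLib, where it will be rated and commented on. Nothing beats many pairs of eyes...

Another regex tool is The Regulator.

Answer 1 (score: 19)

I usually just try to wrap all my regular expression calls inside their own function, with a meaningful name and some basic comments. I like to think of regular expressions as a write-only language, readable only by the person who wrote them (unless they are really simple). I fully expect that someone would probably need to rewrite the expression completely if they had to change its intent, and that is probably for the best, since it keeps the regular expression training alive.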
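As a rough sketch of that habit in Python (the function name and the ZIP-code pattern are made up for illustration and are not from the answer), the regex lives in one place behind a descriptive name, with its intent and examples in the docstring:

```python
import re

# Hypothetical example pattern: a US-style ZIP code, chosen only for illustration.
_ZIP_CODE_RE = re.compile(r"[0-9]{5}(?:-[0-9]{4})?")


def is_zip_code(text: str) -> bool:
    """Return True if text is a 5-digit ZIP code, optionally followed by -NNNN.

    Matches:     "12345", "12345-6789"
    Non-matches: "1234", "12345-678", "12345 6789"
    """
    # fullmatch requires the whole string to match, so no ^ or $ is needed.
    return _ZIP_CODE_RE.fullmatch(text) is not None
```

Callers only ever see `is_zip_code(...)`, so the pattern can be rewritten from scratch later without touching them.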

Answer 2 (score: 17)

Well, the Perl/PCRE /x modifier's entire purpose in life is to let you write regexes more readably, as in this trivial example:

my $expr = qr/
    [a-z]    # match a lower-case letter
    \d{3,5}  # followed by 3-5 digits
/x;
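For comparison, Python has the same facility through the `re.VERBOSE` flag. This is just a sketch of the equivalent pattern; the name `EXPR` is arbitrary:

```python
import re

# Same pattern as the Perl example above, written with Python's verbose flag.
# Whitespace in the pattern is ignored and "#" starts a comment.
EXPR = re.compile(
    r"""
    [a-z]    # match a lower-case letter
    \d{3,5}  # followed by 3-5 digits
    """,
    re.VERBOSE,
)
```

One thing to watch for in verbose mode: a literal space or `#` in the pattern has to be escaped or placed inside a character class, otherwise it is treated as layout or as the start of a comment.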

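Answer 3 (score: 8)

Some people use REs for the wrong things (I'm waiting for the first SO question on how to detect a valid C++ program using a single RE).

I usually find that if I can't fit my RE within 60 characters, it is better off being a piece of code, since that will almost always be more readable.

In any case, I always document, in the code, what the RE is supposed to achieve, in great detail. This is because I know from bitter experience how hard it is for someone else (or even me, six months later) to come in and try to understand it.

I don't believe they're evil, although I do believe some people who use them are evil (not looking at you, Michael Ash :-). They're a great tool but, like a chainsaw, you'll cut your legs off if you don't know how to use them properly.

UPDATE: Actually, I've just followed the link to that monstrosity, and it's meant to validate m/d/y format dates between the years 1600 and 9999. That's a classic case where full-blown code would be more readable and maintainable.

You just split it into three fields and check the individual values. I'd almost consider it an offense worthy of termination if one of my minions brought this to me. I'd certainly send them back to write it properly.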
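A minimal Python sketch of that split-and-check approach might look like this. The function name is made up, and treating two-digit years as 19xx is my assumption (it happens to agree with the regex, which never accepts 2/29/00):

```python
import calendar


def is_valid_mdy(text: str) -> bool:
    """Validate an m/d/y date by splitting it into three fields."""
    # The separator is whichever of "/", "-" or "." appears first.
    sep = next((ch for ch in text if ch in "/-."), None)
    if sep is None:
        return False
    fields = text.split(sep)
    # Exactly three all-digit fields; a mixed separator leaves a non-digit behind.
    if len(fields) != 3 or not all(f.isascii() and f.isdigit() for f in fields):
        return False
    month_s, day_s, year_s = fields
    if len(month_s) > 2 or len(day_s) > 2 or len(year_s) not in (2, 4):
        return False
    month, day, year = int(month_s), int(day_s), int(year_s)
    if len(year_s) == 2:
        year += 1900  # assumption: two-digit years mean 19xx
    if not 1600 <= year <= 9999:
        return False
    if not 1 <= month <= 12:
        return False
    # monthrange returns (weekday of first day, number of days in the month).
    return 1 <= day <= calendar.monthrange(year, month)[1]
```

With this version the leap-year rule is delegated to the standard library instead of being encoded in character classes.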

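Answer 4 (score: 5)

Here is the same regex broken down into digestible pieces. In addition to being more readable, some of the sub-regexes could be useful on their own. It is also significantly easier to change the allowed separators.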

#!/usr/local/ActivePerl-5.10/bin/perl

use 5.010; #only 5.10 and above
use strict;
use warnings;

my $sep         = qr{ [/.-] }x;               #allowed separators    
my $any_century = qr/ 1[6-9] | [2-9][0-9] /x; #match the century 
my $any_decade  = qr/ [0-9]{2} /x;            #match any decade or 2 digit year
my $any_year    = qr/ $any_century? $any_decade /x; #match a 2 or 4 digit year

#match the 1st through 28th for any month of any year
my $start_of_month = qr/
    (?:                         #match
        0?[1-9] |               #Jan - Sep or
        1[0-2]                  #Oct - Dec
    )
    ($sep)                      #the separator
    (?: 
        0?[1-9] |               # 1st -  9th or
        1[0-9]  |               #10th - 19th or
        2[0-8]                  #20th - 28th
    )
    \g{-1}                      #and the separator again
/x;

#match 28th - 31st for any month but Feb for any year
my $end_of_month = qr/
    (?:
        (?: 0?[13578] | 1[02] ) #match Jan, Mar, May, Jul, Aug, Oct, Dec
        ($sep)                  #the separator
        31                      #the 31st
        \g{-1}                  #and the separator again
        |                       #or
        (?: 0?[13-9] | 1[0-2] ) #match all months but Feb
        ($sep)                  #the separator
        (?:29|30)               #the 29th or the 30th
        \g{-1}                  #and the separator again
    )
/x;

#match any non-leap year date and the first part of Feb in leap years
my $non_leap_year = qr/ (?: $start_of_month | $end_of_month ) $any_year/x;

#match 29th of Feb in leap years
#BUG: 00 is treated as a non leap year
#even though 2000, 2400, etc are leap years
my $feb_in_leap = qr/
    0?2                         #match Feb
    ($sep)                      #the separator
    29                          #the 29th
    \g{-1}                      #the separator again
    (?:
        $any_century?           #any century
        (?:                     #and decades divisible by 4 but not 100
            0[48]       | 
            [2468][048] |
            [13579][26]
        )
        |
        (?:                     #or match centuries that are divisible by 4
            16          | 
            [2468][048] |
            [3579][26]
        )
        00                      
    )
/x;

my $any_date  = qr/$non_leap_year|$feb_in_leap/;
my $only_date = qr/^$any_date$/;

say "test against garbage";
for my $date (qw(022900 foo 1/1/1)) {
    say "\t$date ", $date =~ $only_date ? "matched" : "didn't match";
}
say '';

#comprehensive test

my @code = qw/good unmatch month day year leap/;
for my $sep (qw( / - . )) {
    say "testing $sep";
    my $i  = 0;
    for my $y ("00" .. "99", 1600 .. 9999) {
        say "\t", int $i/8500*100, "% done" if $i++ and not $i % 850;
        for my $m ("00" .. "09", 0 .. 13) {
            for my $d ("00" .. "09", 1 .. 31) {
                my $date = join $sep, $m, $d, $y;
                my $re   = ($date =~ $only_date) ? 1 : 0;
                my $code = not_valid($date);
                unless ($re == !$code) {
                    die "error $date re $re code $code[$code]\n"
                }
            }
        }
    }
}

sub not_valid {
    state $end = [undef, 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31];
    my $date      = shift;
    my ($m,$d,$y) = $date =~ m{([0-9]+)[-./]([0-9]+)[-./]([0-9]+)};
    return 1 unless defined $m; #if $m is set, the rest will be too
    #components are in roughly the right ranges
    return 2 unless $m >= 1 and $m <= 12;
    return 3 unless $d >= 1 and $d <= $end->[$m];
    return 4 unless ($y >= 0 and $y <= 99) or ($y >= 1600 and $y <= 9999);
    #handle the non leap year case
    return 5 if $m == 2 and $d == 29 and not leap_year($y);

    return 0;
}

sub leap_year {
    my $y    = shift;
    $y = "19$y" if $y < 1600;
    return 1 if 0 == $y % 4 and 0 != $y % 100 or 0 == $y % 400;
    return 0;
}

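Answer 5 (score: 4)

I have learned to avoid all but the simplest regexps. I far prefer other models, such as Icon's string scanning or Haskell's parsing combinators. In both of these models you can write user-defined code that has the same privileges and status as the built-in string ops. If I were programming in Perl, I would probably rig up some parsing combinators in Perl; I've done it for other languages.

A very nice alternative is to use Parsing Expression Grammars, as Roberto Ierusalimschy has done with his LPEG package, but unlike parser combinators this is not something you can whip up in an afternoon. If somebody has already done PEGs for your platform, though, it's a very nice alternative to regular expressions.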
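To give a feel for what "rigging up some parsing combinators" can mean, here is a small sketch in Python. Everything in it (`one_of`, `many1`, `seq`, `fmap`, the parser signature) is my own naming for illustration, not taken from any library or from the answer:

```python
from typing import Any, Callable, Optional, Tuple

# A parser takes (text, position) and returns (value, new_position),
# or None when it fails to match at that position.
Parser = Callable[[str, int], Optional[Tuple[Any, int]]]


def one_of(chars: str) -> Parser:
    """Match a single character drawn from chars."""
    def parse(text: str, pos: int):
        if pos < len(text) and text[pos] in chars:
            return text[pos], pos + 1
        return None
    return parse


def many1(p: Parser) -> Parser:
    """Match p one or more times and collect the results in a list."""
    def parse(text: str, pos: int):
        values = []
        result = p(text, pos)
        while result is not None:
            value, pos = result
            values.append(value)
            result = p(text, pos)
        return (values, pos) if values else None
    return parse


def seq(*parsers: Parser) -> Parser:
    """Match each parser in order and collect the results in a list."""
    def parse(text: str, pos: int):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse


def fmap(f: Callable[[Any], Any], p: Parser) -> Parser:
    """Apply f to the value produced by p."""
    def parse(text: str, pos: int):
        result = p(text, pos)
        if result is None:
            return None
        value, pos = result
        return f(value), pos
    return parse


# Ordinary user code builds bigger parsers out of smaller, named ones.
number = fmap(lambda digits: int("".join(digits)), many1(one_of("0123456789")))
separator = one_of("/-.")
mdy = seq(number, separator, number, separator, number)

print(mdy("2/29/2000", 0))  # ([2, '/', 29, '/', 2000], 9)
```

The point is that `number`, `separator` and `mdy` are plain values in the host language, so they can be named, tested and reused like any other function.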

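Answer 6 (score: 4)

I've found that a nice method is simply to break the matching process into several phases. It probably doesn't execute as fast, but you get the added bonus of being able to tell, at a finer-grained level, why the match isn't occurring.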
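Applied to the date example from the question, a phased version could look roughly like the sketch below. The function name and the reason strings are hypothetical, and two-digit years are assumed to mean 19xx:

```python
import calendar
import re

# Phase 1 only checks the overall shape. The backreference \2 forces both
# separators to be the same character.
_SHAPE_RE = re.compile(r"([0-9]{1,2})([/.-])([0-9]{1,2})\2([0-9]{2}|[0-9]{4})")


def why_not_a_date(text: str):
    """Return None if text is a valid m/d/y date, else a short reason string."""
    shape = _SHAPE_RE.fullmatch(text)
    if shape is None:
        return "does not look like m/d/y"
    # Phase 2: range checks on the individual fields.
    month, day, year = int(shape.group(1)), int(shape.group(3)), int(shape.group(4))
    if len(shape.group(4)) == 2:
        year += 1900  # assumption: two-digit years mean 19xx
    if not 1600 <= year <= 9999:
        return "year out of range"
    if not 1 <= month <= 12:
        return "month out of range"
    # Phase 3: the day has to exist in that particular month and year.
    if not 1 <= day <= calendar.monthrange(year, month)[1]:
        return "day out of range for that month"
    return None
```

Each phase is simple on its own, and the return value says which phase rejected the input, which a single monolithic regex cannot do.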

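Another route is to use LL or LR parsing. Some languages cannot be expressed as regular expressions at all, possibly not even with Perl's non-FSM extensions.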

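Answer 7 (score: 4)

Wow, that is ugly. It looks like it should work, modulo an unavoidable bug in how it handles 00 as a two-digit year (it should be a leap year one quarter of the time, but without the century you have no way of knowing which it should be). There is a lot of redundancy that should probably be factored out into sub-regexes, and I would create three sub-regexes for the three main cases (that is my next project tonight). I also used a different delimiter character to avoid having to escape forward slashes, changed the single-character alternations into character classes (which happily lets us avoid escaping the period), and changed \d to [0-9], since the former matches any digit character (including U+1815 MONGOLIAN DIGIT FIVE: ᠕) in Perl 5.8 and 5.10.

Warning, untested code: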

#!/usr/bin/perl

use strict;
use warnings;

my $match_date = qr{
    #match 29th - 31st of all months but 2 for the years 1600 - 9999
    #with optionally leaving off the first two digits of the year
    ^
    (?: 
        #match the 31st of 1, 3, 5, 7, 8, 10, and 12
        (?: (?: 0? [13578] | 1[02] ) ([/.-]) 31) \1
        |
        #or match the 29th and 30th of all months but 2
        (?: (?: 0? [13-9] | 1[0-2] ) ([/.-]) (?:29|30) \2)
    )
    (?:
        (?:                      #optionally match the century
            1[6-9] |         #16 - 19
            [2-9][0-9]       #20 - 99
        )?
        [0-9]{2}                 #match the decade
    )
    $
    |
    #or match 29 for 2 for leap years
    ^
    (?:
    #FIXME: 00 is treated as a non leap year 
    #even though 2000, 2400, etc are leap years
        0?2                      #month 2
        ([/.-])                  #separator
        29                       #29th
        \3                       #separator from before
        (?:                      #leap years
            (?:
                #match rule 1 (div 4) minus rule 2 (div 100)
                (?: #match any century
                    1[6-9] |
                    [2-9][0-9]
                )?
                (?: #match decades divisible by 4 but not 100
                    0[48]       | 
                    [2468][048] |
                    [13579][26]
                )
                |
                #or match rule 3 (div 400)
                (?:
                    (?: #match centuries that are divisible by 4
                        16          | 
                        [2468][048] |
                        [3579][26]
                    )
                    00
                )
            )
        )
    )
    $
    |
    #or match 1st through 28th for all months between 1600 and 9999
    ^
    (?: (?: 0?[1-9]) | (?:1[0-2] ) ) #all months
    ([/.-])                          #separator
    (?: 
        0?[1-9] |                #1st -  9th  or
        1[0-9]  |                #10th - 19th or
        2[0-8]                   #20th - 28th
    )
    \4                               #separator from before
    (?:                              
        (?:                      #optionally match the century
            1[6-9] |         #16 - 19
            [2-9][0-9]       #20 - 99
        )?
        [0-9]{2}                 #match the decade
    )
    $
}x;

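Answer 8 (score: 3)

  Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski in comp.lang.emacs.

Keep the regular expression as simple as it can possibly be (KISS). In your date example, I'd likely use one regular expression for each date type.

Or, even better, replace it with a library (i.e. a date-parsing library).

I'd also take steps to ensure that the input source has some restrictions (i.e. only one type of date string, ideally ISO-8601).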
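As one possible illustration of the library route, Python's standard library can already parse and validate ISO-8601 dates. The wrapper name `parse_iso_date` is hypothetical:

```python
from datetime import date
from typing import Optional


def parse_iso_date(text: str) -> Optional[date]:
    """Parse a YYYY-MM-DD string with the standard library, or return None."""
    try:
        # fromisoformat rejects impossible dates such as 2001-02-29.
        return date.fromisoformat(text)
    except ValueError:
        return None
```

Note that from Python 3.11 onward `date.fromisoformat` accepts additional ISO-8601 spellings (for example the compact form without dashes), so if the input must be exactly YYYY-MM-DD an extra length or shape check may still be wanted.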

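Also,

  • One thing at a time (with the possible exception of extracting values)
  • Advanced constructs are fine if used correctly (as in simplifying the expression and thus reducing maintenance)

EDIT:

  "advanced constructs lead to maintenance issues"

My original point was that, if used correctly, they should lead to simpler expressions, not more difficult ones. Simpler expressions should reduce maintenance.

I've updated the text above to say as much.

I would point out that regular expressions hardly qualify as advanced constructs in and of themselves. Not being familiar with a certain construct does not make it an advanced construct, merely an unfamiliar one. That does not change the fact that regular expressions are powerful, compact and, if used properly, elegant. Much like a scalpel, everything depends on the hands of the one who wields it.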

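Answer 9 (score: 1)

I could still work with it. I'd just use Regulator. One thing it lets you do is save the regex along with test data for it.

Of course, I might also add comments.

Here's what Expresso produced. I'd never used it before, but now Regulator is out of a job: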

//  using System.Text.RegularExpressions;

/// 
///  Regular expression built for C# on: Thu, Apr 2, 2009, 12:51:56 AM
///  Using Expresso Version: 3.0.3276, http://www.ultrapico.com
///  
///  A description of the regular expression:
///  
///  Select from 3 alternatives
///      ^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$
///          Beginning of line or string
///          Match expression but don't capture it. [(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2)]
///              Select from 2 alternatives
///                  (?:(?:0?[13578]|1[02])(\/|-|\.)31)\1
///                      Match expression but don't capture it. [(?:0?[13578]|1[02])(\/|-|\.)31]
///                          (?:0?[13578]|1[02])(\/|-|\.)31
///                              Match expression but don't capture it. [0?[13578]|1[02]]
///                                  Select from 2 alternatives
///                                      0?[13578]
///                                          0, zero or one repetitions
///                                          Any character in this class: [13578]
///                                      1[02]
///                                          1
///                                          Any character in this class: [02]
///                              [1]: A numbered capture group. [\/|-|\.]
///                                  Select from 3 alternatives
///                                      Literal /
///                                      -
///                                      Literal .
///                              31
///                      Backreference to capture number: 1
///                  (?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2)
///                      Return
///                      New line
///                      Match expression but don't capture it. [(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2]
///                          (?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2
///                              Match expression but don't capture it. [0?[13-9]|1[0-2]]
///                                  Select from 2 alternatives
///                                      0?[13-9]
///                                          0, zero or one repetitions
///                                          Any character in this class: [13-9]
///                                      1[0-2]
///                                          1
///                                          Any character in this class: [0-2]
///                              [2]: A numbered capture group. [\/|-|\.]
///                                  Select from 3 alternatives
///                                      Literal /
///                                      -
///                                      Literal .
///                              Match expression but don't capture it. [29|30]
///                                  Select from 2 alternatives
///                                      29
///                                          29
///                                      30
///                                          30
///                              Backreference to capture number: 2
///          Return
///          New line
///          Match expression but don't capture it. [(?:1[6-9]|[2-9]\d)?\d{2}]
///              (?:1[6-9]|[2-9]\d)?\d{2}
///                  Match expression but don't capture it. [1[6-9]|[2-9]\d], zero or one repetitions
///                      Select from 2 alternatives
///                          1[6-9]
///                              1
///                              Any character in this class: [6-9]
///                          [2-9]\d
///                              Any character in this class: [2-9]
///                              Any digit
///                  Any digit, exactly 2 repetitions
///          End of line or string
///      ^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$
///          Beginning of line or string
///          Match expression but don't capture it. [0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00)))]
///              0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00)))
///                  0, zero or one repetitions2
///                  [3]: A numbered capture group. [\/|-|\.]
///                      Select from 3 alternatives
///                          Literal /
///                          -
///                          Literal .
///                  29
///                  Backreference to capture number: 3
///                  Match expression but don't capture it. [(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))]
///                      Match expression but don't capture it. [(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00)]
///                          Select from 2 alternatives
///                              (?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])
///                                  Match expression but don't capture it. [1[6-9]|[2-9]\d], zero or one repetitions
///                                      Select from 2 alternatives
///                                          1[6-9]
///                                              1
///                                              Any character in this class: [6-9]
///                                          [2-9]\d
///                                              Any character in this class: [2-9]
///                                              Any digit
///                                  Match expression but don't capture it. [0[48]|[2468][048]|[13579][26]]
///                                      Select from 3 alternatives
///                                          0[48]
///                                              0
///                                              Any character in this class: [48]
///                                          [2468][048]
///                                              Any character in this class: [2468]
///                                              Any character in this class: [048]
///                                          [13579][26]
///                                              Any character in this class: [13579]
///                                              Any character in this class: [26]
///                              (?:(?:16|[2468][048]|[3579][26])00)
///                                  Return
///                                  New line
///                                  Match expression but don't capture it. [(?:16|[2468][048]|[3579][26])00]
///                                      (?:16|[2468][048]|[3579][26])00
///                                          Match expression but don't capture it. [16|[2468][048]|[3579][26]]
///                                              Select from 3 alternatives
///                                                  16
///                                                      16
///                                                  [2468][048]
///                                                      Any character in this class: [2468]
///                                                      Any character in this class: [048]
///                                                  [3579][26]
///                                                      Any character in this class: [3579]
///                                                      Any character in this class: [26]
///                                          00
///          End of line or string
///      ^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
///          Beginning of line or string
///          Match expression but don't capture it. [(?:0?[1-9])|(?:1[0-2])]
///              Select from 2 alternatives
///                  Match expression but don't capture it. [0?[1-9]]
///                      0?[1-9]
///                          0, zero or one repetitions
///                          Any character in this class: [1-9]
///                  Match expression but don't capture it. [1[0-2]]
///                      1[0-2]
///                          1
///                          Any character in this class: [0-2]
///          Return
///          New line
///          [4]: A numbered capture group. [\/|-|\.]
///              Select from 3 alternatives
///                  Literal /
///                  -
///                  Literal .
///          Match expression but don't capture it. [0?[1-9]|1\d|2[0-8]]
///              Select from 3 alternatives
///                  0?[1-9]
///                      0, zero or one repetitions
///                      Any character in this class: [1-9]
///                  1\d
///                      1
///                      Any digit
///                  2[0-8]
///                      2
///                      Any character in this class: [0-8]
///          Backreference to capture number: 4
///          Match expression but don't capture it. [(?:1[6-9]|[2-9]\d)?\d{2}]
///              (?:1[6-9]|[2-9]\d)?\d{2}
///                  Match expression but don't capture it. [1[6-9]|[2-9]\d], zero or one repetitions
///                      Select from 2 alternatives
///                          1[6-9]
///                              1
///                              Any character in this class: [6-9]
///                          [2-9]\d
///                              Any character in this class: [2-9]
///                              Any digit
///                  Any digit, exactly 2 repetitions
///          End of line or string
///  
///
/// 
public static Regex regex = new Regex(
      "^(?:(?:(?:0?[13578]|1[02])(\\/|-|\\.)31)\\1|\r\n(?:(?:0?[13-9]"+
      "|1[0-2])(\\/|-|\\.)(?:29|30)\\2))\r\n(?:(?:1[6-9]|[2-9]\\d)?\\d"+
      "{2})$|^(?:0?2(\\/|-|\\.)29\\3(?:(?:(?:1[6-9]|[2-9]\\d)?(?:0["+
      "48]|[2468][048]|[13579][26])|\r\n(?:(?:16|[2468][048]|[3579][2"+
      "6])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))\r\n(\\/|-|\\.)(?:0?[1-9"+
      "]|1\\d|2[0-8])\\4(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$",
    RegexOptions.CultureInvariant
    | RegexOptions.Compiled
    );

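Answer 10 (score: 1)

I think the answer to maintaining regular expressions lies not so much in commenting or in regex constructs.

If I were tasked with debugging the example you gave, I would sit down in front of a regex debugging tool (like Regex Coach) and step through the regular expression on the data it is supposed to process.

Answer 11 (score: 1)

I recently posted a question about commenting regexes with embedded comments. There were some useful answers, particularly one from @mikej:

  See the post by Martin Fowler on ComposedRegex for some more ideas on improving regexp readability. In summary, he advocates breaking down a complex regexp into smaller parts which can be given meaningful variable names. e.g.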
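The example from that post is not reproduced here. As a rough illustration of the idea in Python (this is my own sketch with made-up names, not Fowler's code), the date shape from the question could be composed from named pieces:

```python
import re

# Each piece is small enough to read at a glance. All names are hypothetical.
SEP = r"[/.-]"
MONTH = r"(?:0?[1-9]|1[0-2])"
DAY = r"(?:0?[1-9]|[12][0-9]|3[01])"
YEAR = r"(?:[0-9]{4}|[0-9]{2})"

# (?P=sep) is a named backreference: the second separator must equal the first.
MDY_SHAPE = re.compile(
    rf"(?P<month>{MONTH})(?P<sep>{SEP})(?P<day>{DAY})(?P=sep)(?P<year>{YEAR})"
)

match = MDY_SHAPE.fullmatch("11-30-2001")
```

This only checks the shape of the string (it would still accept 2/31/2000), so calendar rules such as month lengths and leap years would have to be checked separately.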

Answer 12 (score: 0)

I don't expect regular expressions to be readable, so I just leave them as they are and rewrite them if needed.