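I have started to feel that using regular expressions decreases code maintainability. There is something evil about the terseness and power of regular expressions. Perl compounds this with side effects such as default operators.
I do have a habit of documenting regular expressions with at least one sentence giving the basic intent and at least one example of what would match.
Because regular expressions are built up from parts, I feel it is an absolute necessity to comment on the largest components of each element in the expression. Despite this, even my own regular expressions leave me scratching my head as though I were reading Klingon.
Do you intentionally dumb down your regular expressions? Do you decompose possibly shorter and more powerful ones into simpler steps? I have given up on nesting regular expressions. Are there regular expression constructs that you avoid because of maintainability issues?
Do not let this example cloud the question.
If the following regex by Michael Ash had some sort of bug in it, would you have any prospect of doing anything other than throwing it away entirely?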
^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
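As requested, the exact purpose can be found by following Mr. Ash's link.
Matches: 01.1.02 | 11-30-2001 | 2/29/2000
Non-matches: 02/29/01 | 13/01/2002 | 11/00/02
Answer 0: (score: 32)
Use Expresso, which gives a hierarchical, English breakdown of a regex.
Or
This tip from Darren Neimke:
.NET allows regular expression patterns to be authored with embedded comments via the RegexOptions.IgnorePatternWhitespace compiler option and the (?#...) syntax embedded within each line of the pattern string.
This allows pseudo-code-like comments to be embedded in each line and has the following effect on readability: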
Dim re As New Regex ( _
    "(?<=       (?# Start a positive lookBEHIND assertion ) " & _
    "(\#|@)     (?# Find a # or a @ symbol ) " & _
    ")          (?# End the lookBEHIND assertion ) " & _
    "(?=        (?# Start a positive lookAHEAD assertion ) " & _
    "   \w+     (?# Find at least one word character ) " & _
    ")          (?# End the lookAHEAD assertion ) " & _
    "\w+\b      (?# Match multiple word characters leading up to a word boundary)", _
    RegexOptions.Multiline Or RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace _
)
Here is another .NET example (requires the RegexOptions.Multiline and RegexOptions.IgnorePatternWhitespace options):
static string validEmail = @"\b    # Find a word boundary
    (?<Username>                   # Begin group: Username
      [a-zA-Z0-9._%+-]+            #   Characters allowed in username, 1 or more
    )                              # End group: Username
    @                              # The e-mail '@' character
    (?<Domainname>                 # Begin group: Domain name
      [a-zA-Z0-9.-]+               #   Domain name(s), we include a dot so that
                                   #   mail.somewhere is also possible
      \.[a-zA-Z]{2,4}              #   The top level domain can only be 4 characters
                                   #   So .info works, .telephone doesn't.
    )                              # End group: Domain name
    \b                             # Ending on a word boundary
";
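If your regex applies to a common problem, another option is to document it and submit it to RegExLib, where it will be rated and commented upon. Nothing beats many pairs of eyes...
Another regex tool is The Regulator.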
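The same commenting style exists outside .NET. As a minimal sketch, assuming Python's standard `re` module, the e-mail pattern from this answer could be written with `re.VERBOSE` and named groups. The names `VALID_EMAIL` and `split_email` are made up for this illustration and are not part of the original tip.

```python
import re

# Illustrative port of the .NET e-mail pattern; all names are assumptions.
VALID_EMAIL = re.compile(
    r"""
    \b                      # find a word boundary
    (?P<username>           # begin group: username
      [a-zA-Z0-9._%+-]+     #   characters allowed in the username, 1 or more
    )                       # end group: username
    @                       # the e-mail '@' character
    (?P<domainname>         # begin group: domain name
      [a-zA-Z0-9.-]+        #   domain label(s); the dot allows mail.somewhere
      \.[a-zA-Z]{2,4}       #   top-level domain of 2 to 4 letters
    )                       # end group: domain name
    \b                      # end on a word boundary
    """,
    re.VERBOSE,
)


def split_email(text):
    """Return (username, domainname) for the first match in text, or None."""
    match = VALID_EMAIL.search(text)
    if match is None:
        return None
    return match.group("username"), match.group("domainname")
```

Reading the parts back by name (`match.group("username")`) rather than by position keeps the calling code stable if groups are added or reordered later.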
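Answer 1: (score: 19)
I usually just try to wrap all my regular expression calls inside their own function, with a meaningful name and some basic comments. I like to think of regular expressions as a write-only language, readable only by the one who wrote it (unless it's really simple). I fully expect that someone would probably need to completely rewrite the expression if they had to change its intent, and this is probably for the better, to keep the regular expression training alive.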
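A minimal sketch of that habit, written in Python purely for illustration. The function name `extract_order_id` and the `ORD-` identifier format are invented for this example.

```python
import re

# Matches identifiers such as "ORD-004217": the literal prefix "ORD-"
# followed by exactly six digits. The format itself is hypothetical.
_ORDER_ID = re.compile(r"\bORD-([0-9]{6})\b")


def extract_order_id(line):
    """Return the six-digit order number found in line, or None.

    Example: "shipped ORD-004217 today" -> "004217"
    """
    match = _ORDER_ID.search(line)
    return match.group(1) if match else None
```

Callers depend only on the function name and its docstring, so the pattern can be rewritten wholesale without touching them.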
Answer 2: (score: 17)
Well, the entire purpose in life of the PCRE /x modifier is to let you write regexes more readably, as in this trivial example:
my $expr = qr/
    [a-z]    # match a lower-case letter
    \d{3,5}  # followed by 3-5 digits
/x;
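Answer 3: (score: 8)
Some people use REs for the wrong things (I'm waiting for the first SO question on how to detect a valid C++ program using a single RE).
I usually find that, if I can't fit my RE within 60 characters, it's better off being a piece of code, since that will almost always be more readable.
In any case, I always document, in the code, what the RE is supposed to achieve, in great detail. This is because I know, from bitter experience, how hard it is for someone else (or even me, six months later) to come in and try to understand it.
I don't believe they're evil, although I do believe some people who use them are evil (not looking at you, Michael Ash :-). They're a great tool but, like a chainsaw, you'll cut your legs off if you don't know how to use them properly.
UPDATE: Actually, I've just followed the link to that monstrosity, and it's meant to validate m/d/y format dates between the years 1600 and 9999. That is a classic case where full-blown code would be more readable and maintainable.
You just split it into three fields and check the individual values. I'd almost consider it an offense worthy of termination if one of my minions brought this to me. I'd certainly send them back to write it properly.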
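As a rough sketch of that approach, assuming the same rules the regex encodes (m/d/y, one of `/`, `-` or `.` used consistently as the separator, and either a two-digit year or a four-digit year from 1600 to 9999), the check might look like this in Python. The name `is_valid_mdy` is made up, and reading two-digit years as 19xx is an assumption borrowed from the Perl test script in another answer.

```python
import calendar


def is_valid_mdy(text):
    """Check an m/d/y date field by field instead of with one big regex."""
    for sep in "/-.":
        parts = text.split(sep)
        if len(parts) == 3:
            break
    else:
        return False  # no single separator splits the text into 3 fields

    # Reject signs, spaces and non-ASCII digits before calling int().
    if not all(p.isascii() and p.isdigit() for p in parts):
        return False
    month_text, day_text, year_text = parts
    if len(month_text) > 2 or len(day_text) > 2:
        return False

    month, day = int(month_text), int(day_text)
    if len(year_text) == 2:
        year = 1900 + int(year_text)  # assumption: two-digit years mean 19xx
    elif len(year_text) == 4 and 1600 <= int(year_text) <= 9999:
        year = int(year_text)
    else:
        return False

    if not 1 <= month <= 12:
        return False
    # monthrange() returns (weekday of the 1st, number of days in the month).
    return 1 <= day <= calendar.monthrange(year, month)[1]
```

Each rule sits on its own line, so changing the accepted year range or the two-digit-year policy is a one-line edit rather than regex surgery.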
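Answer 4: (score: 5)
Here is the same regex broken down into digestible pieces. In addition to being more readable, some of the sub-regexes could be useful on their own. It is also significantly easier to change the allowed separators.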
#!/usr/local/ActivePerl-5.10/bin/perl

use 5.010; #only 5.10 and above
use strict;
use warnings;

my $sep         = qr{ [/.-] }x;               #allowed separators
my $any_century = qr/ 1[6-9] | [2-9][0-9] /x; #match the century
my $any_decade  = qr/ [0-9]{2} /x;            #match any decade or 2 digit year
my $any_year    = qr/ $any_century? $any_decade /x; #match a 2 or 4 digit year

#match the 1st through 28th for any month of any year
my $start_of_month = qr/
    (?:                         #match
        0?[1-9] |               #Jan - Sep or
        1[0-2]                  #Oct - Dec
    )
    ($sep)                      #the separator
    (?:
        0?[1-9] |               # 1st -  9th or
        1[0-9]  |               #10th - 19th or
        2[0-8]                  #20th - 28th
    )
    \g{-1}                      #and the separator again
/x;

#match 29th - 31st for any month but Feb for any year
my $end_of_month = qr/
    (?:
        (?: 0?[13578] | 1[02] ) #match Jan, Mar, May, Jul, Aug, Oct, Dec
        ($sep)                  #the separator
        31                      #the 31st
        \g{-1}                  #and the separator again
        |                       #or
        (?: 0?[13-9] | 1[0-2] ) #match all months but Feb
        ($sep)                  #the separator
        (?:29|30)               #the 29th or the 30th
        \g{-1}                  #and the separator again
    )
/x;

#match any non-leap year date and the first part of Feb in leap years
my $non_leap_year = qr/ (?: $start_of_month | $end_of_month ) $any_year/x;

#match 29th of Feb in leap years
#BUG: 00 is treated as a non leap year
#even though 2000, 2400, etc are leap years
my $feb_in_leap = qr/
    0?2                         #match Feb
    ($sep)                      #the separator
    29                          #the 29th
    \g{-1}                      #the separator again
    (?:
        $any_century?           #any century
        (?:                     #and decades divisible by 4 but not 100
            0[48]       |
            [2468][048] |
            [13579][26]
        )
        |
        (?:                     #or match centuries that are divisible by 4
            16          |
            [2468][048] |
            [3579][26]
        )
        00
    )
/x;

my $any_date  = qr/$non_leap_year|$feb_in_leap/;
my $only_date = qr/^$any_date$/;

say "test against garbage";
for my $date (qw(022900 foo 1/1/1)) {
    #plain =~ is used instead of smartmatch (~~), which newer Perls deprecate
    say "\t$date ", $date =~ $only_date ? "matched" : "didn't match";
}
say '';

#comprehensive test
my @code = qw/good unmatch month day year leap/;
for my $sep (qw( / - . )) {
    say "testing $sep";
    my $i = 0;
    for my $y ("00" .. "99", 1600 .. 9999) {
        say "\t", int $i/8500*100, "% done" if $i++ and not $i % 850;
        for my $m ("00" .. "09", 0 .. 13) {
            for my $d ("00" .. "09", 1 .. 31) {
                my $date = join $sep, $m, $d, $y;
                my $re   = $date =~ $only_date || 0;
                my $code = not_valid($date);
                unless ($re == !$code) {
                    die "error $date re $re code $code[$code]\n"
                }
            }
        }
    }
}

sub not_valid {
    state $end = [undef, 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31];
    my $date = shift;
    my ($m,$d,$y) = $date =~ m{([0-9]+)[-./]([0-9]+)[-./]([0-9]+)};
    return 1 unless defined $m; #if $m is set, the rest will be too
    #components are in roughly the right ranges
    return 2 unless $m >= 1 and $m <= 12;
    return 3 unless $d >= 1 and $d <= $end->[$m];
    return 4 unless ($y >= 0 and $y <= 99) or ($y >= 1600 and $y <= 9999);
    #handle the non leap year case
    return 5 if $m == 2 and $d == 29 and not leap_year($y);
    return 0;
}

sub leap_year {
    my $y = shift;
    $y = "19$y" if $y < 1600;
    return 1 if 0 == $y % 4 and 0 != $y % 100 or 0 == $y % 400;
    return 0;
}
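Answer 5: (score: 4)
I have learned to avoid all but the simplest regexps. I far prefer other models such as Icon's string scanning or Haskell's parsing combinators. In both of these models you can write user-defined code that has the same privileges and status as the built-in string ops. If I were programming in Perl I would probably rig up some parsing combinators in Perl; I've done it for other languages.
A very nice alternative is to use Parsing Expression Grammars, as Roberto Ierusalimschy has done with his LPEG package, but unlike parser combinators this is something you can't whip up in an afternoon. But if somebody has already done PEGs for your platform, it's a very nice alternative to regular expressions.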
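To give a flavour of the "whip up in an afternoon" claim, here is a toy parser-combinator sketch in Python. It is not Icon's or Haskell's machinery and not LPEG; every name in it (`digits`, `one_of`, `sequence`, `check`, `parse_date`) is invented for this illustration. Each parser is a plain function taking `(text, pos)` and returning `(value, new_pos)` or `None`, and it only checks the rough shape and ranges of an m/d/y date.

```python
def digits(min_len, max_len):
    """Parser for a run of min_len..max_len ASCII digits, yielding an int."""
    def parse(text, pos):
        end = pos
        while (end < len(text) and end - pos < max_len
               and text[end] in "0123456789"):
            end += 1
        if end - pos < min_len:
            return None
        return int(text[pos:end]), end
    return parse


def one_of(chars):
    """Parser for a single character drawn from chars."""
    def parse(text, pos):
        if pos < len(text) and text[pos] in chars:
            return text[pos], pos + 1
        return None
    return parse


def sequence(*parsers):
    """Run parsers one after another, collecting their values in a list."""
    def parse(text, pos):
        values = []
        for parser in parsers:
            result = parser(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse


def check(parser, predicate):
    """Accept parser's result only if predicate(value) holds."""
    def parse(text, pos):
        result = parser(text, pos)
        if result is None or not predicate(result[0]):
            return None
        return result
    return parse


# User-defined parsers compose exactly like the primitive ones.
month = check(digits(1, 2), lambda m: 1 <= m <= 12)
day = check(digits(1, 2), lambda d: 1 <= d <= 31)
year = digits(2, 4)
separator = one_of("/-.")
date = sequence(month, separator, day, separator, year)


def parse_date(text):
    """Return (month, day, year) if the whole text parses, else None."""
    result = date(text, 0)
    if result is None or result[1] != len(text):
        return None
    m, _, d, _, y = result[0]
    return m, d, y
```

The point of the design is that `month` and `day` are ordinary values built from ordinary functions, so range checks live next to the grammar instead of being encoded as character classes.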
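Answer 6: (score: 4)
A good way I've found is to simply break up the matching process into several phases. It probably doesn't execute as fast, but you have the added bonus of being able to tell, at a finer-grained level, why the match isn't occurring.
Another route is to use LL or LR parsing. Some languages cannot be expressed as regular expressions, probably not even with Perl's non-FSM extensions.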
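A small sketch of the phased approach in Python, using character classes loosely based on the e-mail pattern quoted in an earlier answer. The function name `why_not_email` and the wording of the messages are assumptions made for this example.

```python
import re

USERNAME = re.compile(r"[a-zA-Z0-9._%+-]+")
DOMAIN = re.compile(r"[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}")


def why_not_email(text):
    """Return None if text passes every phase, else the reason it failed."""
    # Phase 1: overall shape, exactly one '@'.
    if text.count("@") != 1:
        return "expected exactly one '@'"
    username, domain = text.split("@")

    # Phase 2: the part before the '@'.
    if not USERNAME.fullmatch(username):
        return "invalid username: %r" % username

    # Phase 3: the part after the '@'.
    if not DOMAIN.fullmatch(domain):
        return "invalid domain: %r" % domain
    return None
```

A single combined regex could only answer "no match"; the phased version says which part was rejected, at the cost of a few extra passes over the string.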
Answer 7: (score: 4)
Change \d to [0-9], because the former matches any digit character (including U+1815 MONGOLIAN DIGIT FIVE: ᠕) in Perl 5.8 and 5.10.
Warning, untested code:
#!/usr/bin/perl

use strict;
use warnings;

my $match_date = qr{
    #match 29th - 31st of all months but 2 for the years 1600 - 9999
    #with optionally leaving off the first two digits of the year
    ^
    (?:
        #match the 31st of 1, 3, 5, 7, 8, 10, and 12
        (?: (?: 0? [13578] | 1[02] ) ([/.-]) 31) \1
        |
        #or match the 29th and 30th of all months but 2
        (?: (?: 0? [13-9] | 1[0-2] ) ([/.-]) (?:29|30) \2)
    )
    (?:
        (?:                     #optionally match the century
            1[6-9] |            #16 - 19
            [2-9][0-9]          #20 - 99
        )?
        [0-9]{2}                #match the decade
    )
    $
    |
    #or match 29 for 2 for leap years
    ^
    (?:
        #FIXME: 00 is treated as a non leap year
        #even though 2000, 2400, etc are leap years
        0?2                     #month 2
        ([/.-])                 #separator
        29                      #29th
        \3                      #separator from before
        (?:                     #leap years
            (?:
                #match rule 1 (div 4) minus rule 2 (div 100)
                (?:             #match any century
                    1[6-9] |
                    [2-9][0-9]
                )?
                (?:             #match decades divisible by 4 but not 100
                    0[48]       |
                    [2468][048] |
                    [13579][26]
                )
                |
                #or match rule 3 (div 400)
                (?:
                    (?:         #match centuries that are divisible by 4
                        16          |
                        [2468][048] |
                        [3579][26]
                    )
                    00
                )
            )
        )
    )
    $
    |
    #or match 1st through 28th for all months between 1600 and 9999
    ^
    (?: (?: 0?[1-9]) | (?:1[0-2] ) ) #all months
    ([/.-])                     #separator
    (?:
        0?[1-9] |               #1st -  9th or
        1[0-9]  |               #10th - 19th or
        2[0-8]                  #20th - 28th
    )
    \4                          #separator from before
    (?:
        (?:                     #optionally match the century
            1[6-9] |            #16 - 19
            [2-9][0-9]          #20 - 99
        )?
        [0-9]{2}                #match the decade
    )
    $
}x;
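The `\d` versus `[0-9]` point is not specific to Perl. As a small illustration in Python 3 (chosen only because it is easy to run; the helper names are made up), `\d` in a `str` pattern also matches non-ASCII decimal digits unless `re.ASCII` is set:

```python
import re

MONGOLIAN_FIVE = "\u1815"  # U+1815 MONGOLIAN DIGIT FIVE


def matches_backslash_d(ch):
    """True if \\d matches ch under Python's default Unicode rules."""
    return re.fullmatch(r"\d", ch) is not None


def matches_ascii_class(ch):
    """True if the explicit class [0-9] matches ch."""
    return re.fullmatch(r"[0-9]", ch) is not None


def matches_backslash_d_ascii(ch):
    """True if \\d matches ch when re.ASCII restricts it to 0-9."""
    return re.fullmatch(r"\d", ch, re.ASCII) is not None
```

Spelling out `[0-9]` makes the intent visible in the pattern itself, which is arguably easier to maintain than relying on a flag set somewhere else.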
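Answer 8: (score: 3)
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski in comp.lang.emacs.
Keep the regular expressions as simple as they can possibly be (KISS). In your date example, I'd likely use one regular expression for each date type.
Or, even better, replace it with a library (i.e. a date-parsing library).
I'd also take steps to ensure that the input source has some restrictions (i.e. only one type of date string, ideally ISO-8601).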
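As one possible illustration of the library route, assuming the input has already been restricted to ISO-8601 calendar dates, Python's standard `datetime` module can do the validation. The wrapper name `parse_iso_date` is invented for this sketch.

```python
from datetime import date


def parse_iso_date(text):
    """Parse an ISO-8601 date such as "2000-02-29", or return None."""
    try:
        # fromisoformat() raises ValueError for malformed or impossible dates.
        return date.fromisoformat(text)
    except ValueError:
        return None
```

Leap years, month lengths and range checks are then the library's problem, and no date regex needs to be maintained at all.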
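Edit:
"advanced constructs lead to maintenance issues"
My original point was that, if used correctly, they should lead to simpler expressions, not more difficult ones. Simpler expressions should reduce maintenance.
I've updated the text above to say as much.
I would point out that regular expressions hardly qualify as advanced constructs in and of themselves. Not being familiar with a certain construct does not make it an advanced construct, merely an unfamiliar one. That does not change the fact that regular expressions are powerful, compact and, if used properly, elegant. Much like a scalpel, everything depends on the hand that wields it.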
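Answer 9: (score: 1)
I could still work with it. I'd just use Regulator. One thing it allows you to do is save the regex along with test data for it.
Of course, I might also add comments.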
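Keeping test data next to the pattern does not require a dedicated tool. A minimal sketch in Python, using the regex and the sample strings from the question (the names `DATE_RE`, `SHOULD_MATCH`, `SHOULD_NOT_MATCH` and `failing_cases` are made up for this illustration):

```python
import re

# Michael Ash's date regex from the question, split across adjacent
# raw-string literals only to keep the lines short.
DATE_RE = re.compile(
    r"^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|"
    r"(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))"
    r"(?:(?:1[6-9]|[2-9]\d)?\d{2})$|"
    r"^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?"
    r"(?:0[48]|[2468][048]|[13579][26])|"
    r"(?:(?:16|[2468][048]|[3579][26])00))))$|"
    r"^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)"
    r"(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$"
)

# Test data kept next to the pattern (taken from the question).
SHOULD_MATCH = ["01.1.02", "11-30-2001", "2/29/2000"]
SHOULD_NOT_MATCH = ["02/29/01", "13/01/2002", "11/00/02"]


def failing_cases():
    """Return the test strings the pattern classifies wrongly."""
    wrong = [s for s in SHOULD_MATCH if not DATE_RE.match(s)]
    wrong += [s for s in SHOULD_NOT_MATCH if DATE_RE.match(s)]
    return wrong
```

Anyone who later edits the pattern can rerun `failing_cases()` and add new strings to the two lists as bugs are found.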
Here's what Expresso produced. I had never used it before, but now Regulator is out of a job:
// using System.Text.RegularExpressions;

///
/// Regular expression built for C# on: Thu, Apr 2, 2009, 12:51:56 AM
/// Using Expresso Version: 3.0.3276, http://www.ultrapico.com
///
/// A description of the regular expression:
///
/// Select from 3 alternatives
/// ^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$
/// Beginning of line or string
/// Match expression but don't capture it. [(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2)]
/// Select from 2 alternatives
/// (?:(?:0?[13578]|1[02])(\/|-|\.)31)\1
/// Match expression but don't capture it. [(?:0?[13578]|1[02])(\/|-|\.)31]
/// (?:0?[13578]|1[02])(\/|-|\.)31
/// Match expression but don't capture it. [0?[13578]|1[02]]
/// Select from 2 alternatives
/// 0?[13578]
/// 0, zero or one repetitions
/// Any character in this class: [13578]
/// 1[02]
/// 1
/// Any character in this class: [02]
/// [1]: A numbered capture group. [\/|-|\.]
/// Select from 3 alternatives
/// Literal /
/// -
/// Literal .
/// 31
/// Backreference to capture number: 1
/// (?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2)
/// Return
/// New line
/// Match expression but don't capture it. [(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2]
/// (?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2
/// Match expression but don't capture it. [0?[13-9]|1[0-2]]
/// Select from 2 alternatives
/// 0?[13-9]
/// 0, zero or one repetitions
/// Any character in this class: [13-9]
/// 1[0-2]
/// 1
/// Any character in this class: [0-2]
/// [2]: A numbered capture group. [\/|-|\.]
/// Select from 3 alternatives
/// Literal /
/// -
/// Literal .
/// Match expression but don't capture it. [29|30]
/// Select from 2 alternatives
/// 29
/// 29
/// 30
/// 30
/// Backreference to capture number: 2
/// Return
/// New line
/// Match expression but don't capture it. [(?:1[6-9]|[2-9]\d)?\d{2}]
/// (?:1[6-9]|[2-9]\d)?\d{2}
/// Match expression but don't capture it. [1[6-9]|[2-9]\d], zero or one repetitions
/// Select from 2 alternatives
/// 1[6-9]
/// 1
/// Any character in this class: [6-9]
/// [2-9]\d
/// Any character in this class: [2-9]
/// Any digit
/// Any digit, exactly 2 repetitions
/// End of line or string
/// ^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$
/// Beginning of line or string
/// Match expression but don't capture it. [0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00)))]
/// 0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00)))
/// 0, zero or one repetitions
/// 2
/// [3]: A numbered capture group. [\/|-|\.]
/// Select from 3 alternatives
/// Literal /
/// -
/// Literal .
/// 29
/// Backreference to capture number: 3
/// Match expression but don't capture it. [(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))]
/// Match expression but don't capture it. [(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00)]
/// Select from 2 alternatives
/// (?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])
/// Match expression but don't capture it. [1[6-9]|[2-9]\d], zero or one repetitions
/// Select from 2 alternatives
/// 1[6-9]
/// 1
/// Any character in this class: [6-9]
/// [2-9]\d
/// Any character in this class: [2-9]
/// Any digit
/// Match expression but don't capture it. [0[48]|[2468][048]|[13579][26]]
/// Select from 3 alternatives
/// 0[48]
/// 0
/// Any character in this class: [48]
/// [2468][048]
/// Any character in this class: [2468]
/// Any character in this class: [048]
/// [13579][26]
/// Any character in this class: [13579]
/// Any character in this class: [26]
/// (?:(?:16|[2468][048]|[3579][26])00)
/// Return
/// New line
/// Match expression but don't capture it. [(?:16|[2468][048]|[3579][26])00]
/// (?:16|[2468][048]|[3579][26])00
/// Match expression but don't capture it. [16|[2468][048]|[3579][26]]
/// Select from 3 alternatives
/// 16
/// 16
/// [2468][048]
/// Any character in this class: [2468]
/// Any character in this class: [048]
/// [3579][26]
/// Any character in this class: [3579]
/// Any character in this class: [26]
/// 00
/// End of line or string
/// ^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
/// Beginning of line or string
/// Match expression but don't capture it. [(?:0?[1-9])|(?:1[0-2])]
/// Select from 2 alternatives
/// Match expression but don't capture it. [0?[1-9]]
/// 0?[1-9]
/// 0, zero or one repetitions
/// Any character in this class: [1-9]
/// Match expression but don't capture it. [1[0-2]]
/// 1[0-2]
/// 1
/// Any character in this class: [0-2]
/// Return
/// New line
/// [4]: A numbered capture group. [\/|-|\.]
/// Select from 3 alternatives
/// Literal /
/// -
/// Literal .
/// Match expression but don't capture it. [0?[1-9]|1\d|2[0-8]]
/// Select from 3 alternatives
/// 0?[1-9]
/// 0, zero or one repetitions
/// Any character in this class: [1-9]
/// 1\d
/// 1
/// Any digit
/// 2[0-8]
/// 2
/// Any character in this class: [0-8]
/// Backreference to capture number: 4
/// Match expression but don't capture it. [(?:1[6-9]|[2-9]\d)?\d{2}]
/// (?:1[6-9]|[2-9]\d)?\d{2}
/// Match expression but don't capture it. [1[6-9]|[2-9]\d], zero or one repetitions
/// Select from 2 alternatives
/// 1[6-9]
/// 1
/// Any character in this class: [6-9]
/// [2-9]\d
/// Any character in this class: [2-9]
/// Any digit
/// Any digit, exactly 2 repetitions
/// End of line or string
///
///
///
public static Regex regex = new Regex(
    "^(?:(?:(?:0?[13578]|1[02])(\\/|-|\\.)31)\\1|\r\n(?:(?:0?[13-9]"+
    "|1[0-2])(\\/|-|\\.)(?:29|30)\\2))\r\n(?:(?:1[6-9]|[2-9]\\d)?\\d"+
    "{2})$|^(?:0?2(\\/|-|\\.)29\\3(?:(?:(?:1[6-9]|[2-9]\\d)?(?:0["+
    "48]|[2468][048]|[13579][26])|\r\n(?:(?:16|[2468][048]|[3579][2"+
    "6])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))\r\n(\\/|-|\\.)(?:0?[1-9"+
    "]|1\\d|2[0-8])\\4(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$",
    RegexOptions.CultureInvariant
    | RegexOptions.Compiled
    );
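Answer 10: (score: 1)
I think the answer to maintaining regular expressions lies not so much in commenting or in regex constructs.
If I were tasked with debugging the example you gave, I would sit down in front of a regex debugging tool (like Regex Coach) and step through the regular expression on the data that it has to process.
Answer 11: (score: 1)
I recently posted a question about commenting regexes with embedded comments. There were some useful answers, particularly one from @mikej:
See the post by Martin Fowler on ComposedRegex for some more ideas on improving regexp readability. In summary, he advocates breaking down a complex regexp into smaller parts which can be given meaningful variable names.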
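A minimal sketch of that idea in Python. This is not Fowler's own example; the pieces and names (`MONTH`, `DAY`, `YEAR`, `SEPARATOR`, `has_date_shape`) are invented here, and the composed pattern only checks the shape of an m/d/y date, not calendar rules.

```python
import re

# Small named pieces (all names are illustrative).
MONTH = r"(?:0?[1-9]|1[0-2])"
DAY = r"(?:0?[1-9]|[12][0-9]|3[01])"
YEAR = r"(?:(?:1[6-9]|[2-9][0-9])?[0-9]{2})"
SEPARATOR = r"[/.-]"

# The separator is captured once and reused so both separators must agree.
DATE_SHAPE = re.compile(
    rf"({MONTH})(?P<sep>{SEPARATOR})({DAY})(?P=sep)({YEAR})"
)


def has_date_shape(text):
    """True if text looks like m/d/y; calendar rules are not checked."""
    return DATE_SHAPE.fullmatch(text) is not None
```

With the pieces named, allowing another separator or tightening the year range is a change to one short constant instead of an edit in the middle of a long pattern.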
Answer 12: (score: 0)
I don't expect regular expressions to be readable, so I just leave them as they are and rewrite them when needed.