What does this complicated regular expression do?

Time: 2011-06-19 07:06:36

Tags: regex perl

I'm having a hard time figuring this one out:

( $dwg, $rev, $rest ) = ($file =~ /^(\d{3}[_-][\w\d]{3}[_-]\d{3,4}(?:[_-]\d{3,4})?)(?:[_ -]\w)?[_ ]{0,5}[rR](?:[eE][vV])?(?:\.)? ?([\w\d-]?) *(.*)/);

4 Answers:

Answer 0 (score: 12)

YAPE::Regex::Explain is a module that accepts any regular expression as input and outputs an explanation of what the regex does. Here is an example:

use Modern::Perl;
use YAPE::Regex::Explain;

my $re = qr/^(\d{3}[_-][\w\d]{3}[_-]\d{3,4}(?:[_-]\d{3,4})?)(?:[_ -]\w)?[_ ]{0,5}[rR](?:[eE][vV])?(?:\.)? ?([\w\d-]?) *(.*)/;

say YAPE::Regex::Explain->new($re)->explain();

Here is the output:

The regular expression:

(?-imsx:^(\d{3}[_-][\w\d]{3}[_-]\d{3,4}(?:[_-]\d{3,4})?)(?:[_ -]\w)?[_ ]{0,5}[rR](?:[eE][vV])?(?:\.)? ?([\w\d-]?) *(.*))

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    \d{3}                    digits (0-9) (3 times)
----------------------------------------------------------------------
    [_-]                     any character of: '_', '-'
----------------------------------------------------------------------
    [\w\d]{3}                any character of: word characters (a-z,
                             A-Z, 0-9, _), digits (0-9) (3 times)
----------------------------------------------------------------------
    [_-]                     any character of: '_', '-'
----------------------------------------------------------------------
    \d{3,4}                  digits (0-9) (between 3 and 4 times
                             (matching the most amount possible))
----------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
----------------------------------------------------------------------
      [_-]                     any character of: '_', '-'
----------------------------------------------------------------------
      \d{3,4}                  digits (0-9) (between 3 and 4 times
                               (matching the most amount possible))
----------------------------------------------------------------------
    )?                       end of grouping
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [_ -]                    any character of: '_', ' ', '-'
----------------------------------------------------------------------
    \w                       word characters (a-z, A-Z, 0-9, _)
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
  [_ ]{0,5}                any character of: '_', ' ' (between 0 and
                           5 times (matching the most amount
                           possible))
----------------------------------------------------------------------
  [rR]                     any character of: 'r', 'R'
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [eE]                     any character of: 'e', 'E'
----------------------------------------------------------------------
    [vV]                     any character of: 'v', 'V'
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    \.                       '.'
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
   ?                       ' ' (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    [\w\d-]?                 any character of: word characters (a-z,
                             A-Z, 0-9, _), digits (0-9), '-'
                             (optional (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
   *                       ' ' (0 or more times (matching the most
                           amount possible))
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

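One thing that often makes it easier to decipher a regex without an external tool is to put the /x modifier at the end of the regex, which permits mostly free-form whitespace inside it. The /x modifier lets you insert whitespace, including newlines and tabs, into the regex without changing what the expression does. That helps you group the parts of the regex visually. Of course, this does not work well if the RE contains significant embedded whitespace; in that unusual case you would end up changing the meaning of the expression. But for any normal regex, the /x modifier is the first step toward breaking it into meaningful clusters.

For example, I might start breaking this regex up like this: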

m/^
    (
        \d{3} [_-] [\w\d]{3} [_-] \d{3,4}
        (?:
            [_-] \d{3,4}
        )?
    )
    # ......and so on.
/x

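For me, doing this helps to better visualize what is going on. You can read about regular expressions in the following PODs: perlrequick (a quick-start guide), perlretut (a more in-depth tutorial), perlre (the authoritative source), and perlop. But nothing is quite as helpful as Jeffrey Friedl's masterpiece, "Mastering Regular Expressions" (O'Reilly, currently in its 3rd edition).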
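To make the caveat about significant whitespace concrete, here is a minimal sketch. The string 'rev A' and the variable names are made up for illustration; the point is only that an unescaped space in the pattern stops matching a space once /x is added, while \x20 keeps matching one.

```perl
use strict;
use warnings;

# Hypothetical input, chosen only to contain one meaningful space.
my $text = 'rev A';

# Without /x the space in the pattern is a literal space.
my $plain = $text =~ /rev A/ ? 1 : 0;

# With /x the unescaped space is ignored, so the pattern is really /revA/.
my $extended = $text =~ /rev A/x ? 1 : 0;

# Writing the space as \x20 keeps it significant under /x.
my $escaped = $text =~ /rev \x20 A/x ? 1 : 0;

print "plain=$plain extended=$extended escaped=$escaped\n";
# prints: plain=1 extended=0 escaped=1
```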

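Note: I noticed that this RE seems to have an embedded space near the end. It would be more clearly written as \x20, and changing it that way would make the /x modifier safe to use.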
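As a sketch of where that leads, the whole pattern from the question could be laid out under /x roughly as follows. Spaces inside a bracketed character class stay literal under plain /x, so only the two bare spaces (before capture 2 and before capture 3) are rewritten as \x20. The sample file names are invented for the comparison and are not from the question.

```perl
use strict;
use warnings;

# The original one-line pattern from the question.
my $original = qr/^(\d{3}[_-][\w\d]{3}[_-]\d{3,4}(?:[_-]\d{3,4})?)(?:[_ -]\w)?[_ ]{0,5}[rR](?:[eE][vV])?(?:\.)? ?([\w\d-]?) *(.*)/;

# The same pattern spread out with the x modifier.
my $expanded = qr/
    ^
    (                           # capture 1
        \d{3} [_-] [\w\d]{3} [_-] \d{3,4}
        (?: [_-] \d{3,4} )?
    )
    (?: [_ -] \w )?
    [_ ]{0,5}
    [rR] (?: [eE][vV] )?
    (?: \. )?
    \x20?
    ( [\w\d-]? )                # capture 2
    \x20*
    ( .* )                      # capture 3
/x;

# Hypothetical file names, used only to compare the two patterns.
my @samples = (
    '123-ABC-4567 Rev. B final drawing',
    '001_x9z_123_4567_r2',
    'not a drawing number',
);

my $same = 1;
for my $file (@samples) {
    my @a = $file =~ $original;
    my @b = $file =~ $expanded;
    $same = 0 unless "@a" eq "@b";
    print "$file => [", join( '|', @b ), "]\n";
}
print $same ? "patterns agree\n" : "patterns differ\n";
```

Agreement on a few samples is not a proof of equivalence, but it is a cheap check to run after reformatting a pattern by hand.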

Answer 1 (score: 11)

Here is the explanation:

^                   : beginning of string
(                   : start group 1; it populates $dwg
    \d{3}           : 3 digits
    [_-]            : _ or - character
    [\w\d]{3}       : 3 word chars (alphanumeric or _), could be abbreviated as \w{3}
    [_-]            : _ or - character
    \d{3,4}         : 3 or 4 digits
    (?:             : start NON capture group
        [_-]        : _ or - character
        \d{3,4}     : 3 or 4 digits
    )?              : end of non capture group, optional
)                   : end of group 1
(?:                 : start NON capture group
    [_ -]           : _ or space or - character
    \w              : 1 word char
)?                  : end of non capture group, optional
[_ ]{0,5}           : 0 to 5 _ or space chars
[rR]                : r or R
(?:                 : start NON capture group
    [eE]            : e or E
    [vV]            : v or V
)?                  : end of non capture group, optional
(?:\.)?             : a dot, not captured, optional
 ?                  : optional space
([\w\d-]?)          : group 2, 1 word char or -, could be [\w-]; populates $rev
 *                  : 0 or more spaces
(.*)                : any number of any char but linefeed; populates $rest
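To see the breakdown above in action, here is a small sketch that runs the pattern against one file name. The file name is hypothetical (the question does not show any real input), so treat the result as an illustration of the rules rather than of the asker's data.

```perl
use strict;
use warnings;

# The pattern from the question, unchanged.
my $re = qr/^(\d{3}[_-][\w\d]{3}[_-]\d{3,4}(?:[_-]\d{3,4})?)(?:[_ -]\w)?[_ ]{0,5}[rR](?:[eE][vV])?(?:\.)? ?([\w\d-]?) *(.*)/;

# Hypothetical file name, made up for illustration.
my $file = '123-ABC-4567_Rev.B some notes';

# A match in list context returns the three captures in order.
my ( $dwg, $rev, $rest ) = ( $file =~ $re );

print "dwg=$dwg rev=$rev rest=$rest\n";
# prints: dwg=123-ABC-4567 rev=B rest=some notes
```

Here the optional `(?:[_ -]\w)?` group first tries to consume "_R", fails to find the required r or R afterwards, and backtracks so that `[_ ]{0,5}` takes the underscore instead.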

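Answer 2 (score: 8)

It appears to extract a date $dwg, a revision $rev, and a suffix $rest from a filename. Broadly speaking, the date may have up to four parts separated by underscores or hyphens, the revision is a single word character or dash prefixed by "r" or "rev" (in upper or lower case), and the suffix comprises everything that follows the revision, after any spaces. It is rather messy and looks like it is trying to account for many subtly different cases at once.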

^                  # After the start of the string,
(                  # $dwg gets
    \d{3}          # three digits,
    [_-]           # a separator,
    [\w\d]{3}      # three word characters,
    [_-]           # another separator,
    \d{3,4}        # three or four digits,
    (?:            # and
        [_-]       # a separator and
        \d{3,4}    # three or four more digits
    )?             # which are optional.
)
(?:                # Next,
    [_ -]          # another separator,
    \w             # followed by a word character,
)?                 # also optional;
[_ ]{0,5}          # a separator up to five characters long,
[rR]               # then "R" or "r",
(?:
    [eE]           # or "rev" in any mix of case,
    [vV]
)?                 # optionally;
(?:
    \.             # a dot,
)?                 # which too is optional;
 ?                 # and an optional space.
(                  # $rev gets
    [\w\d-]?       # an optional word character or dash.
)
 *                 # Any number of spaces later,
(.*)               # $rest gets the rest.

Answer 3 (score: 7)

It is just a complicated regular expression that puts three capture groups from $file into $dwg, $rev, and $rest.

Although the regex is complicated, it does not use very complex rules, except perhaps (?:something), which is a non-capturing group.
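A minimal sketch of what "non-capturing" means in practice, using a cut-down fragment of the pattern rather than the full thing (the fragment and the input 'Rev.B' are chosen only for illustration): both kinds of parentheses group the optional "ev", but only plain parentheses add an entry to the list of captures.

```perl
use strict;
use warnings;

# Hypothetical input for the demonstration.
my $text = 'Rev.B';

# Capturing parentheses: the optional "ev" becomes capture 1.
my @with_capture = $text =~ /^[rR]([eE][vV])?\.?(\w)/;

# Non-capturing (?:...): same grouping for the ?, but nothing is saved.
my @without_capture = $text =~ /^[rR](?:[eE][vV])?\.?(\w)/;

print scalar(@with_capture),    " captures: @with_capture\n";
print scalar(@without_capture), " capture: @without_capture\n";
# prints: 2 captures: ev B
# prints: 1 capture: B
```

That is why the original pattern can use several groups for optional pieces and still fill exactly three variables.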

For example, see this for an introduction to Perl regular expressions.