Why is my regular expression so lazy?

Time: 2012-03-31 16:48:26

Tags: regex

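Why is this regular expression so lazy? It should return a quoted height/width attribute, whatever lies in between (optional), and then another height/width attribute (optional). It only grabs the first attribute and then quits, even though it could match more.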

((?:height|width)=["']\d*["'])([\s\w:;'"=])*?((?:height|width)=["']\d*["'])?

sample code on regexpal

1 Answer:

Answer 0: (score: 6)

The easiest way to see what is going on is to break the expression out into extended format. In extended format, your regular expression...

((?:height|width)=["']\d*["'])([\s\w:;'"=])*?((?:height|width)=["']\d*["'])?

then becomes (with comments, which are legal in extended format):

(                     # a group that captures...
    (?:height|width)  # Height or width
    =                 # The Equals sign
    ["']              # a double quote or quote
    \d*               # zero or more digits 0-9
    ["']              # a double quote or quote
)                     # required
(                     # zero or more groups that capture...space chars,
    [\s\w:;'"=]       # letters, numbers, colon, semicolon, quote, double quote, and equals
)*?                   # zero or more times, lazily (matching as few as it can)
(                     # a group that...
    (?:height|width)  # height or width
    =                 # equals sign
    ["']              # double quote or quote
    \d*               # zero or more numbers
    ["']              # double quote or quote
)?                    # optionally
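
To try the extended form yourself, here is a minimal sketch in Python. That choice is an assumption, since the question does not name an engine; SAMPLE and LAZY_PATTERN are made-up names and the sample string is invented for illustration.

```python
import re

# The question's pattern in extended form. With re.VERBOSE the engine
# ignores whitespace and "#" comments that sit outside character classes.
LAZY_PATTERN = re.compile(r"""
    ((?:height|width)=["']\d*["'])   # group 1: first attribute (required)
    ([\s\w:;'"=])*?                  # group 2: filler characters, lazily
    ((?:height|width)=["']\d*["'])?  # group 3: second attribute (optional)
""", re.VERBOSE)

# Hypothetical input with two attributes separated by a space.
SAMPLE = 'height="100" width="200"'

match = LAZY_PATTERN.search(SAMPLE)
print(match.group(0))  # the whole match
print(match.group(3))  # the optional trailing attribute, if captured
```

In this run the match stops after the first attribute. The lazy group starts by matching nothing, and because the final group is optional the overall match already succeeds at that point, so the engine has no reason to go further.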

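So your regular expression may produce as few as 1 group and as many as N groups, depending on the regex engine you are using. The last group will be the one you want, if it is there. Remove the lazy modifier (the ?) from the second group and make that group non-capturing, like so: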

(                     # a group that captures...
    (?:height|width)  # height or width (non-capturing)
    =                 # The Equals sign
    ["']              # a double quote or quote
    \d*               # zero or more digits 0-9
    ["']              # a double quote or quote
)                     # required
(?:                   # zero or more groups of space chars, letters,
    [\s\w:;'"=]       # numbers, colon, semicolon, quote, double quote, and equals
)*                    # zero or more times as much as it can UNTIL...
(                     # a group that captures...
    (?:height|width)  # height or width (non-capturing)
    =                 # equals sign
    ["']              # double quote or quote
    \d*               # zero or more numbers
    ["']              # double quote or quote
)?                    # optional

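The intent is that the first and last attributes now land in groups 1 and 2 respectively, with whatever sits between them ignored. If there is a last one, it should be captured.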
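
Whether the greedy version really fills group 2 depends on the input and the engine, so it is worth checking. The sketch below assumes Python's re module and an invented sample string. It runs the greedy pattern and then a variant that is my own suggestion, not part of the answer: the middle stays lazy, but it is wrapped together with the final attribute in an optional non-capturing group. In this test the greedy middle swallows the second attribute, because every character of width="200" is in its character class, while the wrapped variant captures it.

```python
import re

# Building blocks taken from the question's pattern.
ATTR = r"""(?:height|width)=["']\d*["']"""
FILLER = r"""[\s\w:;'"=]"""

# The suggestion above: greedy non-capturing middle, optional last group.
greedy = re.compile(rf"({ATTR})(?:{FILLER})*({ATTR})?")

# Hypothetical alternative (an assumption, not from the answer): the lazy
# filler and the second attribute are optional only as a pair, so the
# engine must extend the filler until the second attribute matches.
wrapped = re.compile(rf"({ATTR})(?:{FILLER}*?({ATTR}))?")

sample = 'height="100" width="200"'  # invented test input

print(greedy.search(sample).groups())
print(wrapped.search(sample).groups())
print(wrapped.search('height="100"').groups())
```

Other engines may behave differently, so treat this as a quick experiment rather than a verdict on the pattern.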

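Note: it may not be capturing the last part because the middle group's character class does not list every character it needs to get across. A comma, a #, or any other punctuation character is not covered by that class. You might consider replacing the middle part with the following: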

    ["']              # a double quote or quote
)                     # required
.*                    # Anything, zero or more times, UNTIL...
(                     # a group that...
    (?:height|width)  # height or width (non-capturing)

and see whether that DOES match. If it does, you probably need to extend the middle group's character class further.
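
One way to find out which characters the middle class is missing is to test them one by one. This is a small sketch assuming Python; the helper name uncovered_chars and the sample string are hypothetical.

```python
import re

# The middle group's character class from the pattern above.
MIDDLE_CLASS = re.compile(r"""[\s\w:;'"=]""")


def uncovered_chars(text):
    """Return the sorted distinct characters that the class does not match."""
    return sorted({ch for ch in text if not MIDDLE_CLASS.fullmatch(ch)})


# Invented input with punctuation between the two attributes.
sample = 'height="100" style="margin: 0, auto; color: #fff" width="200"'
print(uncovered_chars(sample))
```

Any character this reports is one the middle group cannot step over, so it is a candidate for adding to the class.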

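If you don't care how many matches occur in the middle group and just want to capture it, use a non-capturing group for each repetition and then one group to capture the whole middle run: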

    ["']              # a double quote or quote
)                     # required
(                     # a group that captures...
    (?:               # zero or more groups of space chars, letters,
        [\s\w:;'"=]   # numbers, colon, semicolon, quote, double quote, and equals
    )*                # zero or more times as much as it can
)                     # UNTIL...
(                     # a group that captures...
    (?:height|width)  # height or width (non-capturing)

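Now you get a fixed number of captures: the first part is always in group 1, the middle part is always in group 2, and the last part (if it is there) is in group 3.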