有人可以帮我理解这个用于匹配HTML中src
标记的img
属性的正则表达式吗?
src=(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))
src= this is easy
(?:(['""])(?<src>(?:(?!\1).)*) ?: is unknown (['""]) matches either single or double quotes, followed by a named group "src" that matches unknown strings
\1 unknown
| "or"
(?<src>[^\s>]+)) named group "src" matches one or more of line start or whitespace
简要说明?:
是什么意思?
所以(?:...)
是常规括号的非捕获版本。匹配括号内的正则表达式,但在执行匹配后或在模式中稍后引用时,无法检索组匹配的子字符串。
谢谢@mbratch
\ 1是什么意思?
最后,感叹号在这里有什么特别的意义吗? (否定?)
答案 0 :(得分:3)
答案 1 :(得分:2)
举个例子,考虑src="img.jpg"
作为我们正在解析的文本
在正则表达式中,\1
指的是第一个捕获组。在这种特殊情况下,第一个捕获组是(['""])
。部分(?:(['""])(?<src>(?:(?!\1).)*)
是一个非捕获组,在我们的示例中与"img.jpg
匹配。特别是,(['""])
匹配任何引号字符。然后(?!\1)
是第一组中匹配的引号字符的负前瞻,因此(?:(?!\1).)
匹配任何不是第一组匹配的引号字符且(?<src>(?:(?!\1).)*)
匹配的字符,命名捕获组,在结束引号字符之前的一系列字符。然后,以下\1
匹配结束引号字符。
答案 2 :(得分:2)
src= # matches literal "src="
(?: # the ?: suppresses capturing. generally a good practice if capturing
# is not explicitly necessary
(['"]) # matches either ' or ", and captures what was matched in group 1
# (because this is the first set of parentheses where capturing is not
# suppressed)
(?<src> # start another (named) capturing group with the name "src"
(?: # start non-capturing group
(?!\1)
# a negative lookahead, if its contents match, the lookahead causes the
# pattern to fail
# the \1 is a backreference and matches what was matched in capturing
# group no. 1
.)* # match any character, end of non-capturing group, repeat
# summary of this non-capturing group: for each character, check that
# it is not the kind of quote we matched at the start. if it's not,
# then consume it. repeat as long as possible.
) # end of capturing group "src"
\1 # again a backreference to what was matched inside capturing group 1
# i.e. match the same kind of quote that started the attribute value
| # or
(?<src> # again a capturing group with the name "src"
[^\s>]+
# match as many non-space, non-> character as possible (at least one)
) # end of capturing group. this case treats unquoted attribute values.
) # end of non-capturing group (which was used to group the alternation)
进一步阅读:
如果您想稍微刷新一下您的正则表达式知识,我建议您阅读整个教程。这绝对物有所值。
需要更多资源来帮助理解复杂的表达式:
答案 3 :(得分:1)
1&gt;它首先捕获组1中的['""]
中的任何1个,即(['""])
2&gt;然后它匹配0到多个字符,这不是第1组中捕获的字符,即(?:(?!\1).)*
3&gt;它执行第2步,直到它与第1组中捕获的匹配,即\1
以上3个步骤与(['""])[^\1]*\1
或强>
1&gt;它匹配所有非空格,&gt; src=
之后的字符,即[^\s>]+
注意强>
我会使用src=(['""]).*?\1
.*
贪婪,尽可能匹配..
.*?
是懒惰的,它尽可能地匹配..
例如,请考虑此字符串hello hi world
表示正则表达式^h.*l
输出为hello hi worl
表示正则表达式^h.*?l
输出为hel
答案 4 :(得分:1)
我使用RegexBuddy获取此输出:
Match the characters “src=” literally «src=»
Match the regular expression below «(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))»
Match either the regular expression below (attempting the next alternative only if this one fails) «(['""])(?<src>(?:(?!\1).)*)\1»
Match the regular expression below and capture its match into backreference number 1 «(['""])»
Match a single character present in the list “'"” «['""]»
Match the regular expression below and capture its match into backreference with name “src” «(?<src>(?:(?!\1).)*)»
Match the regular expression below «(?:(?!\1).)*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!\1)»
Match the same text as most recently matched by capturing group number 1 «\1»
Match any single character that is not a line break character «.»
Match the same text as most recently matched by capturing group number 1 «\1»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «(?<src>[^\s>]+)»
Match the regular expression below and capture its match into backreference with name “src” «(?<src>[^\s>]+)»
Match a single character NOT present in the list below «[^\s>]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
A whitespace character (spaces, tabs, line breaks, etc.) «\s»
The character “>” «>»
这个正则表达式对你所描述的非常糟糕。 src="
是有效的输入。