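I need to parse XML files that violate the "no double hyphens" rule for comments, which makes MSXML complain. I am looking for a way to remove the offending hyphens.
I am using StringRegExpReplace() and tried the following regular expressions: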
<!--(.*)--> : correctly gets comments
<!--(-*)--> : fails to be a correct regex (also tried escaping and using \x2D)
Given a correct pattern, I would call:
StringRegExpReplace($xml_string,$correct_pattern,"") ;replace with nothing
How can I match the extra hyphens inside XML comments while preserving the rest of the text?
Answer 0 (score: 4)
You can use this pattern:
(?|\G(?!\A)(?|-{2,}+([^->][^-]*)|(-[^-]+)|-+(?=-->)|-->[^<]*(*SKIP)(*FAIL))|[^<]*<+(?>[^<]+<+)*?(?:!--\K|[^<]*\z\K(*ACCEPT))(?|-*+([^->][^-]*)|-+(?=-->)|-?+([^-]+)|-->[^<]*(*SKIP)(*FAIL)()))
Details:
(?|
    \G(?!\A)                      # contiguous to the previous match (inside a comment)
    (?|
        -{2,}+([^->][^-]*)        # duplicate hyphens, not part of the closing sequence
      |
        (-[^-]+)                  # preserve isolated hyphens
      |
        -+ (?=-->)                # hyphens before closing sequence, break contiguity
      |
        -->[^<]*                  # closing sequence, go to next <
        (*SKIP)(*FAIL)            # break contiguity
    )
  |
    [^<]*<+                       # reach the next < (outside comment)
    (?> [^<]+ <+ )*?              # next < until !-- or the end of the string
    (?: !-- \K | [^<]*\z\K (*ACCEPT) ) # new comment or end of the string
    (?|
        -*+ ([^->][^-]*)          # possible hyphens not followed by >
      |
        -+ (?=-->)                # hyphens before closing sequence, break contiguity
      |
        -?+ ([^-]+)               # one hyphen followed by >
      |
        -->[^<]*                  # closing sequence, go to next <
        (*SKIP)(*FAIL) ()         # break contiguity (note: "()" avoids a mysterious bug
    )                             # in regex101, you can remove it)
)
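Use this replacement: \1
The \G anchor ensures that matches are contiguous.
Two things are used to break contiguity:
- a lookahead: (?=-->)
- the backtracking control verbs (*SKIP)(*FAIL), which force the pattern to fail and prevent all previously matched characters from being retried.
So when contiguity is broken, or at the start of the string, the first main branch fails (because of the \G anchor) and the second branch is used.
\K removes everything to its left from the match result.
(*ACCEPT) makes the pattern succeed unconditionally.
This pattern makes heavy use of the branch reset feature (?|...(..)...|...(..)...|...), so all capturing groups share the same number (in other words, there is only one group, group 1).
Note: although this pattern is long, it needs only a few steps to obtain a match. The impact of the non-greedy quantifier is kept as small as possible, and the alternatives are ordered to be as efficient as possible. One of the goals is to reduce the total number of matches needed to process a string.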
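Not every regex engine supports \G, \K, branch reset or backtracking verbs. As a rough illustration of the same goal, here is a two-pass sketch in Python (not AutoIt, and not the pattern above): it first isolates each comment, then removes runs of two or more hyphens from the comment body and drops hyphens sitting right before the closing sequence. The helper names `strip_comment_hyphens` and `_clean_comment` are made up for this example, and the sketch assumes comments are well delimited by `<!--` and `-->`.

```python
import re

# Pass 1: isolate each comment; DOTALL lets the body span several lines.
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)
# Pass 2: runs of two or more hyphens inside a comment body.
HYPHEN_RUN = re.compile(r"-{2,}")


def _clean_comment(match):
    """Rebuild one comment without double hyphens in its body."""
    body = HYPHEN_RUN.sub("", match.group(1))
    # A body ending in "-" would recreate "--->" once the comment is closed.
    body = body.rstrip("-")
    return "<!--" + body + "-->"


def strip_comment_hyphens(xml_text):
    """Remove offending hyphens inside comments; text outside is untouched."""
    return COMMENT.sub(_clean_comment, xml_text)


if __name__ == "__main__":
    print(strip_comment_hyphens("<r><!-- a -- b ---></r> x -- y"))
```

Isolated single hyphens inside a comment are kept, and hyphens outside comments are never touched, because the second substitution only ever sees a comment body.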
Answer 1 (score: 3)
(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)
matches -- (or ---- etc.) only between <!-- and -->. You need to set the /s option so that the dot also matches newlines.
Explanation:
(?<!<!)      # Assert that we're not right at the start of a comment
--+          # Match two or more dashes --
(?=          # only if the following can be matched further onwards:
 (?!-?>)     # First, make sure we're not at the end of the comment.
 (?:         # Then match the following group
  (?!-->)    # which must not contain -->
  .          # but may contain any character
 )*          # any number of times
 -->         # as long as --> follows.
)            # End of lookahead assertion.
I think the correct AutoIt syntax is
StringRegExpReplace($xml_string, "(?s)(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)", "")
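To try the pattern outside AutoIt, here is a minimal sketch in Python, whose `re` module supports every construct this regex uses. The function name `strip_double_hyphens` is made up for the example, and `re.DOTALL` plays the role of `(?s)`.

```python
import re

# Same pattern as in the AutoIt call; DOTALL is the equivalent of (?s).
DOUBLE_HYPHENS = re.compile(r"(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)", re.DOTALL)


def strip_double_hyphens(xml_text):
    """Delete runs of two or more hyphens that are followed later by -->."""
    return DOUBLE_HYPHENS.sub("", xml_text)


if __name__ == "__main__":
    print(strip_double_hyphens("<a><!-- x -- y ---- z --></a>"))
```

One caveat when experimenting: the lookahead only checks that a `-->` occurs somewhere later in the string, so a `--` in ordinary text that precedes a comment is removed as well. Text after the last comment is left alone.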