Question

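I need to parse XML files that do not conform to the "no double hyphens in comments" rule, which makes MSXML complain. I'm looking for a way to remove the offending hyphens.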

I'm using StringRegExpReplace(). I tried the following regular expressions:

<!--(.*)--> : correctly gets comments
<!--(-*)--> : fails to be a correct regex (also tried escaping and using \x2D)

Given the correct pattern, I would call:

StringRegExpReplace($xml_string,$correct_pattern,"") ;replace with nothing

How can I match the extra hyphens inside XML comments while leaving the rest of the text intact?

Answer 1

You can use this pattern:

(?|\G(?!\A)(?|-{2,}+([^->][^-]*)|(-[^-]+)|-+(?=-->)|-->[^<]*(*SKIP)(*FAIL))|[^<]*<+(?>[^<]+<+)*?(?:!--\K|[^<]*\z\K(*ACCEPT))(?|-*+([^->][^-]*)|-+(?=-->)|-?+([^-]+)|-->[^<]*(*SKIP)(*FAIL)()))

Details:

(?| 
    \G(?!\A) # contiguous to the previous match (inside a comment)

    (?|
        -{2,}+([^->][^-]*) # duplicate hyphens, not part of the closing sequence
      |
         (-[^-]+)          # preserve isolated hyphens 
      |
         -+ (?=-->)        # hyphens before closing sequence, break contiguity
      |
         -->[^<]*          # closing sequence, go to next <
         (*SKIP)(*FAIL)    # break contiguity
    )
  |
    [^<]*<+ # reach the next < (outside comment)
    (?> [^<]+ <+ )*?       # next < until !-- or the end of the string 
    (?: !-- \K | [^<]*\z\K (*ACCEPT) ) # new comment or end of the string
    (?|
        -*+ ([^->][^-]*)   # possible hyphens not followed by >
      |
        -+ (?=-->)         # hyphens before closing sequence, break contiguity
      |
        -?+ ([^-]+)        # one hyphen followed by >
      |
        -->[^<]*           # closing sequence, go to next <
        (*SKIP)(*FAIL) ()  # break contiguity (note: "()" avoids a mysterious bug
    )                      # in regex101, you can remove it)
)

With this replacement: \1

online demo

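The \G anchor ensures that matches are contiguous. There are two ways to break contiguity:

a lookahead (?=-->)
the backtracking control verbs (*SKIP)(*FAIL), which force the pattern to fail and prevent the characters matched so far from being retried.

So when contiguity is broken, or at the very start of the string, the first main branch fails (because of the \G anchor) and the second branch is used.

\K removes everything to its left from the match result.

(*ACCEPT) makes the pattern succeed unconditionally.

This pattern makes heavy use of the branch reset feature (?|...(..)...|...(..)...|...), so all capture groups share the same number (in other words, there is only one group, group 1).

Note: even though the pattern is long, it needs only a few steps to obtain a match. The impact of non-greedy quantifiers is kept as small as possible, and the alternatives are ordered and written to be as efficient as possible. One of the goals is to reduce the total number of matches needed to process the string.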

Answer 2

(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)

This matches -- (or ----, etc.) only between <!-- and -->. You need to set the /s option so that the dot matches newlines.

Explanation

(?<!<!)   # Assert that we're not right at the start of a comment
--+       # Match two or more dashes --
(?=       # only if the following can be matched further onwards:
 (?!-?>)  # First, make sure we're not at the end of the comment.
 (?:      # Then match the following group
  (?!-->) # which must not contain -->
  .       # but may contain any character
 )*       # any number of times
 -->      # as long as --> follows.
)         # End of lookahead assertion.
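The commented layout above places (?!-?>) inside the lookahead, while the one-line pattern places it just before the lookahead. Both forms test the same position, so they should behave the same. A minimal Python sketch to check this, assuming Python's re module as a stand-in for the PCRE engine behind AutoIt (the sample string is made up for illustration):

```python
import re

# Compact one-line pattern from the answer; DOTALL plays the role of /s.
COMPACT = re.compile(r"(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)", re.DOTALL)

# The commented layout, compiled in verbose mode so that whitespace and
# "#" comments inside the pattern are ignored.
VERBOSE = re.compile(r"""
    (?<!<!)     # not right at the start of a comment
    --+         # two or more dashes
    (?=         # only if the following can be matched further on:
      (?!-?>)   #   we are not at the end of the comment
      (?:
        (?!-->) #   no closing sequence at this position
        .       #   any character, newlines included under DOTALL
      )*
      -->       #   a closing sequence eventually follows
    )
""", re.VERBOSE | re.DOTALL)

sample = "<a><!-- foo -- bar ---- baz --></a>"  # hypothetical input

compact_result = COMPACT.sub("", sample)
verbose_result = VERBOSE.sub("", sample)

print(compact_result)                    # <a><!-- foo  bar  baz --></a>
print(verbose_result == compact_result)  # True
```

The opening <!-- and the closing --> are left alone, and only the runs of two or more hyphens inside the comment are removed.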

Test it live on regex101.com.

I think the correct AutoIt syntax would be

StringRegExpReplace($xml_string, "(?s)(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)", "")
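To see what the (?s) flag changes, and where this pattern reaches further than intended, here is a small Python sketch. It assumes Python's re module behaves like AutoIt's PCRE engine for these constructs; strip_double_hyphens and the sample strings are hypothetical names and data chosen for illustration.

```python
import re

# Same pattern as in the AutoIt call, with the inline (?s) flag.
PATTERN = re.compile(r"(?s)(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)")


def strip_double_hyphens(xml_string: str) -> str:
    """Remove runs of two or more hyphens that are followed later by -->."""
    return PATTERN.sub("", xml_string)


# A comment that spans two lines.
multi_line = "<!-- line one --\nline two -->"
cleaned = strip_double_hyphens(multi_line)

# Without (?s) the dot stops at the newline, so the lookahead cannot
# reach the closing --> and nothing is replaced.
no_dotall = re.sub(r"(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)", "", multi_line)

# Caveat: the lookahead only checks that some --> follows. A double
# hyphen in ordinary text located before a later comment matches too.
outside = strip_double_hyphens("<b>x--y</b><!-- ok -->")

print(repr(cleaned))    # '<!-- line one \nline two -->'
print(repr(no_dotall))  # '<!-- line one --\nline two -->'
print(outside)          # <b>xy</b><!-- ok -->
```

If the documents can contain -- in text or attribute values ahead of a comment, the last case is worth testing against real input before relying on the replacement.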

Matching double hyphens in comments of malformed XML
