Surround every matched word inside XML tags with braces

Time: 2017-08-28 15:52:12

Tags: regex

I have an HTML string like the following:

<whatevertag do-not-change-this="word" or-this-word="">
  these words should be replaced with a word inside braces,
  and also the same word thing for
  <whatevertag>
      the nested tags that has the word
  </whatevertag>
</whatevertag>

I am trying to get output like this:

<whatevertag do-not-change-this="word" or-this-word="">
  these {word}s should be replaced with a {word} inside braces,
  and also the same {word} thing for
  <whatevertag>
      the nested tags that has the {word}
  </whatevertag>
</whatevertag>

I tried the expression (>[^>]*?)(word)([^<]*?<) with the replacement $1{$2}$3. Surprisingly (at least to me), it only works for the first match, and the output is:

<whatevertag do-not-change-this="word" or-this-word="">
    these {word}s should be replaced with a word inside braces,
    and also the same word thing for
    <whatevertag>
        the nested tags that has the {word}
    </whatevertag>
</whatevertag>

Why does this happen, and how can I fix it?

1 answer:

Answer 0: (score: 2)

Your regex fails for the following reason:

(>[^>]*?)                  # read '>', then lazily any character except '>'
(word)                     # until you encounter 'word'
([^<]*?<)                  # then lazily read any character except '<' until you find a '<'

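So, as soon as 'word' has been captured, your regex keeps reading up to the first '<'. That text is consumed by the match, which is why the second 'word' is not captured.

What you can use instead is: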
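
A minimal way to reproduce this, sketched with Python's re module. That is an assumption, since the question does not name a regex flavor; Python writes the replacement as \g<1>{\g<2>}\g<3> rather than $1{$2}$3. The variable names are illustrative:

```python
import re

# Sample input from the question.
html = (
    '<whatevertag do-not-change-this="word" or-this-word="">\n'
    '  these words should be replaced with a word inside braces,\n'
    '  and also the same word thing for\n'
    '  <whatevertag>\n'
    '      the nested tags that has the word\n'
    '  </whatevertag>\n'
    '</whatevertag>\n'
)

# The pattern from the question, unchanged.
pattern = re.compile(r'(>[^>]*?)(word)([^<]*?<)')

# subn returns the new string and the number of replacements made.
result, count = pattern.subn(r'\g<1>{\g<2>}\g<3>', html)
print(count)
print(result)
```

Each match runs from a '>' to the next '<', so it swallows a whole text segment, and only the first 'word' in that segment ends up in group 2.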

(?:(?!word).)+(word)

Explanation:

(?:                         # Do not capture
(?!word).)+                 # Negative lookahead for word. Read 1 char
(word)                      # until you find 'word'

See the example.
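
To see what this pattern finds on the sample input, here is a small Python sketch (Python's re is an assumption about the flavor, and the names are illustrative). Since . does not match a newline by default, each match stays on one line, and group 1 marks where 'word' sits:

```python
import re

# Sample input from the question.
html = (
    '<whatevertag do-not-change-this="word" or-this-word="">\n'
    '  these words should be replaced with a word inside braces,\n'
    '  and also the same word thing for\n'
    '  <whatevertag>\n'
    '      the nested tags that has the word\n'
    '  </whatevertag>\n'
    '</whatevertag>\n'
)

pattern = re.compile(r'(?:(?!word).)+(word)')

# (start, end) of group 1, the literal 'word', for every match.
spans = [m.span(1) for m in pattern.finditer(html)]

# The first '>' closes the opening tag, so anything before it is an attribute.
tag_end = html.index('>')
in_first_tag = [span for span in spans if span[0] < tag_end]

print(len(spans), len(in_first_tag))
```

On this input the pattern finds every occurrence of 'word', including the two inside the attributes of the opening tag.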

Edit: re-reading your question, you explicitly said you want to capture everything that is outside the tags. Take a look at: example 2

The regex is:

((?!word)[^>])+(word)([^<]+) # read all characters, except 
                             # '>' until you encounter 'word'
                             # read 'word'
                             # capture all following characters, except '<'
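
A quick check in Python (again an assumption about the flavor) suggests a caveat: on the sample input, the first match of this pattern begins at the opening '<', so group 2 lands on the attribute value. One common alternative, which is not the answer's regex, is to match whole tags and the word in a single alternation and wrap only the word branch. The helper name wrap_outside_tags is hypothetical, and the sketch assumes no attribute value contains a literal '>':

```python
import re

# Sample input from the question.
html = (
    '<whatevertag do-not-change-this="word" or-this-word="">\n'
    '  these words should be replaced with a word inside braces,\n'
    '  and also the same word thing for\n'
    '  <whatevertag>\n'
    '      the nested tags that has the word\n'
    '  </whatevertag>\n'
    '</whatevertag>\n'
)

# The pattern from the answer: where does its first match put group 2?
answer_pattern = re.compile(r'((?!word)[^>])+(word)([^<]+)')
first = answer_pattern.search(html)
# True means group 2 starts before the '>' that closes the opening tag.
group2_in_tag = first.start(2) < html.index('>')


def wrap_outside_tags(text, word):
    """Wrap `word` in braces everywhere except inside <...> tags (hypothetical helper)."""
    # Branch 1 consumes a whole tag; branch 2 matches the word in text.
    pattern = re.compile(r'(<[^>]*>)|' + re.escape(word))

    def repl(m):
        # A tag is returned unchanged; otherwise the matched word is wrapped.
        if m.group(1) is not None:
            return m.group(1)
        return '{' + m.group(0) + '}'

    return pattern.sub(repl, text)


result = wrap_outside_tags(html, 'word')
print(group2_in_tag)
print(result)
```

Because the tag branch is tried first at every '<', attribute text is never seen by the word branch. For anything beyond simple markup, an HTML or XML parser is probably a sturdier choice than a regex.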