Question

I know that using regular expressions to parse HTML is usually a non-starter, but I don't want anything clever...

Take this as an example:

<div><!--<b>Test</b>-->Test</div>
<div><!--<b>Test2</b>-->Test2</div>

I want to remove everything that is not between <!-- and -->, leaving:

<b>Test</b><b>Test2</b>

The comment delimiters are guaranteed to be correctly matched (no unclosed or nested comments).

What regular expression do I need?

Answer 1

Replace the pattern:

(?s)((?!-->).)*<!--|-->((?!<!--).)*

with an empty string.

A short explanation:

(?s)              # enable DOT-ALL
((?!-->).)*<!--   # match anything except '-->' ending with '<!--'
|                 # OR
-->((?!<!--).)*   # match '-->' followed by anything except '<!--'

Be careful when using regex to process (X)HTML. Whenever part of a comment delimiter appears inside a tag attribute or a CDATA block, things will go wrong.
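To make that caveat concrete, here is a minimal JavaScript sketch. Note that it uses a different technique from the pattern above: instead of deleting everything outside the comments, it collects the comment bodies with matchAll and joins them. The helper name extractComments and both sample strings are made up for illustration. The second call shows the kind of input that trips up a regex-based approach: comment-like text inside an attribute value.

```javascript
// Hypothetical helper (not from the answer): collect the bodies of all
// <!-- ... --> comments and concatenate them.
function extractComments(html) {
  // [\s\S]*? lazily matches any characters, including newlines,
  // up to the nearest "-->".
  const matches = html.matchAll(/<!--([\s\S]*?)-->/g);
  return Array.from(matches, (m) => m[1]).join("");
}

const sample =
  "<div><!--<b>Test</b>-->Test</div>\n<div><!--<b>Test2</b>-->Test2</div>";
console.log(extractComments(sample)); // <b>Test</b><b>Test2</b>

// Caveat: the regex cannot tell a real comment from comment-like text
// inside an attribute value, so "oops" is extracted here as well.
const tricky = '<div title="<!--oops-->">x</div><!--<i>real</i>-->';
console.log(extractComments(tricky)); // oops<i>real</i>
```

If inputs like the second one are possible, a real HTML parser is probably the safer choice.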

EDIT

Seeing that your most active tag is JavaScript, here is a JS demo:

console.log(
  "<div><!--<b>Test</b>-->Test</div>\n<div><!--<b>Test2</b>-->Test2</div>"
  .replace(
    /((?!-->)[\s\S])*<!--|-->((?!<!--)[\s\S])*/g,
    ""
  )
);

which prints:

<b>Test</b><b>Test2</b>

Note that since JS does not support the inline (?s) flag, I used the equivalent [\s\S], which matches any character (including line breaks).

Tested on Ideone: http://ideone.com/6yQaK
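As a side note, engines that implement ES2018 or later accept the s (dotAll) flag after the closing slash of a regex literal, which makes . match line terminators too. A minimal sketch, assuming such an engine (current browsers and Node.js should qualify); the variable names are illustrative only:

```javascript
const input =
  "<div><!--<b>Test</b>-->Test</div>\n<div><!--<b>Test2</b>-->Test2</div>";

// Same pattern as the demo, but written with "." plus the s (dotAll)
// flag instead of [\s\S].
const withDotAll = input.replace(/((?!-->).)*<!--|-->((?!<!--).)*/gs, "");

// The [\s\S] version from the demo, for comparison.
const withClass = input.replace(
  /((?!-->)[\s\S])*<!--|-->((?!<!--)[\s\S])*/g,
  ""
);

console.log(withDotAll);               // <b>Test</b><b>Test2</b>
console.log(withDotAll === withClass); // true
```

The [\s\S] form remains the more portable of the two if older engines have to be supported.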

EDIT II

A PHP demo would look like:

<?php
$s = "<div><!--<b>Test</b>-->Test</div>\n<div><!--<b>Test2</b>-->Test2</div>";
echo preg_replace('/(?s)((?!-->).)*<!--|-->((?!<!--).)*/', '', $s);
?>

which also prints:

<b>Test</b><b>Test2</b>

as can be seen on Ideone: http://ideone.com/Bm2uJ

Answer 2

s/-->.*?<!--//g strips off anything between "-->" and the next "<!--"

s/^.*?<!--// strips off from the beginning to the first occurrence of "<!--"

s/-->.*?$// strips off from the last occurrence of "-->" to the end

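.* matches any number of characters; .*? matches as few characters as possible while still letting the whole pattern match.

^ stands for the beginning of the string, and $ for the end.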
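The three substitutions could be chained roughly as follows. This is only a sketch in JavaScript rather than the s/// notation used above, and it makes two assumptions: the comment opener is the standard "<!--", and the input may span several lines, so [\s\S] is used in place of . to cross line breaks. The function name keepOnlyComments is hypothetical.

```javascript
// Hypothetical helper: apply the three substitutions in order.
// Assumes well-formed, non-nested comments opened by "<!--".
function keepOnlyComments(html) {
  let s = html;
  // 1. Drop everything between a "-->" and the next "<!--".
  s = s.replace(/-->[\s\S]*?<!--/g, "");
  // 2. Drop everything from the start through the first "<!--".
  s = s.replace(/^[\s\S]*?<!--/, "");
  // 3. Drop everything from the one remaining "-->" to the end.
  s = s.replace(/-->[\s\S]*?$/, "");
  return s;
}

const input =
  "<div><!--<b>Test</b>-->Test</div>\n<div><!--<b>Test2</b>-->Test2</div>";
console.log(keepOnlyComments(input)); // <b>Test</b><b>Test2</b>
```

The order matters: after step 1 only one "<!--" and one "-->" are left, which is what lets steps 2 and 3 each get away with a single, non-global replacement. If the input contains no comment at all, none of the patterns match and the string comes back unchanged.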

Answer 3

Another possibility is

.*?<!--(.*?)-->.*?(?=<!--|$)

and replace with

$1

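See it here on Regexr.

If you read your string line by line, this matches everything up to the first comment, puts the content of that comment into group 1, and then matches everything up to the end of the line or the next comment.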
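A line-by-line application might look like this in JavaScript. It is a sketch only: splitting on "\n" stands in for "reading line by line", and the helper name commentsPerLine is made up.

```javascript
// Hypothetical helper: apply the pattern to each line separately and
// keep only group 1 (the comment body) of every match.
function commentsPerLine(text) {
  const pattern = /.*?<!--(.*?)-->.*?(?=<!--|$)/g;
  return text.split("\n").map((line) => line.replace(pattern, "$1"));
}

const input =
  "<div><!--<b>Test</b>-->Test</div>\n<div><!--<b>Test2</b>-->Test2</div>";
const lines = commentsPerLine(input);
console.log(lines.join("")); // <b>Test</b><b>Test2</b>
```

One thing to keep in mind: a line that contains no comment does not match the pattern at all, so it is passed through unchanged and would need to be filtered out separately.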

Regex to remove anything that is not an HTML comment
