Question

I'm trying to use Regex in C# to remove the content between h2 tags in a string:

<h2>content needs removing</h2> other content...

I have the following regex which, according to RegexBuddy (the software I used to test it), should work, but it doesn't:

myString = Regex.Replace(myString, @"<h[0-9]>.*</h[0-9]>", String.Empty);

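I have another regex that runs after this one and removes all other HTML tags; it is called the same way and works fine. Can anyone help me figure out why this one isn't working?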

Answer 1

Don't use regular expressions for this.

HTML is not a regular language, so it cannot be parsed correctly with regular expressions.

For example, your regex will match:

<h2>sample</h1>

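which is invalid, because the opening and closing tags don't match. It also gives unexpected results on nested or repeated structures: .* is greedy, so it matches everything up to the last closing h[0-9] tag in the input HTML string.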
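The mismatch can be checked directly. The following is a minimal sketch, not part of the original answer: the class name HeadingMatchDemo and both method names are made up for illustration, and the backreference variant is only one possible way to tie the closing level to the opening one. It does not make regex a sound HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeadingMatchDemo
{
    // Pattern from the question: the opening and closing heading levels
    // are matched independently, so <h2>...</h1> is accepted.
    public static bool LooseMatches(string html) =>
        Regex.IsMatch(html, @"<h[0-9]>.*</h[0-9]>");

    // Illustrative variant: \1 must repeat the digit captured in the
    // opening tag, so the closing level has to be the same.
    public static bool StrictMatches(string html) =>
        Regex.IsMatch(html, @"<h([0-9])>.*</h\1>");
}
```

With the input `<h2>sample</h1>`, LooseMatches returns true while StrictMatches returns false.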

You can use XmlDocument (HTML is not XML, but this is good enough for your purposes), or you can use the Html Agility Pack.
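As a rough sketch of the XmlDocument route, assuming the fragment happens to be well-formed XML: the class name XmlHeadingRemover, the method RemoveH2, and the wrapper element `root` are hypothetical names chosen for this example. Real-world HTML (unclosed tags, entities such as `&nbsp;`) will make LoadXml throw an XmlException, which is where an HTML parser like the Html Agility Pack is the better fit.

```csharp
using System;
using System.Linq;
using System.Xml;

public static class XmlHeadingRemover
{
    // Removes every <h2> element (and its content) from a well-formed fragment.
    public static string RemoveH2(string fragment)
    {
        var doc = new XmlDocument();
        // Wrap the fragment so the document has a single root element.
        doc.LoadXml("<root>" + fragment + "</root>");

        // Copy the matches to a list first so nodes can be removed safely.
        var headings = doc.SelectNodes("//h2").Cast<XmlNode>().ToList();
        foreach (var heading in headings)
        {
            heading.ParentNode.RemoveChild(heading);
        }

        // Return the content without the artificial wrapper element.
        return doc.DocumentElement.InnerXml;
    }
}
```

Calling `XmlHeadingRemover.RemoveH2("<h2>content needs removing</h2> other content...")` should leave only the text that followed the heading.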

Answer 2

Try this code:

String sourcestring = "<h2>content needs removing</h2> other content...";
String matchpattern = @"\s?<h[0-9]>[^<]+</h[0-9]>\s?";
String replacementpattern = @"";
MessageBox.Show(Regex.Replace(sourcestring,matchpattern,replacementpattern));

[^<]+ is safer than .+ because it stops consuming characters at the next <.
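The difference shows up as soon as the input has more than one heading. This is a small comparison sketch; the class name NegatedClassDemo and its two method names are invented for the example, and the `.+` pattern is a contrast case rather than something this answer proposed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegatedClassDemo
{
    // Pattern from this answer: [^<]+ cannot run past the next '<',
    // so each heading is matched on its own.
    public static string StripNegated(string html) =>
        Regex.Replace(html, @"\s?<h[0-9]>[^<]+</h[0-9]>\s?", string.Empty);

    // Contrast case: greedy .+ runs to the last closing heading tag.
    public static string StripGreedy(string html) =>
        Regex.Replace(html, @"\s?<h[0-9]>.+</h[0-9]>\s?", string.Empty);
}
```

For the input `<h2>first</h2> keep <h3>second</h3> end`, StripNegated returns `keepend` and StripGreedy returns `end`. Two caveats follow from the pattern itself: the optional `\s?` on each side also swallows one neighbouring whitespace character, and a heading that contains another tag (for example `<h2><b>x</b></h2>`) is not matched at all.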

Answer 3

This works fine for me:

string myString = "<h2>content needs removing</h2> other content...";
Console.WriteLine(myString);
myString = Regex.Replace(myString, "<h[0-9]>.*</h[0-9]>", string.Empty);
Console.WriteLine(myString);

Displays:

<h2>content needs removing</h2> other content...
other content...

As expected.

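If your problem is that your real input contains several different heading tags, then you are running into the greedy * quantifier, which produces the longest possible match. For example, if you have: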

<h2>content needs removing</h2> other content...<h3>some more headings</h3> and some other stuff

you will match everything from <h2> to </h3> and replace it. To fix this, use a lazy quantifier:

myString = Regex.Replace(myString, "<h[0-9]>.*?</h[0-9]>", string.Empty);

which leaves you with:

other content... and some other stuff

Note, however, that this does not handle nested <h> tags. As @fardjad said, using Regex on HTML is generally not a good idea.
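To see what goes wrong with nesting, here is a short sketch using the lazy pattern from this answer. The class name LazyNestingDemo, the method name StripLazy, and the sample input are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyNestingDemo
{
    // Lazy pattern from this answer: stops at the first closing heading tag.
    public static string StripLazy(string html) =>
        Regex.Replace(html, "<h[0-9]>.*?</h[0-9]>", string.Empty);
}
```

For the input `<h2>outer <h3>inner</h3> tail</h2> rest`, the match ends at the inner `</h3>`, so StripLazy returns ` tail</h2> rest` and leaves a stray closing tag behind.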
