Question

Is there a simple way to remove all HTML tags, or anything HTML-related, from a string?

For example:

string title = "<b> Hulk Hogan's Celebrity Championship Wrestling &nbsp;&nbsp;&nbsp;<font color=\"#228b22\">[Proj # 206010]</font></b>&nbsp;&nbsp;&nbsp; (Reality Series, &nbsp;)";

The above should really be:

"Hulk Hogan's Celebrity Championship Wrestling [Proj # 206010] (Reality Series)"

Answer 1

You can use a simple regular expression like this:

public static string StripHTML(string input)
{
   return Regex.Replace(input, "<.*?>", String.Empty);
}

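Note that this solution has its own flaws. See Remove HTML tags in String for details (especially the comments by @mehaase).

Another solution is to use the HTML Agility Pack. You can find an example that uses the library here: HTML agility pack - removing unwanted tags without removing content?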
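As a rough illustration of the kind of flaw meant here, the sketch below feeds two awkward inputs to the same StripHTML method. The class name StripHtmlPitfalls and both input strings are made up for this example; they are not taken from the linked thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripHtmlPitfalls
{
    // Same method as in the answer above.
    public static string StripHTML(string input)
    {
        return Regex.Replace(input, "<.*?>", String.Empty);
    }

    public static void Main()
    {
        // A '>' inside an attribute value ends the lazy match too early,
        // so part of the attribute leaks into the result.
        Console.WriteLine(StripHTML("<a title=\"a > b\">link</a>"));

        // By default '.' does not match '\n', so a tag that spans two lines
        // is not matched and stays in the output.
        Console.WriteLine(StripHTML("<b\nclass=\"x\">bold</b>"));
    }
}
```

Passing RegexOptions.Singleline would address the second case but not the first, which is one reason a real HTML parser tends to be the safer choice for untrusted markup.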

Answer 2

You can use the Html Agility Pack to parse the string and get the InnerText.

    HtmlDocument htmlDoc = new HtmlDocument();
    // In a verbatim (@"...") string, a double quote is escaped by doubling it, not with a backslash.
    htmlDoc.LoadHtml(@"<b> Hulk Hogan's Celebrity Championship Wrestling &nbsp;&nbsp;&nbsp;<font color=""#228b22"">[Proj # 206010]</font></b>&nbsp;&nbsp;&nbsp; (Reality Series, &nbsp;)");
    string result = htmlDoc.DocumentNode.InnerText;

Answer 3

You can run the following code on the string, and you will get the complete string without the HTML parts.

string title = "<b> Hulk Hogan's Celebrity Championship Wrestling &nbsp;&nbsp;&nbsp;<font color=\"#228b22\">[Proj # 206010]</font></b>&nbsp;&nbsp;&nbsp; (Reality Series, &nbsp;)".Replace("&nbsp;", string.Empty);
string s = Regex.Replace(title, "<.*?>", String.Empty);
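The code above only handles &nbsp; and leaves other entities and runs of spaces in place. A minimal sketch of a more general variant follows, assuming it is acceptable to decode entities with System.Net.WebUtility.HtmlDecode and to collapse whitespace. The names HtmlText and StripHtmlToText are hypothetical, and the regex has the same limitations as in the other answers.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper: strips tags, decodes entities, tidies whitespace.
    public static string StripHtmlToText(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return string.Empty;
        }

        // Same pattern as above; Singleline lets '.' match newlines inside a tag.
        string noTags = Regex.Replace(input, "<.*?>", string.Empty, RegexOptions.Singleline);

        // Decode entities such as &nbsp; and &amp; into their characters.
        string decoded = WebUtility.HtmlDecode(noTags);

        // &nbsp; decodes to U+00A0; turn it into a plain space.
        string normalized = decoded.Replace('\u00A0', ' ');

        // Collapse runs of whitespace and trim both ends.
        return Regex.Replace(normalized, @"\s+", " ").Trim();
    }
}
```

Applied to the title string from the question, this should still leave the trailing ", )" before the closing parenthesis, so matching the expected output exactly would need one more cleanup step specific to that data.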

How do I remove all HTML tags from a string without knowing which tags are in it?
