Question

I have an HTML string like this:

<p>First Sentence is this.&#160;Second sentence is this.</p>

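I can remove the <p> tags from the string above using a regex function.

However, how do I remove the &#160; encoded character from the string above in WinForms?

I don't want &#160; to appear in the output.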

Answer 1

You can use XElement.Parse to get the node's value, like this:

 var htmlString = "<p>First Sentence is this.&#160;Second sentence is this.</p>";
 var result = System.Xml.Linq.XElement.Parse(htmlString).Value;

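If not all of your strings have a valid XML structure (for example, no single root element), or some contain no tags at all, you can wrap them in a dummy tag like this: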

 var htmlString = "<p>First Sentence is this.&#160;Second sentence is this.</p>";
 var result = System.Xml.Linq.XElement.Parse("<root>" + htmlString + "</root>").Value;


You may want to add error handling around this, but it is clearly better than using regular expressions.
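The error handling mentioned above could look roughly like the sketch below. `HtmlText` and `TryGetText` are hypothetical names introduced only for illustration. The sketch assumes that returning null is an acceptable way to signal a fragment that is not well-formed XML. Note that XML only predefines a handful of named entities, so a numeric reference such as &#160; parses, while a named one such as &nbsp; makes XElement.Parse throw.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

static class HtmlText
{
    // Hypothetical helper: returns the text content of an HTML fragment,
    // or null when the fragment cannot be parsed as XML.
    public static string TryGetText(string fragment)
    {
        try
        {
            // The dummy root lets plain text and multiple top-level tags parse.
            return XElement.Parse("<root>" + fragment + "</root>").Value;
        }
        catch (XmlException)
        {
            // Raised for malformed markup (e.g. an unclosed <p>) and for
            // named entities that XML does not define (e.g. &nbsp;).
            return null;
        }
    }
}
```

The returned text still contains the decoded character U+00A0 (a non-breaking space) where &#160; was, not an ordinary space, so a caller may want to replace it afterwards.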

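Edit

If this still doesn't work and you only want to deal with the entities, you can use the System.Web.HttpUtility.HtmlDecode method to replace HTML entities with their literal characters: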

var final_result = System.Web.HttpUtility.HtmlDecode(result);
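HttpUtility lives in the System.Web assembly, which a WinForms project may not reference by default. A possible alternative, offered here as a sketch rather than as part of the original answer, is System.Net.WebUtility.HtmlDecode, available since .NET Framework 4.0. `EntityDecoder` and `DecodeEntities` are hypothetical names, and replacing the non-breaking space with an ordinary space is an assumption about what the output should look like.

```csharp
using System;
using System.Net;

static class EntityDecoder
{
    // Hypothetical helper: decodes HTML entities, then turns non-breaking
    // spaces into ordinary spaces. It does not strip tags.
    public static string DecodeEntities(string text)
    {
        // &#160; and &nbsp; both decode to the single character U+00A0.
        string decoded = WebUtility.HtmlDecode(text);

        // Assumption: the caller wants a regular space in the output.
        return decoded.Replace('\u00A0', ' ');
    }
}
```

Because this only decodes entities, it would be applied to text whose tags have already been removed, such as the `result` value above.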

Answer 2

Considering that the input is just a plain string:

string x = "<p>First Sentence is this.&#160;Second sentence is this.</p>";
x = x.Replace("&#160;", " ");

This is overly simple, but it works.

Complete HTML strip function
