删除String中的HTML标记

时间:2011-02-02 18:42:34

标签: c# html

如何从以下字符串中删除HTML标记?

<P style="MARGIN: 0cm 0cm 10pt" class=MsoNormal><SPAN style="LINE-HEIGHT: 115%; 
FONT-FAMILY: 'Verdana','sans-serif'; COLOR: #333333; FONT-SIZE: 9pt">In an 
email sent just three days before the Deepwater Horizon exploded, the onshore 
<SPAN style="mso-bidi-font-weight: bold"><b>BP</b></SPAN> manager in charge of 
the drilling rig warned his supervisor that last-minute procedural changes were 
creating "chaos". April emails were given to government investigators by <SPAN 
style="mso-bidi-font-weight: bold"><b>BP</b></SPAN> and reviewed by The Wall 
Street Journal and are the most direct evidence yet that workers on the rig 
were unhappy with the numerous changes, and had voiced their concerns to <SPAN 
style="mso-bidi-font-weight: bold"><b>BP</b></SPAN>’s operations managers in 
Houston. This raises further questions about whether <SPAN 
style="mso-bidi-font-weight: bold"><b>BP</b></SPAN> managers properly 
considered the consequences of changes they ordered on the rig, an issue 
investigators say contributed to the disaster.</SPAN></p><br/>  

我正在将其写入Asponse.PDF,但HTML标记显示在PDF中。我该如何删除它们?

2 个答案:

答案 0 :(得分:93)

警告: This does not work for all cases and should not be used to process untrusted user input.

using System.Text.RegularExpressions;
...
const string HTML_TAG_PATTERN = "<.*?>";

static string StripHTML (string inputString)
{
   return Regex.Replace 
     (inputString, HTML_TAG_PATTERN, string.Empty);
}

答案 1 :(得分:10)

您应该使用HTML Agility Pack

HtmlDocument doc = ...
string text = doc.DocumentElement.InnerText;