我有一个存储在表格中的Html片段。 不是整页,没有标签等,只是基本的格式化。
我希望能够在给定页面上显示Html仅为文本,无格式(实际上只是前30到50个字符,但这很容易)。
如何将该Html中的“text”作为直文放入字符串?
所以这段代码。
<b>Hello World.</b><br/><p><i>Is there anyone out there?</i><p>
变为:
Hello World。那里有人吗?
答案 0 :(得分:81)
免费开源HtmlAgilityPack有一个in one of its samples方法,可以将HTML转换为纯文本。
var plainText = HtmlUtilities.ConvertToPlainText(string html);
输入一个HTML字符串,如
<b>hello, <i>world!</i></b>
你会得到一个纯文本结果,如:
hello world!
答案 1 :(得分:37)
我无法使用HtmlAgilityPack,所以我为自己写了第二个最佳解决方案
private static string HtmlToPlainText(string html)
{
const string tagWhiteSpace = @"(>|$)(\W|\n|\r)+<";//matches one or more (white space or line breaks) between '>' and '<'
const string stripFormatting = @"<[^>]*(>|$)";//match any character between '<' and '>', even when end tag is missing
const string lineBreak = @"<(br|BR)\s{0,1}\/{0,1}>";//matches: <br>,<br/>,<br />,<BR>,<BR/>,<BR />
var lineBreakRegex = new Regex(lineBreak, RegexOptions.Multiline);
var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline);
var tagWhiteSpaceRegex = new Regex(tagWhiteSpace, RegexOptions.Multiline);
var text = html;
//Decode html specific characters
text = System.Net.WebUtility.HtmlDecode(text);
//Remove tag whitespace/line breaks
text = tagWhiteSpaceRegex.Replace(text, "><");
//Replace <br /> with line breaks
text = lineBreakRegex.Replace(text, Environment.NewLine);
//Strip formatting
text = stripFormattingRegex.Replace(text, string.Empty);
return text;
}
答案 2 :(得分:23)
如果您正在谈论标签剥离,如果您不必担心<script>
标签之类的问题,则相对简单。如果你需要做的就是显示没有标签的文本,你可以使用正则表达式来完成:
<[^>]*>
如果你不必担心<script>
标签之类的话,那么你需要比正则表达式更强大的功能,因为你需要跟踪状态,更像是一个Context Free Grammar(CFG)。虽然你可以用“从左到右”或非贪婪的匹配来完成它。
如果您可以使用正则表达式,那么有许多网页都有很好的信息:
如果您需要更复杂的CFG行为,我建议您使用第三方工具,遗憾的是我不知道推荐使用它的好方法。
答案 3 :(得分:20)
HTTPUtility.HTMLEncode()
用于将HTML标记编码为字符串。它会为您解决所有繁重的工作。来自MSDN Documentation:
如果在HTTP流中传递空格和标点符号等字符,则可能会在接收端错误解释它们。 HTML编码将HTML中不允许的字符转换为字符实体等价物; HTML解码反转了编码。例如,当嵌入文本块时,字符
<
和>
被编码为<
和>
用于HTTP传输。
HTTPUtility.HTMLEncode()
方法,详细here:
public static void HtmlEncode(
string s,
TextWriter output
)
用法:
String TestString = "This is a <Test String>.";
StringWriter writer = new StringWriter();
Server.HtmlEncode(TestString, writer);
String EncodedString = writer.ToString();
答案 4 :(得分:10)
要添加到vfilby的答案,您只需在代码中执行RegEx替换;不需要新的课程。如果像我这样的其他新手在这个问题上遇到了麻烦。
using System.Text.RegularExpressions;
则...
private string StripHtml(string source)
{
string output;
//get rid of HTML tags
output = Regex.Replace(source, "<[^>]*>", string.Empty);
//get rid of multiple blank lines
output = Regex.Replace(output, @"^\s*$\n", string.Empty, RegexOptions.Multiline);
return output;
}
答案 5 :(得分:6)
将HTML转换为纯文本的三步流程
首先你需要为HtmlAgilityPack安装Nuget包 第二次创建此类
public class HtmlToText
{
public HtmlToText()
{
}
public string Convert(string path)
{
HtmlDocument doc = new HtmlDocument();
doc.Load(path);
StringWriter sw = new StringWriter();
ConvertTo(doc.DocumentNode, sw);
sw.Flush();
return sw.ToString();
}
public string ConvertHtml(string html)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
StringWriter sw = new StringWriter();
ConvertTo(doc.DocumentNode, sw);
sw.Flush();
return sw.ToString();
}
private void ConvertContentTo(HtmlNode node, TextWriter outText)
{
foreach(HtmlNode subnode in node.ChildNodes)
{
ConvertTo(subnode, outText);
}
}
public void ConvertTo(HtmlNode node, TextWriter outText)
{
string html;
switch(node.NodeType)
{
case HtmlNodeType.Comment:
// don't output comments
break;
case HtmlNodeType.Document:
ConvertContentTo(node, outText);
break;
case HtmlNodeType.Text:
// script and style must not be output
string parentName = node.ParentNode.Name;
if ((parentName == "script") || (parentName == "style"))
break;
// get text
html = ((HtmlTextNode)node).Text;
// is it in fact a special closing node output as text?
if (HtmlNode.IsOverlappedClosingElement(html))
break;
// check the text is meaningful and not a bunch of whitespaces
if (html.Trim().Length > 0)
{
outText.Write(HtmlEntity.DeEntitize(html));
}
break;
case HtmlNodeType.Element:
switch(node.Name)
{
case "p":
// treat paragraphs as crlf
outText.Write("\r\n");
break;
}
if (node.HasChildNodes)
{
ConvertContentTo(node, outText);
}
break;
}
}
}
参考Judah Himango的答案使用上述课程
第三,您需要创建上述类的Object并使用ConvertHtml(HTMLContent)
方法将HTML转换为纯文本而不是ConvertToPlainText(string html);
HtmlToText htt=new HtmlToText();
var plainText = htt.ConvertHtml(HTMLContent);
答案 6 :(得分:4)
它具有不折叠长内联空白的限制,但它绝对是可移植的并且遵循webbrowser等布局。
static string HtmlToPlainText(string html) {
string buf;
string block = "address|article|aside|blockquote|canvas|dd|div|dl|dt|" +
"fieldset|figcaption|figure|footer|form|h\\d|header|hr|li|main|nav|" +
"noscript|ol|output|p|pre|section|table|tfoot|ul|video";
string patNestedBlock = $"(\\s*?</?({block})[^>]*?>)+\\s*";
buf = Regex.Replace(html, patNestedBlock, "\n", RegexOptions.IgnoreCase);
// Replace br tag to newline.
buf = Regex.Replace(buf, @"<(br)[^>]*>", "\n", RegexOptions.IgnoreCase);
// (Optional) remove styles and scripts.
buf = Regex.Replace(buf, @"<(script|style)[^>]*?>.*?</\1>", "", RegexOptions.Singleline);
// Remove all tags.
buf = Regex.Replace(buf, @"<[^>]*(>|$)", "", RegexOptions.Multiline);
// Replace HTML entities.
buf = WebUtility.HtmlDecode(buf);
return buf;
}
答案 7 :(得分:4)
在HtmlAgilityPack中没有名称为“ConvertToPlainText”的方法,但您可以将html字符串转换为CLEAR字符串:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlString);
var textString = doc.DocumentNode.InnerText;
Regex.Replace(textString , @"<(.|n)*?>", string.Empty).Replace(" ", "");
这对我有用。但是我没有在'HtmlAgilityPack'中找到名称'ConvertToPlainText'的方法。
答案 8 :(得分:4)
我找到的最简单的方法:
public static void main(String[] args)
{
Scanner ip = new Scanner(System.in);
int a;
System.out.println("Enter Some Input");
try{
a = ip.nextInt();
}
catch(InputMismatchException msg){
System.out.println("Input Mismatch Exception has occured " + msg.getMessage());
}
}
HtmlFilter类位于Microsoft.TeamFoundation.WorkItemTracking.Controls.dll
dll可以在这样的文件夹中找到: %ProgramFiles%\ Common Files \ microsoft shared \ Team Foundation Server \ 14.0 \
在VS 2015中,dll还需要引用位于同一文件夹中的Microsoft.TeamFoundation.WorkItemTracking.Common.dll。
答案 9 :(得分:3)
我认为最简单的方法是制作一个'字符串'扩展方法(根据理查德建议的用户):
using System;
using System.Text.RegularExpressions;
public static class StringHelpers
{
public static string StripHTML(this string HTMLText)
{
var reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
return reg.Replace(HTMLText, "");
}
}
然后在程序中的任何'string'变量上使用此扩展方法:
var yourHtmlString = "<div class=\"someclass\"><h2>yourHtmlText</h2></span>";
var yourTextString = yourHtmlString.StripHTML();
我使用这种扩展方法将html格式化的注释转换为纯文本,这样它就能在水晶报表上正确显示,而且效果很好!
答案 10 :(得分:2)
如果您的数据包含HTML标记,并且您想要显示它以便某人可以查看标记,请使用HttpServerUtility :: HtmlEncode。
如果您的数据中包含HTML标记,并且您希望用户看到呈现的标记,则按原样显示文本。如果文本代表整个网页,请使用IFRAME。
如果您的数据包含HTML标记,并且您想要删除标记并只显示未格式化的文本,请使用正则表达式。
答案 11 :(得分:2)
我遇到了类似的问题,并找到了最佳解决方案。下面的代码对我来说很完美。
private string ConvertHtml_Totext(string source)
{
try
{
string result;
// Remove HTML Development formatting
// Replace line breaks with space
// because browsers inserts space
result = source.Replace("\r", " ");
// Replace line breaks with space
// because browsers inserts space
result = result.Replace("\n", " ");
// Remove step-formatting
result = result.Replace("\t", string.Empty);
// Remove repeating spaces because browsers ignore them
result = System.Text.RegularExpressions.Regex.Replace(result,
@"( )+", " ");
// Remove the header (prepare first by clearing attributes)
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<( )*head([^>])*>","<head>",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"(<( )*(/)( )*head( )*>)","</head>",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
"(<head>).*(</head>)",string.Empty,
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// remove all scripts (prepare first by clearing attributes)
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<( )*script([^>])*>","<script>",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"(<( )*(/)( )*script( )*>)","</script>",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
//result = System.Text.RegularExpressions.Regex.Replace(result,
// @"(<script>)([^(<script>\.</script>)])*(</script>)",
// string.Empty,
// System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"(<script>).*(</script>)",string.Empty,
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// remove all styles (prepare first by clearing attributes)
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<( )*style([^>])*>","<style>",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"(<( )*(/)( )*style( )*>)","</style>",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
"(<style>).*(</style>)",string.Empty,
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// insert tabs in spaces of <td> tags
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<( )*td([^>])*>","\t",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// insert line breaks in places of <BR> and <LI> tags
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<( )*br( )*>","\r",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<( )*li( )*>","\r",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// insert line paragraphs (double line breaks) in place
// if <P>, <DIV> and <TR> tags
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<( )*div([^>])*>","\r\r",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<( )*tr([^>])*>","\r\r",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<( )*p([^>])*>","\r\r",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// Remove remaining tags like <a>, links, images,
// comments etc - anything that's enclosed inside < >
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<[^>]*>",string.Empty,
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// replace special characters:
result = System.Text.RegularExpressions.Regex.Replace(result,
@" "," ",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"•"," * ",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"‹","<",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"›",">",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"™","(tm)",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"⁄","/",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<","<",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@">",">",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"©","(c)",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
@"®","(r)",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// Remove all others. More can be added, see
// http://hotwired.lycos.com/webmonkey/reference/special_characters/
result = System.Text.RegularExpressions.Regex.Replace(result,
@"&(.{2,6});", string.Empty,
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// for testing
//System.Text.RegularExpressions.Regex.Replace(result,
// this.txtRegex.Text,string.Empty,
// System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// make line breaking consistent
result = result.Replace("\n", "\r");
// Remove extra line breaks and tabs:
// replace over 2 breaks with 2 and over 4 tabs with 4.
// Prepare first to remove any whitespaces in between
// the escaped characters and remove redundant tabs in between line breaks
result = System.Text.RegularExpressions.Regex.Replace(result,
"(\r)( )+(\r)","\r\r",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
"(\t)( )+(\t)","\t\t",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
"(\t)( )+(\r)","\t\r",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result,
"(\r)( )+(\t)","\r\t",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// Remove redundant tabs
result = System.Text.RegularExpressions.Regex.Replace(result,
"(\r)(\t)+(\r)","\r\r",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// Remove multiple tabs following a line break with just one tab
result = System.Text.RegularExpressions.Regex.Replace(result,
"(\r)(\t)+","\r\t",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// Initial replacement target string for line breaks
string breaks = "\r\r\r";
// Initial replacement target string for tabs
string tabs = "\t\t\t\t\t";
for (int index=0; index<result.Length; index++)
{
result = result.Replace(breaks, "\r\r");
result = result.Replace(tabs, "\t\t\t\t");
breaks = breaks + "\r";
tabs = tabs + "\t";
}
// That's it.
return result;
}
catch
{
MessageBox.Show("Error");
return source;
}
}
\ n和\ r之类的转义字符必须首先删除,因为它们会导致正则表达式按预期停止工作。
此外,要使结果字符串正确显示在文本框中,可能需要将其拆分并设置文本框的Lines属性,而不是分配给Text属性。
this.txtResult.Lines = StripHTML(this.txtSource.Text).Split(“ \ r” .ToCharArray());
来源:https://www.codeproject.com/Articles/11902/Convert-HTML-to-Plain-Text-2
答案 12 :(得分:0)
这是我的解决方案:
public static void main(String[] args) throws IOException {
System.out.println(method1("numbs.txt")); // if file is not in same folder
System.out.println(method2("numbs.txt")); //use absolute path: C:\users\...\numbs.txt
}
示例:
public string StripHTML(string html)
{
var regex = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
return System.Web.HttpUtility.HtmlDecode((regex.Replace(html, "")));
}
答案 13 :(得分:0)
我有同样的问题,只是我的html有一个简单的预先知道的布局,如:
<DIV><P>abc</P><P>def</P></DIV>
所以我最终使用了这么简单的代码:
string.Join (Environment.NewLine, XDocument.Parse (html).Root.Elements ().Select (el => el.Value))
哪个输出:
abc
def
答案 14 :(得分:0)
没有写但是使用:
using HtmlAgilityPack;
using System;
using System.IO;
using System.Text.RegularExpressions;
namespace foo {
//small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
public static class HtmlToText {
public static string Convert(string path) {
HtmlDocument doc = new HtmlDocument();
doc.Load(path);
return ConvertDoc(doc);
}
public static string ConvertHtml(string html) {
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
return ConvertDoc(doc);
}
public static string ConvertDoc(HtmlDocument doc) {
using (StringWriter sw = new StringWriter()) {
ConvertTo(doc.DocumentNode, sw);
sw.Flush();
return sw.ToString();
}
}
internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo) {
foreach (HtmlNode subnode in node.ChildNodes) {
ConvertTo(subnode, outText, textInfo);
}
}
public static void ConvertTo(HtmlNode node, TextWriter outText) {
ConvertTo(node, outText, new PreceedingDomTextInfo(false));
}
internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo) {
string html;
switch (node.NodeType) {
case HtmlNodeType.Comment:
// don't output comments
break;
case HtmlNodeType.Document:
ConvertContentTo(node, outText, textInfo);
break;
case HtmlNodeType.Text:
// script and style must not be output
string parentName = node.ParentNode.Name;
if ((parentName == "script") || (parentName == "style")) {
break;
}
// get text
html = ((HtmlTextNode)node).Text;
// is it in fact a special closing node output as text?
if (HtmlNode.IsOverlappedClosingElement(html)) {
break;
}
// check the text is meaningful and not a bunch of whitespaces
if (html.Length == 0) {
break;
}
if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace) {
html = html.TrimStart();
if (html.Length == 0) { break; }
textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
}
outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1])) {
outText.Write(' ');
}
break;
case HtmlNodeType.Element:
string endElementString = null;
bool isInline;
bool skip = false;
int listIndex = 0;
switch (node.Name) {
case "nav":
skip = true;
isInline = false;
break;
case "body":
case "section":
case "article":
case "aside":
case "h1":
case "h2":
case "header":
case "footer":
case "address":
case "main":
case "div":
case "p": // stylistic - adjust as you tend to use
if (textInfo.IsFirstTextOfDocWritten) {
outText.Write("\r\n");
}
endElementString = "\r\n";
isInline = false;
break;
case "br":
outText.Write("\r\n");
skip = true;
textInfo.WritePrecedingWhiteSpace = false;
isInline = true;
break;
case "a":
if (node.Attributes.Contains("href")) {
string href = node.Attributes["href"].Value.Trim();
if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase) == -1) {
endElementString = "<" + href + ">";
}
}
isInline = true;
break;
case "li":
if (textInfo.ListIndex > 0) {
outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
} else {
outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
}
isInline = false;
break;
case "ol":
listIndex = 1;
goto case "ul";
case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
endElementString = "\r\n";
isInline = false;
break;
case "img": //inline-block in reality
if (node.Attributes.Contains("alt")) {
outText.Write('[' + node.Attributes["alt"].Value);
endElementString = "]";
}
if (node.Attributes.Contains("src")) {
outText.Write('<' + node.Attributes["src"].Value + '>');
}
isInline = true;
break;
default:
isInline = true;
break;
}
if (!skip && node.HasChildNodes) {
ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten) { ListIndex = listIndex });
}
if (endElementString != null) {
outText.Write(endElementString);
}
break;
}
}
}
internal class PreceedingDomTextInfo {
public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten) {
IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
}
public bool WritePrecedingWhiteSpace { get; set; }
public bool LastCharWasSpace { get; set; }
public readonly BoolWrapper IsFirstTextOfDocWritten;
public int ListIndex { get; set; }
}
internal class BoolWrapper {
public BoolWrapper() { }
public bool Value { get; set; }
public static implicit operator bool(BoolWrapper boolWrapper) {
return boolWrapper.Value;
}
public static implicit operator BoolWrapper(bool boolWrapper) {
return new BoolWrapper { Value = boolWrapper };
}
}
}
答案 15 :(得分:0)
取决于“html”的含义。最复杂的情况是完整的网页。这也是最容易处理的,因为您可以使用文本模式的Web浏览器。请参阅Wikipedia article列出的网络浏览器,包括文字模式浏览器。 Lynx可能是最知名的,但其他一个可能更适合您的需求。
答案 16 :(得分:0)
我认为它有一个简单的答案:
public string RemoveHTMLTags(string HTMLCode)
{
string str=System.Text.RegularExpressions.Regex.Replace(HTMLCode, "<[^>]*>", "");
return str;
}
答案 17 :(得分:0)
对于希望为给定的html文档的文本缩写查找OP问题的确切解决方案的人,请使用以下解决方案。
与每个提议的解决方案一样,以下代码也有一些假设:
he<span>ll</span>o
应该输出hello
。内联列表
标签:https://www.w3schools.com/htmL/html_blocks.asp 考虑到上述情况,以下带有已编译正则表达式的字符串扩展名将输出有关html转义字符的预期纯文本,而在null输入上将输出null。
public static class StringExtensions
{
public static string ConvertToPlain(this string html)
{
if (html == null)
{
return html;
}
html = scriptRegex.Replace(html, string.Empty);
html = inlineTagRegex.Replace(html, string.Empty);
html = tagRegex.Replace(html, " ");
html = HttpUtility.HtmlDecode(html);
html = multiWhitespaceRegex.Replace(html, " ");
return html.Trim();
}
private static readonly Regex inlineTagRegex = new Regex("<\\/?(a|span|sub|sup|b|i|strong|small|big|em|label|q)[^>]*>", RegexOptions.Compiled | RegexOptions.Singleline);
private static readonly Regex scriptRegex = new Regex("<(script|style)[^>]*?>.*?</\\1>", RegexOptions.Compiled | RegexOptions.Singleline);
private static readonly Regex tagRegex = new Regex("<[^>]+>", RegexOptions.Compiled | RegexOptions.Singleline);
private static readonly Regex multiWhitespaceRegex = new Regex("\\s+", RegexOptions.Compiled | RegexOptions.Singleline);
}
答案 18 :(得分:-4)
public static string StripTags2(string html) { return html.Replace(“&lt;”,“&lt;”)。Replace(“&gt;”,“&gt;”); }
通过这个你逃脱所有“&lt;”和“&gt;”在一个字符串中。这是你想要的吗?