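I'm looking for C# code to convert an HTML document to plain text.
I'm not looking for simple tag stripping, but for something that outputs plain text with a reasonable preservation of the original layout.
The output should look like what Html2Txt @ W3C produces.
I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?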
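EDIT: I just downloaded the HTML Agility Pack from CodePlex and ran the Html2Txt project. What a disappointment (at least the module that does HTML-to-text conversion)! All it did was strip the tags, flatten the tables, and so on. The output looked nothing like what Html2Txt @ W3C produces. Too bad that source doesn't seem to be available. I was hoping there is a more "canned" solution out there.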
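EDIT 2: Thanks everybody for your suggestions. FlySwat tipped me in the direction I wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch, which sends the text to standard output, and capture stdout by setting ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all this in a C# class. This code will be called only occasionally, so I'm not too concerned about spawning a new process versus doing the work in code. Plus, Lynx is FAST!!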
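A minimal sketch of the Process approach described above. The class name LynxDump and the parameters lynxPath and target are made up for illustration; it assumes a lynx executable is installed at the path you pass in and that target is a local file path or URL.

```csharp
using System.Diagnostics;

public static class LynxDump
{
    // Builds the start info for: lynx -dump "<target>"
    // lynxPath and target are hypothetical parameters for this sketch.
    public static ProcessStartInfo BuildStartInfo(string lynxPath, string target)
    {
        return new ProcessStartInfo
        {
            FileName = lynxPath,
            Arguments = "-dump \"" + target + "\"",
            UseShellExecute = false,        // required to redirect streams
            RedirectStandardOutput = true,  // lets us read lynx's stdout
            CreateNoWindow = true
        };
    }

    // Runs lynx and returns everything it wrote to standard output.
    public static string Run(string lynxPath, string target)
    {
        using (Process process = Process.Start(BuildStartInfo(lynxPath, target)))
        {
            // Read before waiting so a full output buffer cannot block lynx.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }
}
```

Something like LynxDump.Run(@"C:\tools\lynx.exe", "page.html") would then return the rendered text; both paths are placeholders.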
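Answer 0 (score: 38)
Just a note about the HtmlAgilityPack for posterity. The project contains an example of converting HTML to text which, as the OP noted, does not handle whitespace at all the way anyone writing HTML would expect. There are full text-rendering solutions out there, mentioned by others on this question; this is not one of them (it cannot even handle tables in its current form), but it is lightweight and fast, which is all I wanted for creating a simple text version of HTML emails.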
using System; // needed for StringComparison
using System.IO;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
//small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
public static class HtmlToText
{
    public static string Convert(string path)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(path);
        return ConvertDoc(doc);
    }

    public static string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        return ConvertDoc(doc);
    }

    public static string ConvertDoc(HtmlDocument doc)
    {
        using (StringWriter sw = new StringWriter())
        {
            ConvertTo(doc.DocumentNode, sw);
            sw.Flush();
            return sw.ToString();
        }
    }

    internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
    {
        foreach (HtmlNode subnode in node.ChildNodes)
        {
            ConvertTo(subnode, outText, textInfo);
        }
    }

    public static void ConvertTo(HtmlNode node, TextWriter outText)
    {
        ConvertTo(node, outText, new PreceedingDomTextInfo(false));
    }

    internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
    {
        string html;
        switch (node.NodeType)
        {
            case HtmlNodeType.Comment:
                // don't output comments
                break;
            case HtmlNodeType.Document:
                ConvertContentTo(node, outText, textInfo);
                break;
            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                {
                    break;
                }
                // get text
                html = ((HtmlTextNode)node).Text;
                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                {
                    break;
                }
                // check the text is meaningful and not a bunch of whitespaces
                if (html.Length == 0)
                {
                    break;
                }
                if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace)
                {
                    html = html.TrimStart();
                    if (html.Length == 0) { break; }
                    textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
                }
                outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
                if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1]))
                {
                    outText.Write(' ');
                }
                break;
            case HtmlNodeType.Element:
                string endElementString = null;
                bool isInline;
                bool skip = false;
                int listIndex = 0;
                switch (node.Name)
                {
                    case "nav":
                        skip = true;
                        isInline = false;
                        break;
                    case "body":
                    case "section":
                    case "article":
                    case "aside":
                    case "h1":
                    case "h2":
                    case "header":
                    case "footer":
                    case "address":
                    case "main":
                    case "div":
                    case "p": // stylistic - adjust as you tend to use
                        if (textInfo.IsFirstTextOfDocWritten)
                        {
                            outText.Write("\r\n");
                        }
                        endElementString = "\r\n";
                        isInline = false;
                        break;
                    case "br":
                        outText.Write("\r\n");
                        skip = true;
                        textInfo.WritePrecedingWhiteSpace = false;
                        isInline = true;
                        break;
                    case "a":
                        if (node.Attributes.Contains("href"))
                        {
                            string href = node.Attributes["href"].Value.Trim();
                            if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase) == -1)
                            {
                                endElementString = "<" + href + ">";
                            }
                        }
                        isInline = true;
                        break;
                    case "li":
                        if (textInfo.ListIndex > 0)
                        {
                            outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
                        }
                        else
                        {
                            outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
                        }
                        isInline = false;
                        break;
                    case "ol":
                        listIndex = 1;
                        goto case "ul";
                    case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
                        endElementString = "\r\n";
                        isInline = false;
                        break;
                    case "img": //inline-block in reality
                        if (node.Attributes.Contains("alt"))
                        {
                            outText.Write('[' + node.Attributes["alt"].Value);
                            endElementString = "]";
                        }
                        if (node.Attributes.Contains("src"))
                        {
                            outText.Write('<' + node.Attributes["src"].Value + '>');
                        }
                        isInline = true;
                        break;
                    default:
                        isInline = true;
                        break;
                }
                if (!skip && node.HasChildNodes)
                {
                    ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten) { ListIndex = listIndex });
                }
                if (endElementString != null)
                {
                    outText.Write(endElementString);
                }
                break;
        }
    }
}

internal class PreceedingDomTextInfo
{
    public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten)
    {
        IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
    }
    public bool WritePrecedingWhiteSpace { get; set; }
    public bool LastCharWasSpace { get; set; }
    public readonly BoolWrapper IsFirstTextOfDocWritten;
    public int ListIndex { get; set; }
}

internal class BoolWrapper
{
    public BoolWrapper() { }
    public bool Value { get; set; }
    public static implicit operator bool(BoolWrapper boolWrapper)
    {
        return boolWrapper.Value;
    }
    public static implicit operator BoolWrapper(bool boolWrapper)
    {
        return new BoolWrapper { Value = boolWrapper };
    }
}
As an example, the following HTML code...
<!DOCTYPE HTML>
<html>
<head>
</head>
<body>
<header>
Whatever Inc.
</header>
<main>
<p>
Thanks for your enquiry. As this is the 1<sup>st</sup> time you have contacted us, we would like to clarify a few things:
</p>
<ol>
<li>
Please confirm this is your email by replying.
</li>
<li>
Then perform this step.
</li>
</ol>
<p>
Please solve this <img alt="complex equation" src="http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png"/>. Then, in any order, could you please:
</p>
<ul>
<li>
a point.
</li>
<li>
another point, with a <a href="http://en.wikipedia.org/wiki/Hyperlink">hyperlink</a>.
</li>
</ul>
<p>
Sincerely,
</p>
<p>
The whatever.com team
</p>
</main>
<footer>
Ph: 000 000 000<br/>
mail: whatever st
</footer>
</body>
</html>
...will be transformed into:
Whatever Inc.
Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:
1. Please confirm this is your email by replying.
2. Then perform this step.
Please solve this [complex equation<http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png>]. Then, in any order, could you please:
* a point.
* another point, with a hyperlink<http://en.wikipedia.org/wiki/Hyperlink>.
Sincerely,
The whatever.com team
Ph: 000 000 000
mail: whatever st
...as opposed to:
Whatever Inc.
Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:
Please confirm this is your email by replying.
Then perform this step.
Please solve this . Then, in any order, could you please:
a point.
another point, with a hyperlink.
Sincerely,
The whatever.com team
Ph: 000 000 000
mail: whatever st
Answer 1 (score: 30)
You could use this:
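// Requires: using System.Text.RegularExpressions; using System.Web;
public static string StripHTML(string HTMLText, bool decode = true)
{
    // Remove anything that looks like a tag, then optionally decode entities.
    Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    var stripped = reg.Replace(HTMLText, "");
    return decode ? HttpUtility.HtmlDecode(stripped) : stripped;
}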
Updated
Thanks for the comments; I have updated the answer to improve this function.
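Answer 2 (score: 17)
Answer 3 (score: 10)
What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other text browsers do... This is much harder to do than you would expect.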
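Answer 4 (score: 4)
Because I wanted conversion to plain text with line feeds and bullets, I found this very nice solution on codeproject, which covers many conversion use cases:
Yes, it looks big, but it works fine.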
Answer 5 (score: 3)
Have you tried http://www.aaronsw.com/2002/html2text/? It's Python, but open source.
Answer 6 (score: 3)
Assuming you have well-formed HTML, you could also try an XSL transform.
Here's an example:
using System;
using System.IO;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml.Xsl;

class Html2TextExample
{
    public static string Html2Text(XDocument source)
    {
        var writer = new StringWriter();
        Html2Text(source, writer);
        return writer.ToString();
    }

    public static void Html2Text(XDocument source, TextWriter output)
    {
        Transformer.Transform(source.CreateReader(), null, output);
    }

    public static XslCompiledTransform _transformer;

    // Lazily compiles a stylesheet that emits the string value of the document.
    public static XslCompiledTransform Transformer
    {
        get
        {
            if (_transformer == null)
            {
                _transformer = new XslCompiledTransform();
                // method="text" so that characters such as & and < are written unescaped
                var xsl = XDocument.Parse(@"<?xml version='1.0'?><xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" exclude-result-prefixes=""xsl""><xsl:output method=""text"" encoding=""UTF-8"" /><xsl:template match=""/""><xsl:value-of select=""."" /></xsl:template></xsl:stylesheet>");
                _transformer.Load(xsl.CreateNavigator());
            }
            return _transformer;
        }
    }

    static void Main(string[] args)
    {
        var html = XDocument.Parse("<html><body><div>Hello world!</div></body></html>");
        var text = Html2Text(html);
        Console.WriteLine(text);
    }
}
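Answer 7 (score: 2)
The easiest approach is probably tag stripping combined with replacing some tags with text layout elements, such as dashes for list elements (li) and line breaks for br and p. It shouldn't be too hard to extend this to tables.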
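A rough sketch of that idea, assuming regex-based replacement is acceptable for your input. The class name SimpleHtmlToText is made up for illustration, and this is not a real HTML parser: it collapses all whitespace (including inside pre), ignores tables, and will be confused by a ">" inside an attribute value.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class SimpleHtmlToText
{
    public static string Convert(string html)
    {
        const RegexOptions opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;

        // Drop script and style blocks entirely, including their content.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", "", opts);
        // Collapse source whitespace, as a browser would for normal flow text.
        text = Regex.Replace(text, @"\s+", " ");
        // Replace layout-relevant tags with plain-text equivalents.
        text = Regex.Replace(text, @"<br\s*/?>", "\n", opts);
        text = Regex.Replace(text, @"</p\s*>", "\n", opts);
        text = Regex.Replace(text, @"<li\b[^>]*>", "\n- ", opts);
        // Strip every remaining tag, then decode entities such as &amp;.
        text = Regex.Replace(text, @"<[^>]+>", "");
        text = WebUtility.HtmlDecode(text);
        // Remove stray spaces around the line breaks we inserted.
        text = Regex.Replace(text, @" *\n *", "\n");
        return text.Trim();
    }
}
```

For the input "<p>Hello<br/>world</p><ul><li>one</li><li>two</li></ul>" this yields "Hello", "world", an empty line, "- one" and "- two", each on its own line.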
Answer 8 (score: 2)
I had some decoding issues with HtmlAgility and didn't want to spend time investigating them.
Instead I used that utility from the Microsoft Team Foundation API:
var text = HtmlFilter.ConvertToPlainText(htmlContent);
Answer 9 (score: 0)
Here is a short, sweet answer using HtmlAgilityPack. You can run this in LinqPad.
var html = "<div>..whatever html</div>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var plainText = doc.DocumentNode.InnerText;
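I simply use HtmlAgilityPack in any .NET project that needs HTML parsing. It's simple, reliable, and fast.
Answer 10 (score: 0)
This function converts "what you see in the browser" to plain text with line breaks. (If you want to view the result in a browser, just use the commented-out return value instead.)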
// Requires: using System.IO; using System.Windows.Forms;
public string HtmlFileToText(string filePath)
{
    using (var browser = new WebBrowser())
    {
        string text = File.ReadAllText(filePath);
        browser.ScriptErrorsSuppressed = true;
        browser.Navigate("about:blank");
        browser?.Document?.OpenNew(false);
        browser?.Document?.Write(text);
        return browser.Document?.Body?.InnerText;
        //return browser.Document?.Body?.InnerText.Replace(Environment.NewLine, "<br />");
    }
}
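Answer 11 (score: 0)
I have used Detagger in the past. It does a pretty good job of formatting the HTML as text and is more than just a tag remover.
Answer 12 (score: 0)
Another post suggests the HTML agility pack:
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).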
Answer 13 (score: -1)
I have recently blogged on a solution that worked for me: transforming the HTML source with a Markdown XSLT file. The HTML source will of course need to be valid XML first.
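Answer 14 (score: -1)
Try this simple, easy-to-use way; just call the following method:
// Requires a reference to mshtml (Microsoft.mshtml) and using System.Windows.Forms;
public string StripHTML(WebBrowser webp)
{
    try
    {
        // Assumption: the original snippet never declares doc; with the
        // WinForms WebBrowser it would presumably be obtained like this.
        IHTMLDocument2 doc = webp.Document.DomDocument as IHTMLDocument2;
        doc.execCommand("SelectAll", true, null);
        IHTMLSelectionObject currentSelection = doc.selection;
        if (currentSelection != null)
        {
            IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
            if (range != null)
            {
                currentSelection.empty();
                return range.text;
            }
        }
    }
    catch (Exception ep)
    {
        //MessageBox.Show(ep.Message);
    }
    return "";
}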
Answer 15 (score: -1)
I don't know C#, but there is a fairly small and easy-to-read Python html2txt script here: http://www.aaronsw.com/2002/html2text/
Answer 16 (score: -2)
In Genexus you can do it with Regex:
&pattern = '<[^>]+>'
&TSTRPNOT = &TSTRPNOT.ReplaceRegEx(&pattern, "")
Answer 17 (score: -3)
You can use the WebBrowser control to render your HTML content in memory. After the LoadCompleted event has fired...
IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document;
string innerHTML = htmlDoc.body.innerHTML;
string innerText = htmlDoc.body.innerText;
Answer 18 (score: -3)
If you are using .NET Framework 4.5, you can use System.Net.WebUtility.HtmlDecode(), which takes an HTML-encoded string and returns the decoded string.
Documented on MSDN at: http://msdn.microsoft.com/en-us/library/system.net.webutility.htmldecode(v=vs.110).aspx
You can use this in Windows Store apps as well.
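A small illustration of what HtmlDecode does and does not do; the wrapper class DecodeDemo is made up for this sketch. Note that it only decodes character entities: it does not remove tags or preserve layout, so on its own it does not answer the original question.

```csharp
using System.Net;

public static class DecodeDemo
{
    // Decodes HTML character entities; any tags in the input are left untouched.
    public static string Decode(string encoded)
    {
        return WebUtility.HtmlDecode(encoded);
    }
}
```

For example, DecodeDemo.Decode("&lt;b&gt;Tom &amp; Jerry&lt;/b&gt;") returns "<b>Tom & Jerry</b>", which still contains the b tags.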
Answer 19 (score: -4)
Here is another solution for converting HTML to text or RTF in C#:
SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
h.OutputFormat = HtmlToRtf.eOutputFormat.TextUnicode;
string text = h.ConvertString(htmlString);
This library is not free; it is a commercial product, and it is my own product.