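I'm writing a utility that exports Evernote notes to Outlook on a schedule. The Outlook API needs plain text, while Evernote outputs an XHTML document version of the plain-text note. What I need is to strip all the tags from, and unescape, the source XHTML document embedded in the Evernote export file.
Basically, I need to turn this: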
<note>
<title>Test Sync Note 1</title>
<content>
<![CDATA[ <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml.dtd">
<en-note bgcolor="#FFFFFF">
<div>Test Sync Note 1</div>
<div>This i has some text in it</div>
<div> </div>
<div> </div>
<div>and a second line</div>
</en-note>
]]>
</content>
<created>20081028T045727Z</created>
<updated>20081028T051346Z</updated>
<tag>Test</tag>
</note>
into this:
Test Sync Note 1 This i has some text in it and a second line
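I can easily parse out the CDATA section and get the 4 lines of text, but I need a reliable way to strip the divs, unescape the text, and handle any extra HTML that may have snuck in.
I assume some combination of MS APIs can do this job, but I don't know which.
Answer 0: (score: 1)
I would use a regular expression to strip all the HTML tags. This one is very basic; if it doesn't work properly, I'm sure you can tweak it.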
Regex.Replace("<div>your html in here</div>", @"<(.|\n)*?>", string.Empty);
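The regex approach above only removes tags; the question also asks for unescaping. A minimal sketch of both steps follows, assuming a reasonably recent .NET with top-level statements. The helper name `StripTags` is hypothetical, and the pattern `<[^>]+>` is a slightly tightened variant of the one in the answer, not the answer's own. It will still misfire on attribute values that contain `>`.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the original answer):
// strip tags, decode entities, then collapse whitespace.
static string StripTags(string html)
{
    // Replace each tag with a space so text from adjacent elements does not run together.
    string noTags = Regex.Replace(html, @"<[^>]+>", " ");
    // Turn entities such as &amp; and &lt; back into plain characters.
    string decoded = WebUtility.HtmlDecode(noTags);
    // Collapse runs of whitespace (including newlines) into single spaces.
    return Regex.Replace(decoded, @"\s+", " ").Trim();
}

Console.WriteLine(StripTags("<div>Test Sync Note 1</div>\n<div>a &amp; b</div>"));
// prints "Test Sync Note 1 a & b"
```

Decoding after stripping matters: if the entities were decoded first, an escaped `&lt;b&gt;` in the note text would turn into a tag and be removed.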
Answer 1: (score: 1)
You could also use an XSLT transformation to convert the XML into a text document.
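As a rough illustration of the XSLT suggestion, the sketch below runs a small stylesheet with `XslCompiledTransform` and `method="text"` output. The stylesheet and the helper name `XhtmlToText` are assumptions made for this example. It also assumes the DOCTYPE can simply be ignored, which means named entities defined only in the DTD (such as `&nbsp;`) would not resolve.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical stylesheet: write the text of every non-empty <div>, one per line.
const string Stylesheet = @"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""text""/>
  <xsl:template match=""/"">
    <xsl:for-each select=""//div[normalize-space(.)]"">
      <xsl:value-of select=""normalize-space(.)""/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>";

// Hypothetical helper: apply the stylesheet to an XHTML string and return the text output.
static string XhtmlToText(string xhtml, string stylesheet)
{
    var xslt = new XslCompiledTransform();
    using (var xsltReader = XmlReader.Create(new StringReader(stylesheet)))
        xslt.Load(xsltReader);

    // Skip the DOCTYPE instead of rejecting it or fetching the DTD.
    var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Ignore };
    using (var input = XmlReader.Create(new StringReader(xhtml), settings))
    using (var output = new StringWriter())
    {
        xslt.Transform(input, null, output);
        return output.ToString();
    }
}

string enml = "<!DOCTYPE en-note SYSTEM \"http://xml.evernote.com/pub/enml.dtd\">"
            + "<en-note><div>Test Sync Note 1</div><div> </div><div>and a second line</div></en-note>";
Console.Write(XhtmlToText(enml, Stylesheet));
```

The two non-empty divs come out as separate lines of text; the whitespace-only div is filtered out by the `normalize-space(.)` predicate.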
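Answer 2: (score: 1)
You could use the HTML Agility Pack.
Answer 3: (score: 0)
As far as I know, nothing does this specific job out of the box, but you may want to look at using XSLT or navigating the document with IXPathNavigable.
Answer 4: (score: 0)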
using System;
using System.IO;
using System.Xml.XPath;

string xml = @"<note>
<title>Test Sync Note 1</title>
<content>
<![CDATA[ <?xml version=""1.0"" encoding=""UTF-8""?>
<!DOCTYPE en-note SYSTEM ""http://xml.evernote.com/pub/enml.dtd"">
<en-note bgcolor=""#FFFFFF"">
<div>Test Sync Note 1</div>
<div>This i has some text in it</div>
<div> </div>
<div> </div>
<div>and a second line</div>
</en-note>
]]>
</content>
<created>20081028T045727Z</created>
<updated>20081028T051346Z</updated>
<tag>Test</tag>
</note>
";
XPathDocument doc = new XPathDocument(new StringReader(xml));
XPathNavigator nav = doc.CreateNavigator();

// Compile a standard XPath expression
XPathExpression expr;
expr = nav.Compile("/note/content");
XPathNodeIterator iterator = nav.Select(expr);

// Iterate on the node set
try
{
    while (iterator.MoveNext())
    {
        // Get the XML in the CDATA
        XPathNavigator nav2 = iterator.Current.Clone();
        XPathDocument doc2 = new XPathDocument(new StringReader(nav2.Value.Trim()));

        // Parse the XML in the CDATA
        XPathNavigator nav3 = doc2.CreateNavigator();
        expr = nav3.Compile("/en-note");
        XPathNodeIterator iterator2 = nav3.Select(expr);
        iterator2.MoveNext();
        XPathNavigator nav4 = iterator2.Current.Clone();

        // Output the value directly, does not preserve the formatting
        Console.WriteLine("Direct Try:");
        Console.WriteLine(nav4.Value);

        // This works, but is ugly
        Console.WriteLine("Ugly Try:");
        Console.WriteLine(nav4.InnerXml.Replace("<div>", "").Replace("</div>", Environment.NewLine));
    }
}
catch (Exception ex)
{
    Console.WriteLine(ex.Message);
}
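The same two-stage idea (parse the export, then parse the XHTML inside the CDATA) can also be sketched with LINQ to XML, which avoids the string replacement on `InnerXml`. This is only an illustration: the helper name `NoteToPlainText` is hypothetical, it assumes each line of the note sits in a top-level `<div>` as in the sample, and it ignores the DOCTYPE rather than loading the DTD.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper: pull the ENML out of <content> and join the non-empty div texts.
static string NoteToPlainText(string exportXml)
{
    // The CDATA section is exposed as the text value of <content>.
    XElement note = XElement.Parse(exportXml);
    // Trim so the inner XML declaration is the very first thing the parser sees.
    string enml = ((string)note.Element("content") ?? "").Trim();

    // Skip the DOCTYPE instead of rejecting it or fetching the DTD.
    var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Ignore };
    using (XmlReader reader = XmlReader.Create(new StringReader(enml), settings))
    {
        XDocument inner = XDocument.Load(reader);
        // Element values are already unescaped; drop the whitespace-only divs.
        var lines = inner.Root.Elements("div")
            .Select(d => d.Value.Trim())
            .Where(t => t.Length > 0);
        return string.Join(" ", lines);
    }
}

string sample = @"<note>
<title>Test Sync Note 1</title>
<content>
<![CDATA[ <?xml version=""1.0"" encoding=""UTF-8""?>
<!DOCTYPE en-note SYSTEM ""http://xml.evernote.com/pub/enml.dtd"">
<en-note bgcolor=""#FFFFFF"">
<div>Test Sync Note 1</div>
<div>This i has some text in it</div>
<div> </div>
<div>and a second line</div>
</en-note>
]]>
</content>
</note>";

Console.WriteLine(NoteToPlainText(sample));
// prints "Test Sync Note 1 This i has some text in it and a second line"
```

Joining with `Environment.NewLine` instead of a space would keep one line per div, if that suits the Outlook side better.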