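I'm trying to parse the following data into CSV. From what I've read, the best approach seems to be HAP (Html Agility Pack), since it has a robust parser.
At the moment, the WPF WebBrowser control's content is accessed like this: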
dynamic doc = this.wbControl.Document;
Content
<div class="content">
<fieldset>
<ul class="fieldsetr">
<li class="row medium">
<div class="field">
<div class="shell">
<em class="disable">Sender:</em>
</div>
</div>
<div>
<div class="clip">
<em>me@example.com</em>
</div>
</div>
</li>
<li class="row medium alt">
<div class="field">
<div class="shell">
<em class="disable">Recipient:</em>
</div>
</div>
<div>
<div class="clip">
<em>me2@example2.com</em>
</div>
</div>
</li>
<li class="row medium">
<div class="field">
<div class="shell">
<em class="disable">Message ID:</em>
</div>
</div>
<div>
<div class="clip">
<em>2342342345235</em>
</div>
</div>
</li>
<li class="row medium alt">
<div class="field">
<div class="shell">
<em class="disable">Message size:</em>
</div>
</div>
<div>
<div class="clip">
<em>18.74 KB
</em>
</div>
</div>
</li>
<li class="row medium">
<div class="field">
<div class="shell">
<em class="disable">Date and time received:</em>
</div>
</div>
<div>
<div class="clip">
<em>11/27/2012 6:17:22 AM</em>
</div>
</div>
</li>
<li class="row medium alt">
<div class="field">
<div class="shell">
<em class="disable">Date and time filtered:</em>
</div>
</div>
<div>
<div class="clip">
<em>11/27/2012 6:17:22 AM</em>
</div>
</div>
</li>
<li class="row medium">
<!-- Connector Details -->
</li>
<li class="row medium alt">
<div class="field">
<div class="shell">
<em class="disable">First delivery attempt:</em>
</div>
</div>
<div>
<div class="clip">
<em>11/27/2012 6:17:23 AM</em>
</div>
</div>
</li>
<li class="row medium">
<div class="field">
<div class="shell">
<em class="disable">Final delivery attempt:</em>
</div>
</div>
<div>
<div class="clip">
<em>11/27/2012 6:17:23 AM</em>
</div>
</div>
</li>
<li class="row medium alt">
<div class="field">
<div class="shell">
<em class="disable">From IP address:</em>
</div>
</div>
<div>
<div class="clip">
<em>1.2.3.4 <unknown></em>
</div>
</div>
</li>
<li class="row medium">
<div class="field">
<div class="shell">
<em class="disable">To IP address:</em>
</div>
</div>
<div>
<div class="clip">
<em>4.3.2.1 <mail.example2.com> </em>
</div>
</div>
</li>
<li class="row medium alt">
<div class="field">
<div class="shell">
<em class="disable">Filtering results:</em>
</div>
</div>
<div>
<div class="clip">
<em>Passed Filtering</em>
</div>
</div>
</li>
<li class="row medium">
<div class="field">
<div class="shell">
<em class="disable">Delivery result:</em>
</div>
</div>
<div>
<div class="clip">
<span><em>Delivered: 470 2.4.0 <2342342345235> [InternalId=2321233] Queued mail for delivery</em></span>
</div>
</div>
</li>
</ul>
</fieldset>
</div>
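What is the best way to convert this data? This is just one record, but more records will be added.
Edit
I ended up testing with the following code: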
HtmlAgilityPack.HtmlDocument docHAP = new HtmlAgilityPack.HtmlDocument();
docHAP.LoadHtml(doc.Body.InnerHtml.ToString());
foreach(HtmlNode emNode in docHAP.DocumentNode.SelectNodes("//em"))
{
MessageBox.Show(emNode.InnerText.ToString());
}
If anyone has a more efficient solution, please feel free to let me know.
Answer 0 (score: 1)
Yes, use the Html Agility Pack. It is an open-source HTML parser for .NET.
What is Html Agility Pack (HAP)?
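This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't have to understand XPATH or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to the one System.Xml offers, but for HTML documents (or streams).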
You can use it to query the HTML and extract whatever data you want.
With nothing more than XPath, you can get at any specific element, attribute, or text data.
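HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
// LoadHtml expects a string, so read the markup from the WebBrowser document first
// (same dynamic access as in the question)
dynamic browserDoc = this.wbControl.Document;
doc.LoadHtml((string)browserDoc.Body.InnerHtml);
// Note: SelectNodes returns null (not an empty collection) when nothing matches
// get all the 'em' tags from HTML
foreach (HtmlNode emNode in doc.DocumentNode.SelectNodes("//em"))
{
    if (emNode.Attributes["class"] != null)
    {
        var value = emNode.Attributes["class"].Value;
    }
}
// get all the 'em' tags where the 'class' attribute value is 'disable' from HTML
foreach (HtmlNode emNode in doc.DocumentNode.SelectNodes("//em[@class='disable']"))
{
    // ...
}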
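Since the goal in the question is CSV, here is a rough sketch of how the label/value pairs could be pulled out per `li` row and turned into CSV lines. It assumes every data row has one `em` with `class="disable"` (the label) and one `em` without a class (the value), as in the posted markup. The names `MessageDetailsCsv`, `ExtractFields`, `Escape`, and `ToCsvLine` are made up for this illustration and are not part of HAP. Only `HtmlDocument.LoadHtml`, `SelectNodes`, `SelectSingleNode`, `InnerText`, and `HtmlEntity.DeEntitize` come from the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HAP: extracts label/value pairs and builds CSV lines.
public static class MessageDetailsCsv
{
    // Returns one (label, value) pair per <li> row that contains both parts.
    public static List<KeyValuePair<string, string>> ExtractFields(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var fields = new List<KeyValuePair<string, string>>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//li[contains(@class, 'row')]");
        if (rows == null)
        {
            return fields;
        }

        foreach (HtmlNode row in rows)
        {
            // Assumption: the label is the <em class="disable">, the value is the <em> with no class.
            HtmlNode label = row.SelectSingleNode(".//em[@class='disable']");
            HtmlNode value = row.SelectSingleNode(".//em[not(@class)]");
            if (label == null || value == null)
            {
                continue; // e.g. the row that only holds an HTML comment
            }

            fields.Add(new KeyValuePair<string, string>(
                Clean(label.InnerText).TrimEnd(':'),
                Clean(value.InnerText)));
        }

        return fields;
    }

    // Decodes HTML entities and trims surrounding whitespace and newlines.
    private static string Clean(string text)
    {
        return HtmlEntity.DeEntitize(text).Trim();
    }

    // Wraps a field in quotes and doubles any embedded quotes.
    public static string Escape(string field)
    {
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Joins already-extracted values into one CSV line.
    public static string ToCsvLine(IEnumerable<string> values)
    {
        return string.Join(",", values.Select(Escape));
    }
}
```

To use it, call `MessageDetailsCsv.ExtractFields(html)` once per record, write `ToCsvLine(fields.Select(f => f.Key))` as the header the first time, and append `ToCsvLine(fields.Select(f => f.Value))` for each record. Quoting every field keeps values such as the dates or the delivery result safe if they ever contain commas.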