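HTML Scrape Table to CSV (using HAP?)

Asked: 2012-11-27 19:49:43

Tags: c# html-agility-pack

I am trying to parse the following data into a CSV file. From what I have read, the best approach seems to be HAP (Html Agility Pack), since it has a robust parser.

At the moment, the content of the WPF WebBrowser control is accessed like this: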

dynamic doc = this.wbControl.Document;

The content:

        <div class="content">
                <fieldset>
                    <ul class="fieldsetr">
                        <li class="row medium">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">Sender:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <em>me@example.com</em>
                                </div>
                            </div>
                        </li>
                        <li class="row medium alt">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">Recipient:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <em>me2@example2.com</em>
                                </div>
                            </div>
                        </li>
                        <li class="row medium">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">Message ID:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <em>2342342345235</em>
                                </div>
                            </div>
                        </li>
                        <li class="row medium alt">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">Message size:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <em>18.74 KB
                                    </em>
                                </div>
                            </div>
                        </li>
                        <li class="row medium">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">Date and time received:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <em>11/27/2012 6:17:22 AM</em>
                                </div>
                            </div>
                        </li>
                        <li class="row medium alt">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">Date and time filtered:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <em>11/27/2012 6:17:22 AM</em>
                                </div>
                            </div>
                        </li>
                        <li class="row medium">
                            <!-- Connector Details -->

                        </li>                            
                        <li class="row medium alt">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">First delivery attempt:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <em>11/27/2012 6:17:23 AM</em>
                                </div>
                            </div>
                        </li>
                        <li class="row medium">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">Final delivery attempt:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <em>11/27/2012 6:17:23 AM</em>
                                </div>
                            </div>
                        </li>
                        <li class="row medium alt">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">From IP address:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <em>1.2.3.4 &lt;unknown&gt;</em>
                                </div>
                            </div>
                        </li>
                        <li class="row medium">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">To IP address:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <em>4.3.2.1 &lt;mail.example2.com&gt; </em>
                                </div>
                            </div>
                        </li>
                        <li class="row medium alt">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">Filtering results:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <em>Passed Filtering</em>
                                </div>
                            </div>
                        </li>
                        <li class="row medium">
                            <div class="field">
                                <div class="shell">
                                    <em class="disable">Delivery result:</em>
                                </div>
                            </div>
                            <div>
                                <div class="clip">
                                    <span><em>Delivered: 470 2.4.0 &lt;2342342345235&gt; [InternalId=2321233] Queued mail for delivery</em></span>
                                </div>
                            </div>
                        </li>
                    </ul>
                </fieldset>
        </div>

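What is the best way to convert this data? This is just one record, but more records will be added.

Edit

I ended up testing with the following code: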

            HtmlAgilityPack.HtmlDocument docHAP = new HtmlAgilityPack.HtmlDocument();
            docHAP.LoadHtml(doc.Body.InnerHtml.ToString());

            foreach(HtmlNode emNode in docHAP.DocumentNode.SelectNodes("//em"))
            {
                MessageBox.Show(emNode.InnerText.ToString());
            }

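If anyone has a more efficient solution, please let me know.

1 answer:

Answer 0 (score: 1)

Yes, use the HTML Agility Pack. It is an open-source HTML parser for .NET.

What is Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't have to understand XPATH or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).

You can use it to query the HTML and extract whatever data you want.

Using XPath alone, you can get at any specific element, attribute, or text data.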

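HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
// LoadHtml expects a string, so pass the markup from the WebBrowser DOM (as in the question)
dynamic browserDoc = this.wbControl.Document;
doc.LoadHtml((string)browserDoc.Body.InnerHtml);

// get all the 'em' tags from HTML (the root node is DocumentNode)
foreach (HtmlNode emNode in doc.DocumentNode.SelectNodes("//em"))
{
    if (emNode.Attributes["class"] != null)
    {
        var value = emNode.Attributes["class"].Value;
    }
}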

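// get all the `em` tags where the 'class' attribute value is 'disable' (the label cells in the markup above)
// SelectNodes returns null when nothing matches, so check before iterating
var labelNodes = doc.DocumentNode.SelectNodes("//em[@class='disable']");
if (labelNodes != null)
{
    foreach (HtmlNode emNode in labelNodes)
    {
        // ...
    }
}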
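
To get from the `em` nodes to an actual CSV line, one option is to walk each `li` row, pair the label `em` (class `disable`) with the value `em` inside the `clip` div, and join the values with commas. The following is only a sketch under assumptions: the class `RecordCsv` and its methods are hypothetical names, not part of Html Agility Pack, the XPath expressions assume every record uses the same markup as the sample in the question, and the quoting follows the common RFC 4180 convention.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of Html Agility Pack.
public static class RecordCsv
{
    // Returns (label, value) pairs for one record, in document order.
    public static List<KeyValuePair<string, string>> ExtractFields(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var fields = new List<KeyValuePair<string, string>>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//li[contains(@class,'row')]");
        if (rows == null) return fields;

        foreach (HtmlNode row in rows)
        {
            HtmlNode label = row.SelectSingleNode(".//em[@class='disable']");
            HtmlNode value = row.SelectSingleNode(".//div[@class='clip']//em");

            // Skip rows without both parts, e.g. the empty "Connector Details" row.
            if (label == null || value == null) continue;

            fields.Add(new KeyValuePair<string, string>(
                Clean(label.InnerText).TrimEnd(':'),
                Clean(value.InnerText)));
        }
        return fields;
    }

    // Decodes entities such as &lt; and trims surrounding whitespace.
    private static string Clean(string text)
    {
        return HtmlEntity.DeEntitize(text).Trim();
    }

    // Quotes a field if it contains a comma, a double quote, or a line break.
    public static string Escape(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) < 0) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Joins already-extracted cells into one CSV line.
    public static string ToCsvLine(IEnumerable<string> cells)
    {
        return string.Join(",", cells.Select(Escape));
    }
}
```

With this in place, `RecordCsv.ToCsvLine(fields.Select(f => f.Key))` could be written once as a header, and `RecordCsv.ToCsvLine(fields.Select(f => f.Value))` appended for each record as more are added. Keeping the extraction separate from the CSV formatting means the XPath can change without touching the quoting logic.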