Agnostic screen scraper using HtmlAgilityPack

Date: 2015-03-11 17:29:11

Tags: html-agility-pack

Suppose I want a screen scraper that does not care whether you pass it the URL of an HTML page, the URL of an XML document, or the URL of a text file.

Examples:

http://tonto.eia.doe.gov/oog/info/wohdp/dslpriwk.txt

http://google.com

This works if the page is HTML or a text file:

public class ScreenScrapingService : IScreenScrapingService
{
    public XDocument Scrape(string url)
    {
        var scraper = new HtmlWeb();
        var stringWriter = new StringWriter();
        var xml = new XmlTextWriter(stringWriter);
        scraper.LoadHtmlAsXml(url, xml);
        var text = stringWriter.ToString();
        return XDocument.Parse(text);
    }
}

However, if it is an XML file, for example:

http://www.eia.gov/petroleum/gasdiesel/includes/gas_diesel_rss.xml

[Test]
public void Scrape_ShouldScrapeSomething()
{
    //arrange
    var sut = new ScreenScrapingService();

    //act
    var result = sut.Scrape("http://www.eia.gov/petroleum/gasdiesel/includes/gas_diesel_rss.xml");

    //assert

}

then I get this error:

 An exception of type 'System.Xml.XmlException' occurred in System.Xml.dll but was not handled in user code

Is it possible to write this so that it does not care what the URL ends up pointing to?

1 Answer:

Answer 0 (score: 1)

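To see the exact exception in Visual Studio, press Ctrl+Alt+E and enable Common Language Runtime Exceptions. It looks like LoadHtmlAsXml expects HTML, so your best option is to use WebClient.DownloadString(url) together with an HtmlDocument whose OptionOutputAsXml property is set to true, and to catch the exception when parsing fails.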
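
The answer's code sample did not survive in this copy, so what follows is only a minimal sketch of the approach it describes, not the original code. It assumes the HtmlAgilityPack package is referenced. ParseContent is a hypothetical helper name, and trying XDocument.Parse first and falling back to HtmlDocument only when it throws is one possible ordering, chosen here as an assumption. The IScreenScrapingService interface from the question is left out to keep the sketch self-contained.

```csharp
using System.IO;
using System.Net;
using System.Xml;
using System.Xml.Linq;
using HtmlAgilityPack;

public class ScreenScrapingService
{
    public XDocument Scrape(string url)
    {
        // Download the raw content without assuming what kind of document it is.
        string content;
        using (var client = new WebClient())
        {
            content = client.DownloadString(url);
        }
        return ParseContent(content);
    }

    // Hypothetical helper: split out so the parsing can be exercised without a network call.
    public static XDocument ParseContent(string content)
    {
        try
        {
            // Well-formed XML (for example an RSS feed) parses directly.
            return XDocument.Parse(content);
        }
        catch (XmlException)
        {
            // Not well-formed XML: treat it as HTML or plain text and let
            // HtmlAgilityPack serialize it as XML.
            var doc = new HtmlDocument { OptionOutputAsXml = true };
            doc.LoadHtml(content);
            using (var writer = new StringWriter())
            {
                doc.Save(writer);
                return XDocument.Parse(writer.ToString());
            }
        }
    }
}
```

The second XDocument.Parse call can still throw for some inputs, so a caller that needs to survive arbitrary URLs would probably want its own handling around Scrape as well.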