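How do I get the HTML page output from the Abot C# web crawler?

Time: 2013-09-12 15:08:04

Tags: c# web-crawler

I am trying to build a web crawler in C# using Abot. I searched through many examples and added the Abot web crawler to my project, but all I get from it is log output, not the HTML of the pages. I only want the HTML page output, because that HTML is the input to Html Agility Pack. Please help me get the HTML output from the Abot web crawler in C#. Thanks.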

3 answers:

Answer 0 (score: 8)

This is explained on the quickstart page:

//Create an instance of the crawler and subscribe to the PageCrawlCompleted event
PoliteWebCrawler crawler = new PoliteWebCrawler();
crawler.PageCrawlCompletedAsync += crawler_ProcessPageCrawlCompleted;

//The event handler method
void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;

    if (crawledPage.WebException != null || crawledPage.HttpWebResponse.StatusCode != HttpStatusCode.OK)
        Console.WriteLine("Crawl of page failed {0}", crawledPage.Uri.AbsoluteUri);
    else
        Console.WriteLine("Crawl of page succeeded {0}", crawledPage.Uri.AbsoluteUri);


    //crawledPage.Content.Text //raw html
    //crawledPage.HtmlDocument //lazy loaded html agility pack object (HtmlAgilityPack.HtmlDocument)
    //crawledPage.CSDocument   //lazy loaded cs query object (CsQuery.Cq)
}
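
For context, here is a minimal sketch of a complete console program built around the handler above, feeding the raw HTML into Html Agility Pack as the question asks. It assumes an Abot 1.x release (namespaces `Abot.Crawler` and `Abot.Poco`, synchronous `Crawl`) and the HtmlAgilityPack NuGet package; the start URL and the `//title` XPath are placeholders chosen for illustration, not part of the original answer.

```csharp
using System;
using System.Net;
using Abot.Crawler;
using Abot.Poco;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Parameterless constructor as in the answer above; where the
        // configuration comes from may depend on the Abot version.
        PoliteWebCrawler crawler = new PoliteWebCrawler();
        crawler.PageCrawlCompletedAsync += OnPageCrawlCompleted;

        // Placeholder start URL (assumption).
        CrawlResult result = crawler.Crawl(new Uri("http://example.com/"));
        if (result.ErrorOccurred)
            Console.WriteLine("Crawl of {0} completed with error: {1}",
                result.RootUri.AbsoluteUri, result.ErrorException.Message);
    }

    static void OnPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
    {
        CrawledPage crawledPage = e.CrawledPage;

        // Skip pages whose request failed or did not return 200 OK.
        if (crawledPage.WebException != null
            || crawledPage.HttpWebResponse == null
            || crawledPage.HttpWebResponse.StatusCode != HttpStatusCode.OK)
            return;

        // The raw HTML of the page, as a string.
        string html = crawledPage.Content.Text;
        if (string.IsNullOrEmpty(html))
            return;

        // Hand the raw HTML to Html Agility Pack and query it.
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");

        Console.WriteLine("{0} -> {1}",
            crawledPage.Uri.AbsoluteUri,
            title != null ? title.InnerText.Trim() : "(no title)");
    }
}
```

If the version in use exposes `crawledPage.HtmlDocument`, as the comments above indicate, that property can replace the explicit `LoadHtml` call.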

Answer 1 (score: 1)

void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;
    string html = crawledPage.Content.Text; // raw HTML of the crawled page
}
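
If the goal is to keep the HTML rather than only read it inside the handler, one option is to write each page to disk. The helper below is a sketch using only the .NET base class library; `HtmlFileSink`, `ToFileName`, and `Save` are hypothetical names introduced here, not part of Abot.

```csharp
using System;
using System.IO;
using System.Text;

static class HtmlFileSink
{
    // Builds a file name from the URI's host and path by replacing every
    // character that is not a letter or digit with '_'.
    // The query string is ignored, so URLs that differ only in their
    // query map to the same file.
    public static string ToFileName(Uri uri)
    {
        var sb = new StringBuilder();
        foreach (char c in uri.Host + uri.AbsolutePath)
            sb.Append(char.IsLetterOrDigit(c) ? c : '_');
        return sb.ToString().Trim('_') + ".html";
    }

    // Writes the HTML to <directory>/<file name> as UTF-8 and returns the path.
    public static string Save(string directory, Uri uri, string html)
    {
        Directory.CreateDirectory(directory);
        string path = Path.Combine(directory, ToFileName(uri));
        File.WriteAllText(path, html, Encoding.UTF8);
        return path;
    }
}
```

Inside the handler this could be called as `HtmlFileSink.Save("pages", crawledPage.Uri, crawledPage.Content.Text);`, where `"pages"` is an arbitrary output directory.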

Answer 2 (score: 0)

To get the HTML page, just use:

crawledPage.Content.Text

inside the function

`static void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)`

For example:

static void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;

    if (crawledPage.WebException != null || crawledPage.HttpWebResponse.StatusCode != HttpStatusCode.OK)
        Console.WriteLine("Crawl of page failed {0}", crawledPage.Uri.AbsoluteUri);
    else
        Console.WriteLine("Crawl of page succeeded {0}", crawledPage.Uri.AbsoluteUri);

    if (string.IsNullOrEmpty(crawledPage.Content.Text))
        Console.WriteLine("Page had no content {0}", crawledPage.Uri.AbsoluteUri);

    var htmlAgilityPackDocument = crawledPage.HtmlDocument; // Html Agility Pack parser
    var angleSharpHtmlDocument = crawledPage.AngleSharpHtmlDocument; // AngleSharp parser

    // Print the raw HTML; Content itself is an object, so use its Text property
    Console.WriteLine(crawledPage.Content.Text);
}