Cancelling a site crawl with Abot

Date: 2019-12-06 14:59:59

Tags: c# web-crawler abot

I have a list of domains that I'm crawling with Abot. The goal is that as soon as the crawler finds an Amazon link on a site, it should quit that site and move on to the next one. But I can't see how to abort the crawl of a site. For example:

https://github.com/sjdirect/abot

static void Main(string[] args)
{
    var domains = new List<string> { "http://domain1", "http://domain2" };

    foreach (string domain in domains)
    {
        var config = new CrawlConfiguration
        {
            MaxPagesToCrawl = 100,
            MinCrawlDelayPerDomainMilliSeconds = 3000
        };

        var crawler = new PoliteWebCrawler(config);

        crawler.PageCrawlCompleted += PageCrawlCompleted;
        var uri = new Uri(domain);
        var crawlResult = crawler.Crawl(uri);
    }
}

private static void PageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    var crawledPage = e.CrawledPage;
    var crawlContext = e.CrawlContext;

    var document = crawledPage.AngleSharpHtmlDocument;
    var anchors = document.QuerySelectorAll("a").OfType<IHtmlAnchorElement>();
    var hrefs = anchors.Select(x => x.Href).ToList();

    var regEx = new Regex(@"https?:\/\/(www|smile)\.amazon(\.co\.uk|\.com).*");
    var resultList = hrefs.Where(f => regEx.IsMatch(f)).ToList();

    if (resultList.Any())
    {
        //NEED TO EXIT THE SITE CRAWL HERE
    }

}
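
The Amazon-link filter in the handler can be exercised on its own, independent of Abot. A minimal sketch of the same regex and LINQ filtering, run against hypothetical hrefs such as AngleSharp might return:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class AmazonLinkFilter
{
    // Same pattern as in PageCrawlCompleted above
    static readonly Regex AmazonRegex =
        new Regex(@"https?:\/\/(www|smile)\.amazon(\.co\.uk|\.com).*");

    public static List<string> FilterAmazonLinks(IEnumerable<string> hrefs) =>
        hrefs.Where(f => AmazonRegex.IsMatch(f)).ToList();

    static void Main()
    {
        // Hypothetical hrefs collected from a crawled page
        var hrefs = new[]
        {
            "https://www.amazon.co.uk/dp/B00EXAMPLE",
            "https://smile.amazon.com/gp/product/B01EXAMPLE",
            "https://example.org/not-amazon"
        };
        Console.WriteLine(AmazonLinkFilter.FilterAmazonLinks(hrefs).Count); // prints 2
    }
}
```

Note the pattern is unanchored, so `Regex.IsMatch` matches these URLs anywhere in the string; anchor with `^` if the hrefs might contain Amazon URLs as query parameters.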

2 Answers:

Answer 0 (score: 2)

I'd suggest the following...

var myCancellationToken = new CancellationTokenSource();
crawler.CrawlAsync(someUri, myCancellationToken);

private static void PageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    //More performant (since the parsing has already been done by Abot)
    var hasAmazonLinks = e.CrawledPage.ParsedLinks
      .Any(hl => hl.HrefValue.AbsoluteUri
         .ToLower()
         .Contains("amazon.com"));

    if (hasAmazonLinks)
    {
        //LOG SOMETHING BEFORE YOU STOP THE CRAWL!!!!!

        //Option A: Preferred method, Will clear all scheduled pages and cancel any threads that are currently crawling
        myCancellationToken.Cancel();

        //Option B: Same result as option A but no need to do anything with tokens. Not the preferred method. 
        e.CrawlContext.IsCrawlHardStopRequested = true;

        //Option C: Will clear all scheduled pages but will allow any threads that are currently crawling to complete. No cancellation tokens needed. Consider it a soft stop to the crawl.
        e.CrawlContext.IsCrawlStopRequested = true;
    }
}
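
One caveat with the snippet above: `PageCrawlCompleted` is static, so `myCancellationToken` must be reachable from it; declaring it as a local in `Main` won't compile. A static field is one simple way to share it. This is a sketch with hypothetical names and the Abot calls omitted, showing only the token plumbing:

```csharp
using System;
using System.Threading;

static class CrawlStopSketch
{
    // Shared with the static event handler; in the real program this is the
    // same CancellationTokenSource passed to crawler.CrawlAsync(someUri, ...)
    public static CancellationTokenSource MyCancellationToken = new CancellationTokenSource();

    // Stand-in for the body of PageCrawlCompleted once an Amazon link is found
    public static void OnAmazonLinkFound()
    {
        // Option A from the answer: cancelling the token hard-stops the crawl
        MyCancellationToken.Cancel();
    }

    static void Main()
    {
        OnAmazonLinkFound();
        Console.WriteLine(MyCancellationToken.IsCancellationRequested); // prints True
    }
}
```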

Answer 1 (score: 0)

PoliteWebCrawler is designed to start a crawl and follow a website's URLs deeper. If you only want to fetch the content of a single URL (for example a site's home page), you can use PageRequester, which is designed for exactly that kind of job:

var pageRequester = new PageRequester(new CrawlConfiguration(), new WebContentExtractor());

var crawledPage = await pageRequester.MakeRequestAsync(new Uri("http://google.com"));
Log.Logger.Information("{result}", new
{
    url = crawledPage.Uri,
    status = Convert.ToInt32(crawledPage.HttpResponseMessage.StatusCode)
});

By the way, if you want to stop the crawler mid-crawl, you can use one of these two methods:

//1. hard crawl stop
crawlContext.CancellationTokenSource.Cancel();
//2. soft stop
crawlContext.IsCrawlStopRequested = true;
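
Abot aside, the difference between the two can be modelled in a few lines: a hard stop (the cancelled token) also interrupts in-flight work, while a soft stop only drains the remaining schedule. A toy illustration of the soft-stop flag, not Abot code:

```csharp
using System;
using System.Threading;

static class StopModes
{
    // Models a crawl loop that requests a soft stop after the second page:
    // the current page finishes, but nothing further is dequeued.
    public static int CrawlWithSoftStop(string[] scheduledPages)
    {
        var cts = new CancellationTokenSource(); // hard-stop channel (unused here)
        bool isCrawlStopRequested = false;       // soft-stop flag

        int pagesCrawled = 0;
        foreach (var page in scheduledPages)
        {
            if (cts.IsCancellationRequested || isCrawlStopRequested)
                break;
            pagesCrawled++;
            if (pagesCrawled == 2)
                isCrawlStopRequested = true; // soft stop: finish this page, crawl no more
        }
        return pagesCrawled;
    }

    static void Main()
    {
        Console.WriteLine(StopModes.CrawlWithSoftStop(new[] { "a", "b", "c", "d" })); // prints 2
    }
}
```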