I have a list of domains that I crawl with Abot. The goal is that when the crawler finds an Amazon link on a site, it should quit that site and move on to the next one, but I can't see a way to exit a crawl part-way through. For example:
https://github.com/sjdirect/abot
static void Main(string[] args)
{
    var domains = new List<string> { "http://domain1", "http://domain2" };
    foreach (string domain in domains)
    {
        var config = new CrawlConfiguration
        {
            MaxPagesToCrawl = 100,
            MinCrawlDelayPerDomainMilliSeconds = 3000
        };
        var crawler = new PoliteWebCrawler(config);
        crawler.PageCrawlCompleted += PageCrawlCompleted;
        var uri = new Uri(domain);
        var crawlResult = crawler.Crawl(uri);
    }
}
private static void PageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    var crawledPage = e.CrawledPage;
    var crawlContext = e.CrawlContext;
    var document = crawledPage.AngleSharpHtmlDocument;
    var anchors = document.QuerySelectorAll("a").OfType<IHtmlAnchorElement>();
    var hrefs = anchors.Select(x => x.Href).ToList();
    var regEx = new Regex(@"https?:\/\/(www|smile)\.amazon(\.co\.uk|\.com).*");
    var resultList = hrefs.Where(f => regEx.IsMatch(f)).ToList();
    if (resultList.Any())
    {
        //NEED TO EXIT THE SITE CRAWL HERE
    }
}
Answer 0 (score: 2)
I would suggest the following...
// Note: for the handler below to be able to cancel the crawl, the token
// source needs to be stored somewhere it can reach (e.g., a static field).
var myCancellationToken = new CancellationTokenSource();
crawler.CrawlAsync(someUri, myCancellationToken);

private static void PageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    // More performant (since the parsing has already been done by Abot)
    var hasAmazonLinks = e.CrawledPage.ParsedLinks
        .Any(hl => hl.HrefValue.AbsoluteUri
            .ToLower()
            .Contains("amazon.com"));

    if (hasAmazonLinks)
    {
        // LOG SOMETHING BEFORE YOU STOP THE CRAWL!!!!!

        // Option A: Preferred method. Will clear all scheduled pages and
        // cancel any threads that are currently crawling.
        myCancellationToken.Cancel();

        // Option B: Same result as Option A, but no need to do anything with
        // tokens. Not the preferred method.
        e.CrawlContext.IsCrawlHardStopRequested = true;

        // Option C: Will clear all scheduled pages but will allow any threads
        // that are currently crawling to complete. No cancellation tokens
        // needed. Consider it a soft stop to the crawl.
        e.CrawlContext.IsCrawlStopRequested = true;
    }
}
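A minimal sketch of how this might be wired into the question's per-domain loop: `myCancellationToken` becomes a static field so the event handler can reach it, and a fresh token source is created for each domain so cancelling one site's crawl does not cancel the next. This assumes the Abot 2.x namespaces (`Abot2.Crawler`, `Abot2.Poco`), since `CrawlAsync` and `ParsedLinks` come from that version.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Abot2.Crawler;
using Abot2.Poco;

class Program
{
    // Shared so PageCrawlCompleted can cancel the current site's crawl.
    private static CancellationTokenSource myCancellationToken;

    static async Task Main(string[] args)
    {
        var domains = new List<string> { "http://domain1", "http://domain2" };
        foreach (var domain in domains)
        {
            var config = new CrawlConfiguration
            {
                MaxPagesToCrawl = 100,
                MinCrawlDelayPerDomainMilliSeconds = 3000
            };
            var crawler = new PoliteWebCrawler(config);
            crawler.PageCrawlCompleted += PageCrawlCompleted;

            // Fresh token source per domain: cancelling one site's crawl
            // must not cancel the crawl of the next site in the list.
            myCancellationToken = new CancellationTokenSource();
            await crawler.CrawlAsync(new Uri(domain), myCancellationToken);
        }
    }

    private static void PageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
    {
        var hasAmazonLinks = e.CrawledPage.ParsedLinks
            .Any(hl => hl.HrefValue.AbsoluteUri
                .ToLower()
                .Contains("amazon.com"));

        if (hasAmazonLinks)
        {
            // Stop this site's crawl; Main then moves on to the next domain.
            myCancellationToken.Cancel();
        }
    }
}
```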
Answer 1 (score: 0)
PoliteWebCrawler is designed to start a crawl and dig deeper into the site's URLs. If you only want the content of a single URL (for example a site's front page), you can use PageRequester, which is designed for exactly that kind of job.
var pageRequester = new PageRequester(new CrawlConfiguration(), new WebContentExtractor());
var crawledPage = await pageRequester.MakeRequestAsync(new Uri("http://google.com"));

Log.Logger.Information("{result}", new
{
    url = crawledPage.Uri,
    status = Convert.ToInt32(crawledPage.HttpResponseMessage.StatusCode)
});
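For the question's use case, this approach could be combined with the Amazon check: fetch only each domain's front page and test its HTML against the regex from the question. A hedged sketch, assuming the Abot 2.x `Abot2.Core`/`Abot2.Poco` namespaces and that inspecting the front page alone is sufficient; `FrontPageChecker` and `HasAmazonLinkAsync` are hypothetical names:

```csharp
using System;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using Abot2.Core;
using Abot2.Poco;

class FrontPageChecker
{
    // Same pattern the question uses against extracted hrefs.
    private static readonly Regex AmazonRegex =
        new Regex(@"https?:\/\/(www|smile)\.amazon(\.co\.uk|\.com).*");

    // Fetches just the front page and reports whether its HTML
    // contains an Amazon link, without crawling any deeper.
    public static async Task<bool> HasAmazonLinkAsync(string domain)
    {
        var pageRequester = new PageRequester(new CrawlConfiguration(), new WebContentExtractor());
        var crawledPage = await pageRequester.MakeRequestAsync(new Uri(domain));
        return AmazonRegex.IsMatch(crawledPage.Content.Text);
    }
}
```

Matching the raw page text is cruder than parsing anchors with AngleSharp as in the question, but it avoids scheduling a full crawl when one page is enough.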
By the way, if you want to stop the crawler mid-process, you can use either of these two methods:
// 1. Hard crawl stop
crawlContext.CancellationTokenSource.Cancel();

// 2. Soft stop
crawlContext.IsCrawlStopRequested = true;