AbotX: how to create a parallel crawler that can have new sites added to it at runtime

Time: 2016-07-11 10:38:17

Tags: web-crawler

I set up an AlwaysOnSiteToCrawlProvider as a singleton, add sites to it via .AddSitesToCrawl(), and pass it to the ParallelCrawlerEngine.

I can instantiate it and let it sit idle. I can then add a site and it crawls fine. But if I then add another site, the second site is never crawled.
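For reference, a minimal sketch of the setup described above. This is not a verified AbotX sample: the exact namespaces, constructor overloads, and start method vary between AbotX versions, so everything here beyond AlwaysOnSiteToCrawlProvider, .AddSitesToCrawl(), and ParallelCrawlerEngine (the names from the question) is an assumption to be checked against the AbotX samples for your version.

```csharp
using System;
using System.Collections.Generic;
using AbotX.Parallel; // assumed namespace for ParallelCrawlerEngine/SiteToCrawl

class Program
{
    static void Main()
    {
        // The single long-lived provider instance (the "singleton" from the
        // question), shared between the caller and the engine.
        var provider = new AlwaysOnSiteToCrawlProvider();

        // Engine construction and start are sketched from memory; check the
        // AbotX samples for the exact overloads your version exposes.
        var engine = new ParallelCrawlerEngine(provider);
        engine.StartAsync();

        // First site, added after the engine is already running: crawls fine.
        provider.AddSitesToCrawl(new List<SiteToCrawl>
        {
            new SiteToCrawl { Uri = new Uri("http://www.existingSite.com/") }
        });

        // Second site, added later: this is the one that is never picked up.
        provider.AddSitesToCrawl(new List<SiteToCrawl>
        {
            new SiteToCrawl { Uri = new Uri("http://www.newSite.com/") }
        });

        Console.ReadLine(); // keep the process alive while the engine runs
    }
}
```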

I looked at the example on the site, but it doesn't seem to show how this is supposed to work when new items are added after the initial run. .AddSitesToCrawl() adds them to the list, but they seem to sit in a limbo where they are never read.

Looking at the logs, I get a 'completed' message for a site even though that site has not been re-crawled:

[2016-07-11 11:17:18,361] [20 ] [INFO ] - Crawl for domain [http://www.existingSite.com/] completed successfully in [0.0001118] seconds, crawled [361] pages

If I add a new site, I get an error:

[2016-07-11 11:17:33,365] [23 ] [ERROR] - Crawl for domain [http://www.newSite.com/] failed after [0.0066498] seconds, crawled [361] pages
[2016-07-11 11:17:33,365] [23 ] [ERROR] - System.InvalidOperationException: Cannot call DoWork() after AbortAll() or Dispose() have been called.
   at Abot.Util.ThreadManager.DoWork(Action action)
   at Abot.Crawler.WebCrawler.CrawlSite()
   at Abot.Crawler.WebCrawler.Crawl(Uri uri, CancellationTokenSource cancellationTokenSource)


0 Answers:

There are no answers yet