Multi-threaded web crawling with Crawler4j: missing pages

Time: 2015-07-10 18:09:33

Tags: java web-crawler scrapy-spider crawler4j

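I am using Crawler4j, a multi-threaded crawler, to crawl some websites. It lets the user set the number of crawler threads to run against a site. I decided to crawl to depth/layer = 10, with a cap of 501 pages per depth. With this setup, I ran the crawler with the # of threads set to 1 and then with the # of threads set to 5. For one website, I recorded the layer number and the # of pages crawled in that layer. The data is as follows: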

**Crawler with 1 thread**
Layer: 0 frequency: 0
Layer: 1 frequency: 41
Layer: 2 frequency: 306
Layer: 3 frequency: 416
Layer: 4 frequency: 501
Layer: 5 frequency: 501
Layer: 6 frequency: 199
Layer: 7 frequency: 113
Layer: 8 frequency: 6
Layer: 9 frequency: 0
Layer: 10 frequency: 0

**Crawler with 5 threads**
Layer: 0 frequency: 0
Layer: 1 frequency: 41
Layer: 2 frequency: 268
Layer: 3 frequency: 384
Layer: 4 frequency: 501
Layer: 5 frequency: 501
Layer: 6 frequency: 338
Layer: 7 frequency: 501
Layer: 8 frequency: 501
Layer: 9 frequency: 501
Layer: 10 frequency: 501

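The first two layers are fine: the page frequencies for those two layers are the same in both runs. From there on, though, I see differences. My question is: should the frequency data differ between the two runs (1 thread and 5 threads)? If yes, what is the logic behind it? If not, what could be causing it?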
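
To make the setup concrete, here is a minimal sketch of how a per-layer cap and frequency count like the one described above could be kept consistent across several threads. It is not Crawler4j code and not necessarily how my crawler does it: `LayerCounter`, `tryRecord`, and `count` are hypothetical names, and the values 10 and 501 are simply the depth and per-depth limits from this question.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical helper (not part of Crawler4j): counts pages per layer,
// refusing to count more than maxPerLayer pages in any single layer.
class LayerCounter {
    private final int maxPerLayer;
    private final AtomicIntegerArray counts; // index = layer (depth)

    LayerCounter(int maxDepth, int maxPerLayer) {
        this.maxPerLayer = maxPerLayer;
        this.counts = new AtomicIntegerArray(maxDepth + 1);
    }

    // Returns true if the page was counted, false if the layer is out of
    // range or has already reached its cap. The compare-and-set loop keeps
    // the cap exact even when several threads record pages concurrently.
    boolean tryRecord(int layer) {
        if (layer < 0 || layer >= counts.length()) {
            return false;
        }
        while (true) {
            int current = counts.get(layer);
            if (current >= maxPerLayer) {
                return false;
            }
            if (counts.compareAndSet(layer, current, current + 1)) {
                return true;
            }
        }
    }

    int count(int layer) {
        return counts.get(layer);
    }

    public static void main(String[] args) throws InterruptedException {
        LayerCounter counter = new LayerCounter(10, 501);
        ExecutorService pool = Executors.newFixedThreadPool(5);
        // Five worker threads each try to record 200 pages in layer 4.
        for (int t = 0; t < 5; t++) {
            pool.submit(() -> {
                for (int i = 0; i < 200; i++) {
                    counter.tryRecord(4);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        for (int layer = 0; layer <= 10; layer++) {
            System.out.println("Layer: " + layer + " frequency: " + counter.count(layer));
        }
    }
}
```

The sketch only shows that the counting itself can be made safe under concurrency. It says nothing about which pages end up in which layer, and that assignment is the part I am asking about.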

0 answers:

No answers yet