I have a crawler that starts from a sitemap, scrapes (a couple) hundred unique URLs, and then does further processing on those pages. However, I only get callbacks for the first 10 URLs, and the spider log only shows HTTP GETs for those first 10 URLs.
import scrapy

class MySpider(scrapy.spider.BaseSpider):
    # settings ...

    def parse(self, response):
        urls = [...]  # the ~100 unique URLs extracted from the sitemap
        for url in urls:
            request = scrapy.http.Request(url, callback=self.parse_part2)
            print url
            yield request

    def parse_part2(self, response):
        print response.url
        # do more parsing here
Things I have considered:
Is there some mysterious max_branching_factor flag that I am not aware of?
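One way to rule that out is to dump the effective values of the built-in limits at startup. A rough sketch, assuming Scrapy 1.0+ where spiders expose self.settings and self.logger; the spider name and start URL are only illustrative:

import scrapy

class SettingsDumpSpider(scrapy.Spider):
    # Illustrative spider that only logs the effective values of the
    # built-in settings that can silently cut a crawl short.
    name = "settings_dump"
    start_urls = ["http://example.com"]

    def parse(self, response):
        for key in ("CONCURRENT_REQUESTS", "DEPTH_LIMIT",
                    "CLOSESPIDER_PAGECOUNT", "CLOSESPIDER_ITEMCOUNT"):
            self.logger.info("%s = %r", key, self.settings.get(key))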
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url1>
yay callback!
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url2>
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url3>
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url4>
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url5>
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url6>
yay callback!
yay callback!
yay callback!
yay callback!
yay callback!
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url7>
yay callback!
2015-02-11 02:05:13-0800 [mysite] DEBUG: Crawled (200) <GET url8>
yay callback!
2015-02-11 02:05:13-0800 [mysite] DEBUG: Crawled (200) <GET url9>
yay callback!
2015-02-11 02:05:13-0800 [mysite] DEBUG: Crawled (200) <GET url10>
yay callback!
2015-02-11 02:05:13-0800 [mysite] INFO: Closing spider (finished)
2015-02-11 02:05:13-0800 [mysite] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4590,
'downloader/request_count': 11,
'downloader/request_method_count/GET': 11,
'downloader/response_bytes': 638496,
'downloader/response_count': 11,
'downloader/response_status_count/200': 11,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 2, 11, 10, 5, 13, 260322),
'log_count/DEBUG': 17,
'log_count/INFO': 3,
'request_depth_max': 1,
'response_received_count': 11,
'scheduler/dequeued': 11,
'scheduler/dequeued/memory': 11,
'scheduler/enqueued': 11,
'scheduler/enqueued/memory': 11,
'start_time': datetime.datetime(2015, 2, 11, 10, 5, 12, 492811)}
2015-02-11 02:05:13-0800 [mysite] INFO: Spider closed (finished)
Answer 0 (score: 1)
So I found this property in one of my settings files:
max_requests / MAX_REQUESTS = 10
It was responsible for the spider exiting early (oops).
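MAX_REQUESTS does not appear to be a stock Scrapy setting, so it was presumably read by project-specific code; the built-in way to get the same early-exit behaviour is the CloseSpider extension. A minimal sketch of a settings.py entry that would reproduce the "only 10 URLs" symptom (the value 10 simply mirrors the cap described above):

# settings.py
# CLOSESPIDER_PAGECOUNT is a built-in Scrapy setting (CloseSpider extension)
# that closes the spider after this many responses have been downloaded.
CLOSESPIDER_PAGECOUNT = 10   # 0 (the default) disables the cap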
Answer 1 (score: 0)
Try setting LOG_LEVEL to DEBUG and you will see more log output. If you do, please paste the logs here.
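For reference, the log level can be set either in settings.py or per run on the command line; a minimal sketch (the spider name mysite comes from the log above):

# settings.py
LOG_LEVEL = 'DEBUG'   # 'DEBUG' is the default; 'INFO'/'WARNING' hide the per-request lines

or, as a one-off override when launching the crawl:

scrapy crawl mysite -s LOG_LEVEL=DEBUG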