I have already found two answers to this question, but neither of them worked for me. Basically, I want to limit the number of pages crawled per domain. Here is the code in the actual spider:
def parse_page(self, response):
    # visited_count and denied are list attributes on the spider;
    # count how many pages from this domain have been visited so far
    domain = response.url.split('/')[2]
    self.visited_count.append(domain)
    if self.visited_count.count(domain) > 49:
        print('!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!')
        self.denied.append(response.url)
And the custom middleware:
from scrapy.exceptions import IgnoreRequest

class IgnoreDomain(object):
    # the hook must be named process_request (not process_requests) and take self
    def process_request(self, request, spider):
        # compare URLs, not Request objects, against the denied list
        if request.url in spider.denied:
            # IgnoreRequest is an exception and must be raised, not returned
            raise IgnoreRequest()
        return None
The middleware is of course listed in the settings. I would really appreciate it if you could point out what I am doing wrong.
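For reference, registering a custom downloader middleware in settings.py typically looks like the snippet below; the module path here is a placeholder for wherever IgnoreDomain actually lives, and 543 is just the conventional example priority from the Scrapy docs:

# settings.py -- 'myproject.middlewares' is a placeholder module path
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.IgnoreDomain': 543,
}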
Answer 0 (score: 1)
You said: "I want to restrict the amount of pages crawled per domain"
......
To do this, create a counter in your spider:
import scrapy

class YourSpider(scrapy.Spider):
    counter = {}
    # counter will hold values like {'google': 4, 'website': 2} --
    # the keys are tldextract's domain part, see the middleware below
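A quick sanity check of what tldextract returns, so the counter keys are unambiguous (tldextract is a third-party package, installed with pip install tldextract):

import tldextract

ext = tldextract.extract('http://www.google.com/search?q=x')
# ext.subdomain == 'www', ext.domain == 'google', ext.suffix == 'com';
# the [1] indexing used in the answer picks out the domain field, 'google'
print(ext.domain)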
At the top of your middleware file, write this:
import logging

import tldextract  # third-party: pip install tldextract
from scrapy.exceptions import IgnoreRequest

class YourMiddleware(object):

    def process_request(self, request, spider):
        # tldextract.extract(...)[1] is the registered domain, e.g. 'google'
        domain = tldextract.extract(request.url)[1]
        logging.info(spider.counter)
        if domain not in spider.counter:
            pass  # first page from this domain: keep scraping this link
        elif spider.counter[domain] > 5:
            # the domain has exceeded its page budget: drop the request
            raise IgnoreRequest()
        # otherwise fall through and keep processing this request

    def process_response(self, request, response, spider):
        # count one fetched page per domain
        domain = tldextract.extract(request.url)[1]
        if domain not in spider.counter:
            spider.counter[domain] = 1
        else:
            spider.counter[domain] = spider.counter[domain] + 1
        return response
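For completeness, a minimal sketch of how the pieces could fit together; the spider name and start URL below are illustrative assumptions, not part of the original answer, and the middleware must still be registered under DOWNLOADER_MIDDLEWARES as shown earlier:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'  # hypothetical spider, for illustration only
    start_urls = ['http://quotes.toscrape.com/']
    counter = {}  # incremented by YourMiddleware.process_response

    def parse(self, response):
        # follow every link on the page; the middleware silently drops
        # requests once a domain's response count has passed 5
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

Note that with the > 5 comparison, requests are only ignored after a domain's counter has passed 5, so each domain can yield up to six pages; use >= 5 if you want exactly five.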