I have written a Scrapy spider with two different pipelines.
import scrapy
from scrapy import signals
from pydispatch import dispatcher  # scrapy.xlib.pydispatch in older Scrapy versions

from myproject.items import MyCrawlerItem  # project-specific item module


class MySpider(scrapy.Spider):
    name = 'general'
    # credentials picked up by HttpAuthMiddleware
    http_user = 'user'
    http_pass = 'userpass'

    def __init__(self, siteIndex, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # hook into the crawl lifecycle so I can watch requests being scheduled/dropped
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.request_scheduled, signals.request_scheduled)
        dispatcher.connect(self.request_dropped, signals.request_dropped)
        # meta passed to every request; 'splash' is the scrapy-splash meta key
        self.request_meta = {
            'splash': {
                'args': {
                    'html': 1,
                    'png': 0,
                    'images': 0,
                    'header': [{'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) '
                                              'Gecko/20100101 Firefox/54.0'}],
                    'wait': 1,
                    'timeout': 60
                },
            }
        }

    def request_scheduled(self, request, spider):
        pass

    def request_dropped(self, request, spider):
        pass

    def spider_opened(self, spider):
        pass

    def spider_closed(self, spider):
        pass

    def start_requests(self):
        # self.urls is populated elsewhere (based on siteIndex)
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse, priority=0, meta=self.request_meta)

    def parse(self, response):
        if ok_to_continue:  # simplified condition
            item = MyCrawlerItem(response=response)
            return item
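For reference, the 'splash' meta key is the scrapy-splash convention; a typical scrapy-splash configuration looks roughly like this (simplified sketch, not my exact settings), and the Splash-aware dupe filter in it becomes relevant in the update below:

# settings.py -- minimal scrapy-splash configuration (sketch)
SPLASH_URL = 'http://localhost:8050'  # address of the running Splash instance

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Splash-aware request fingerprinting; this is the DUPEFILTER_CLASS mentioned in the update
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'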
Based on this, I use crawler.engine.crawl(request, spider) in order to make non-blocking requests from inside a pipeline (and I use the request_scheduled signal to make sure all of the requests actually get scheduled).
from scrapy import Request
from scrapy.linkextractors import LinkExtractor


class CrawlPipeline(object):
    def process_item(self, item, spider):
        if crawl_or_not(item):  # simplified decision helper
            for link in LinkExtractor(canonicalize=True, unique=True,
                                      allow=spider.allowed_domains).extract_links(item['response']):
                request = Request(
                    url=link.url,
                    callback=spider.parse,
                    meta=spider.request_meta
                )
                request.meta['parent_url'] = 'http://parent_url'
                # self.crawler is set elsewhere (see the sketch below)
                self.crawler.engine.crawl(request, spider)
        do_some_other_stuff()
        return item
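For completeness, self.crawler has to come from somewhere; a common way to wire it up is the from_crawler hook (sketch, details omitted):

# Sketch of how the pipeline obtains the crawler reference.
class CrawlPipeline(object):

    def __init__(self, crawler):
        self.crawler = crawler  # later used for crawler.engine.crawl(...)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this when it builds the pipeline and hands over the
        # running Crawler, which exposes the engine.
        return cls(crawler)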
But as soon as the first request is fully processed (i.e., the last pipeline returns the item), all of the scheduled requests get dropped (according to the request_dropped signal).
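To see which requests are affected, the request_dropped handler can simply log them; a small debugging sketch (not my full handler) looks like this:

# Debugging sketch: log dropped requests and their meta keys.
def request_dropped(self, request, spider):
    # fires when the scheduler rejects a request, e.g. because the dupe
    # filter considered it a duplicate
    spider.logger.warning('Dropped request: %s (meta keys: %s)',
                          request.url, list(request.meta.keys()))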
Update:
I realized that the requests were being dropped because the Splash requests were filtered out by the DUPEFILTER_CLASS, so I added dont_filter=True to the requests to keep them, and then I noticed that every request had the same URL as the first one. I simply replaced spider.request_meta with a meta value built inside the pipeline, and the problem was solved. I don't understand why this caused the problem. Any explanation would be appreciated.
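For what it is worth, the fix boils down to giving every pipeline-generated request its own meta dict instead of reusing the single spider.request_meta object for all of them; one way to sketch it (simplified, using a deep copy instead of the dict I actually build in the pipeline) is:

# Inside the LinkExtractor loop of the pipeline: per-request meta instead of the shared one.
import copy

request_meta = copy.deepcopy(spider.request_meta)  # fresh, unshared copy
request_meta['parent_url'] = 'http://parent_url'

request = Request(
    url=link.url,
    callback=spider.parse,
    meta=request_meta,
    dont_filter=True,  # keep the Splash requests from being filtered out
)
self.crawler.engine.crawl(request, spider)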