To capture every redirect path, including the case where the final URL has already been crawled, I wrote a custom duplicate filter:
import logging

from scrapy.dupefilters import RFPDupeFilter

from seoscraper.items import RedirectionItem


class CustomURLFilter(RFPDupeFilter):

    def __init__(self, path=None, debug=False):
        super(CustomURLFilter, self).__init__(path, debug)

    def request_seen(self, request):
        request_seen = super(CustomURLFilter, self).request_seen(request)

        if request_seen:
            # The final URL was already crawled: record the redirect chain that led here.
            item = RedirectionItem()
            # 'redirect_urls' is missing when the request was never redirected.
            item['sources'] = list(request.meta.get('redirect_urls', []))
            item['destination'] = request.url
            # The item is built here, but nothing sends it anywhere yet.

        return request_seen
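Now, how can I send the RedirectionItem directly to the pipeline? Is there a way to instantiate the pipeline from the custom filter so that I can send the data straight to it? Or should I instead create a custom scheduler and get the pipeline from there, and if so, how?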
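To make the intent concrete, here is a minimal, library-free sketch of the behaviour I am after. It deliberately does not use Scrapy: `Redirection`, `FakeRequest`, `ReportingFilter`, and the `on_duplicate` callback are all hypothetical stand-ins that I made up for illustration, and the callback only marks the spot where a pipeline would receive the item. How to obtain such a sink inside a real Scrapy filter is exactly what I am asking.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Redirection:
    """Hypothetical stand-in for RedirectionItem."""
    sources: List[str]
    destination: str


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a request: just a URL and a meta dict."""
    url: str
    meta: dict = field(default_factory=dict)


class ReportingFilter:
    """Duplicate filter that reports the redirect chain of every duplicate."""

    def __init__(self, on_duplicate: Callable[[Redirection], None]):
        self.seen: Set[str] = set()
        # Assumed sink: in the real project this would be something pipeline-like.
        self.on_duplicate = on_duplicate

    def request_seen(self, request: FakeRequest) -> bool:
        if request.url not in self.seen:
            self.seen.add(request.url)
            return False
        # Duplicate: the final URL was already crawled, so report how we got here.
        self.on_duplicate(Redirection(
            sources=list(request.meta.get("redirect_urls", [])),
            destination=request.url,
        ))
        return True


collected: List[Redirection] = []
dupefilter = ReportingFilter(on_duplicate=collected.append)

# First visit of the final URL: not a duplicate, nothing is reported.
dupefilter.request_seen(FakeRequest("http://example.com/final"))
# Second visit, reached through a redirect: the chain is reported to the sink.
dupefilter.request_seen(FakeRequest(
    "http://example.com/final",
    meta={"redirect_urls": ["http://example.com/old"]},
))
```

In this sketch the filter never knows what the sink is; it only calls whatever it was given. That is the kind of decoupling I would like to keep in the Scrapy version, if the framework allows it.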