Scrapy: how to send data from a custom filter to a pipeline without downloading

Asked: 2016-08-30 08:39:37

Tags: scrapy scrapy-pipeline

To capture every redirect path, including the cases where the final URL has already been crawled, I wrote a custom duplicate filter:

from scrapy.dupefilters import RFPDupeFilter
from seoscraper.items import RedirectionItem


class CustomURLFilter(RFPDupeFilter):
    """Duplicate filter that also records the redirect chain of seen requests."""

    def request_seen(self, request):
        seen = super(CustomURLFilter, self).request_seen(request)

        if seen:
            # Build the item for a duplicate request; it still has to reach the pipeline.
            item = RedirectionItem()
            item['sources'] = list(request.meta.get('redirect_urls', []))
            item['destination'] = request.url

        return seen

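Now, how can I send the RedirectionItem directly to the pipeline? Is there a way to instantiate the pipeline from the custom filter so that I can send the data directly? Or should I also create a custom scheduler and get the pipeline from there, and if so, how?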
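To make the intent concrete, here is a minimal sketch of the behaviour I am after, written without Scrapy so it runs on its own. Everything in it is a hypothetical stand-in: `FakeRequest` replaces `scrapy.Request`, `ForwardingFilter` replaces my `CustomURLFilter`, and `sink` is an assumed hook that would ideally be the pipeline's entry point. It compares plain URLs instead of the request fingerprints the real filter uses, and it does not show how to obtain the real pipeline, which is exactly the open question.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: only url and meta are modelled."""
    url: str
    meta: Dict[str, List[str]] = field(default_factory=dict)


class ForwardingFilter:
    """Duplicate filter that reports every already-seen request to a sink."""

    def __init__(self, sink: Callable[[dict], None]) -> None:
        self.sink = sink             # assumed hook, e.g. a pipeline entry point
        self.seen: Set[str] = set()  # simplified: plain URLs, not fingerprints

    def request_seen(self, request: FakeRequest) -> bool:
        if request.url in self.seen:
            # Duplicate: hand over the redirect chain instead of dropping it.
            self.sink({
                "sources": list(request.meta.get("redirect_urls", [])),
                "destination": request.url,
            })
            return True
        self.seen.add(request.url)
        return False


# Usage: a plain list plays the role of the pipeline here.
collected: List[dict] = []
dupefilter = ForwardingFilter(collected.append)
dupefilter.request_seen(FakeRequest("http://example.com/final"))
dupefilter.request_seen(
    FakeRequest("http://example.com/final",
                {"redirect_urls": ["http://example.com/old"]})
)
```

In this sketch the first request is only recorded as seen, and the second one, which arrives at the same URL through a redirect, is forwarded to the sink with its source URLs. What I am missing is the Scrapy equivalent of passing `sink` into the filter.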

0 answers:

No answers yet.