Scrapy覆盖FilesPipeline中的file_path

时间:2017-09-12 13:29:29

标签: python scrapy

我想修改下载文件的输出文件夹并基于source code of files pipelinefile_path可以覆盖,我尝试下面的代码,但似乎我没有工作。顺便说一句,我是python-scrapy的新手。

pipelines.py

from scrapy.pipelines.files import FilesPipeline

class secFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
    ## start of deprecation warning block (can be removed in the future)
    def _warn():
        from scrapy.exceptions import ScrapyDeprecationWarning
        import warnings
        warnings.warn('FilesPipeline.file_key(url) method is deprecated, please use '
                      'file_path(request, response=None, info=None) instead',
                      category=ScrapyDeprecationWarning, stacklevel=1)

    # check if called from file_key with url as first argument
    if not isinstance(request, Request):
        _warn()
        url = request
    else:
        url = request.url

    # detect if file_key() method has been overridden
    if not hasattr(self.file_key, '_base'):
        _warn()
        return self.file_key(url)
    ## end of deprecation warning block

    media_guid = hashlib.sha1(to_bytes(url)).hexdigest()  # change to request.url after deprecation
    media_ext = os.path.splitext(url)[1]  # change to request.url after deprecation
    return 'test/%s%s' % (media_guid, media_ext)

settings.py

ITEM_PIPELINES = {
'myproject.pipelines.secFilesPipeline': 2,
'scrapy.pipelines.files.FilesPipeline': 1,
}

FILES_STORE = '/home/joseph/pdf'

预期产出:例如FILES_STORE +月+ filename.pdf = /home/joseph/pdf/September/filename.pdf

有什么想法吗?谢谢。

2 个答案:

答案 0 :(得分:0)

根据documentation设置FILES_STORE中的settings.py值就足够了。

答案 1 :(得分:0)

您应该做的是覆盖file_path方法。所以你走在正确的轨道上。但是在您的代码示例中似乎存在一些缩进问题。我也遇到了问题,因为当发生错误时,即使设置为DEBUGscrapy也会默默地忽略文件保存。所以从简单开始,只需从file_path返回一个简单的字符串进行测试。

the files pipeline file ...

def file_path(self, request, response=None, info=None):
    url = request.url
    media_guid = hashlib.sha1(to_bytes(url)).hexdigest()  # change to request.url after deprecation
    media_ext = os.path.splitext(url)[1]  # change to request.url after deprecation
    return 'full/%s%s' % (media_guid, media_ext)

您可以做的是定义FILES_STORE的基本路径,例如/home/joseph/pdf,然后更改file_path方法,使其返回类似

的内容
    return '%s/%s%s' % (month_text, media_guid, media_ext)

您必须拥有必要的日期功能才能明确设置month_text