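I want to change the output folder for downloaded files. Based on the source code of files pipeline, file_path can be overridden, so I tried the code below, but it doesn't seem to work. By the way, I'm new to Python and Scrapy.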
pipelines.py
import hashlib
import os

from scrapy.http import Request
from scrapy.pipelines.files import FilesPipeline
from scrapy.utils.python import to_bytes

class secFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        ## start of deprecation warning block (can be removed in the future)
        def _warn():
            from scrapy.exceptions import ScrapyDeprecationWarning
            import warnings
            warnings.warn('FilesPipeline.file_key(url) method is deprecated, please use '
                          'file_path(request, response=None, info=None) instead',
                          category=ScrapyDeprecationWarning, stacklevel=1)
        # check if called from file_key with url as first argument
        if not isinstance(request, Request):
            _warn()
            url = request
        else:
            url = request.url
        # detect if file_key() method has been overridden
        if not hasattr(self.file_key, '_base'):
            _warn()
            return self.file_key(url)
        ## end of deprecation warning block
        media_guid = hashlib.sha1(to_bytes(url)).hexdigest()  # change to request.url after deprecation
        media_ext = os.path.splitext(url)[1]  # change to request.url after deprecation
        return 'test/%s%s' % (media_guid, media_ext)
settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.secFilesPipeline': 2,
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = '/home/joseph/pdf'
Expected output: FILES_STORE + month + filename.pdf, for example /home/joseph/pdf/September/filename.pdf
Any ideas? Thanks.
Answer 0 (score: 0)
According to the documentation, setting the FILES_STORE value in settings.py is enough.
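As a rough sketch of what that would look like (assuming the stock FilesPipeline and the asker's /home/joseph/pdf path), the relevant part of settings.py could be as small as the fragment below. Note that the stock pipeline's default file_path returns 'full/<SHA-1 of the URL><extension>', so this setting alone controls the base folder but not a per-month subfolder.

```python
# settings.py (sketch): only the stock pipeline, no custom subclass
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}

# Base directory for downloads. With the default file_path, files end up in
# /home/joseph/pdf/full/<sha1-of-url><ext>
FILES_STORE = '/home/joseph/pdf'
```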
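Answer 1 (score: 0)
What you should do is override the file_path method, so you are on the right track. However, there seem to be some indentation problems in your code sample. I ran into trouble with this as well, because when an error occurs Scrapy silently skips saving the file, even with the log level set to DEBUG. So start simple: just return a plain string from file_path to test.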
# needs at module level: import hashlib, os; from scrapy.utils.python import to_bytes
def file_path(self, request, response=None, info=None):
    url = request.url
    media_guid = hashlib.sha1(to_bytes(url)).hexdigest()  # SHA-1 of the URL as the file name
    media_ext = os.path.splitext(url)[1]  # keep the original extension, e.g. '.pdf'
    return 'full/%s%s' % (media_guid, media_ext)
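To make the "start simple" advice concrete, here is a minimal sketch of an override that returns a constant string. The class name DebugFilesPipeline and the path 'test/debug.pdf' are made up for illustration, and the keyword-only item argument is included on the assumption that newer Scrapy releases pass it; older releases simply never supply it.

```python
try:
    from scrapy.pipelines.files import FilesPipeline
except ImportError:
    FilesPipeline = object  # stand-in so the sketch can be run without Scrapy


class DebugFilesPipeline(FilesPipeline):  # hypothetical name
    def file_path(self, request, response=None, info=None, *, item=None):
        # Constant path: every download overwrites the same file, which is
        # enough to check that this override is actually being called.
        return 'test/debug.pdf'
```

If a file shows up at FILES_STORE/test/debug.pdf after a crawl, the subclass is wired in correctly, and the real naming logic can then be added one step at a time.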
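What you can do is set FILES_STORE to the base path, for example /home/joseph/pdf, and then change the file_path method so that it returns something like
return '%s/%s%s' % (month_text, media_guid, media_ext)
You will need the appropriate date functions to set month_text explicitly.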
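Putting these pieces together, one possible sketch is shown below. The helper month_file_path and the class name SecFilesPipeline are illustrative names, not part of Scrapy. Two deliberate deviations from the snippet above: the extension is taken from the URL's path via urlparse so that a query string does not leak into it, and the URL is encoded with str.encode instead of to_bytes. The month name comes from strftime('%B'), which depends on the process locale.

```python
import datetime
import hashlib
import os
from urllib.parse import urlparse

try:
    from scrapy.pipelines.files import FilesPipeline
except ImportError:
    FilesPipeline = object  # stand-in so the helper can be tried without Scrapy


def month_file_path(url, today=None):
    """Return '<month name>/<sha1 of url><extension>', relative to FILES_STORE."""
    today = today or datetime.date.today()
    month_text = today.strftime('%B')  # full month name in the current locale
    media_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    media_ext = os.path.splitext(urlparse(url).path)[1]
    return '%s/%s%s' % (month_text, media_guid, media_ext)


class SecFilesPipeline(FilesPipeline):  # hypothetical name
    def file_path(self, request, response=None, info=None, *, item=None):
        # The returned path is relative; Scrapy joins it onto FILES_STORE.
        return month_file_path(request.url)
```

With FILES_STORE = '/home/joseph/pdf', a file downloaded in September would then land under /home/joseph/pdf/September/. To keep the original file name instead of the hash, media_guid could presumably be replaced with the base name of the URL path, at the risk of name collisions. It is also probably worth listing only this subclass in ITEM_PIPELINES, since keeping the stock FilesPipeline enabled alongside it would likely store a second copy under full/.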