By default, Scrapy sets a Cache-Control header of 2 days (172800 seconds) on all images saved by the ImagesPipeline. I want to raise this value to 2592000 seconds, i.e. 30 days.
The original ImagesPipeline looks like this:
class ImagesPipeline(FilesPipeline):
    ...
    @classmethod
    def from_settings(cls, settings):
        ...
        s3store = cls.STORE_SCHEMES['s3']
        ...
    ...
STORE_SCHEMES is inherited from FilesPipeline:
class FilesPipeline(MediaPipeline):
    ...
    STORE_SCHEMES = {
        '': FSFilesStore,
        'file': FSFilesStore,
        's3': S3FilesStore,
    }
    ...
And S3FilesStore looks like this:
class S3FilesStore(object):
    ...
    POLICY = 'public-read'
    HEADERS = {
        'Cache-Control': 'max-age=172800',
    }
I tested simply editing the value in the original Scrapy class, changing it from 172800 to 2592000. That works: when I checked, the cache on all images had changed to 30 days. But patching the library source is obviously not a good solution, so I want to override it with my own custom classes instead.
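For reference, one way to check the resulting header on an already uploaded object is a quick boto3 head_object call. This is only a sketch; the bucket name and key below are placeholders.

import boto3

# Minimal sketch: read back the Cache-Control header of an uploaded image.
# 'my-images-bucket' and the key are placeholders for your own bucket/path.
s3 = boto3.client('s3')
response = s3.head_object(Bucket='my-images-bucket', Key='full/0a1b2c3d4e.jpg')
# 'CacheControl' only appears in the response if the header was set on upload.
print(response.get('CacheControl'))  # e.g. 'max-age=172800'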
So, to override HEADERS in S3FilesStore, I created a custom class called CustomS3FilesStore that overrides the variable, and then a custom ImagesPipeline that sets CustomS3FilesStore as s3store.
I do this with the following code:
# Override the default headers and policies with a 30 days cache
class CustomS3FilesStore(S3FilesStore):
    POLICY = 'public-read'
    HEADERS = {
        'Cache-Control': 'max-age=2592000',
    }

# Set S3 scheme to our own override class CustomS3FilesStore
class CustomImagesPipeline(ImagesPipeline):
    @classmethod
    def from_settings(cls, settings):
        cls.MIN_WIDTH = settings.getint('IMAGES_MIN_WIDTH', 0)
        cls.MIN_HEIGHT = settings.getint('IMAGES_MIN_HEIGHT', 0)
        cls.EXPIRES = settings.getint('IMAGES_EXPIRES', 90)
        cls.THUMBS = settings.get('IMAGES_THUMBS', {})
        # Override the default value to our CustomS3FilesStore Class
        s3store = CustomS3FilesStore
        s3store.AWS_ACCESS_KEY_ID = settings['AWS_ACCESS_KEY_ID']
        s3store.AWS_SECRET_ACCESS_KEY = settings['AWS_SECRET_ACCESS_KEY']
        cls.IMAGES_URLS_FIELD = settings.get('IMAGES_URLS_FIELD', cls.DEFAULT_IMAGES_URLS_FIELD)
        cls.IMAGES_RESULT_FIELD = settings.get('IMAGES_RESULT_FIELD', cls.DEFAULT_IMAGES_RESULT_FIELD)
        store_uri = settings['IMAGES_STORE']
        return cls(store_uri)
I then enable my CustomImagesPipeline in ITEM_PIPELINES in settings.py, like this:
ITEM_PIPELINES = {
    'condobot.pipelines.CustomImagesPipeline': 100,
    ...
}
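For completeness, the s3 store scheme is only selected when IMAGES_STORE points at an s3:// URI. A minimal sketch of the related settings, with the bucket name and credentials as placeholders:

# settings.py (sketch; bucket name and credentials are placeholders)
IMAGES_STORE = 's3://my-images-bucket/images/'
AWS_ACCESS_KEY_ID = 'AKIA...'
AWS_SECRET_ACCESS_KEY = '...'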
Result: when I run the spider I get zero errors and all images are downloaded, but the Cache-Control header on the images is still only 2 days, i.e. 172800 seconds. My override does not take effect.
Any idea what I'm doing wrong? How do I actually change the Cache-Control on Scrapy images?
Answer 0 (score: -1)
The problem is that you're not really overriding the default S3FilesStore.
The *FilesStore classes are registered in the STORE_SCHEMES attribute; the s3store variable in from_settings is only used to attach the AWS keys, so reassigning it there does not change which store class the pipeline actually uses.
Try setting it in the constructor instead, like this:
class CustomImagesPipeline(ImagesPipeline):
    def __init__(self, *args, **kwargs):
        super(CustomImagesPipeline, self).__init__(*args, **kwargs)
        self.STORE_SCHEMES['s3'] = CustomS3FilesStore
    ...
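An alternative worth noting: since both from_settings and the store lookup read STORE_SCHEMES off the class, shadowing it with a copy on the subclass achieves the same effect without mutating the dict shared with the stock pipelines. A minimal sketch, assuming the CustomS3FilesStore defined in the question; the import path shown is for recent Scrapy versions (older releases use scrapy.contrib.pipeline.images):

from scrapy.pipelines.images import ImagesPipeline

class CustomImagesPipeline(ImagesPipeline):
    # Copy the parent mapping and swap in the custom S3 store so that
    # both from_settings and the store lookup pick up CustomS3FilesStore.
    STORE_SCHEMES = dict(ImagesPipeline.STORE_SCHEMES, s3=CustomS3FilesStore)

With this in place the pipeline can still be built through the regular from_settings path, because cls.STORE_SCHEMES['s3'] then resolves to the subclass attribute.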