Overriding the Cache-Control header in the Scrapy ImagesPipeline

Date: 2015-05-07 15:58:42

Tags: python scrapy scrapy-spider

By default, Scrapy's ImagesPipeline sets a Cache-Control header of 2 days (172800 seconds) on all images it saves. I want to change this value to 2592000 seconds, or 30 days.
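For reference, both values are plain second counts, which is easy to sanity-check:

# max-age values in seconds (days * hours * minutes * seconds)
assert 2 * 24 * 60 * 60 == 172800     # Scrapy's default: 2 days
assert 30 * 24 * 60 * 60 == 2592000   # desired value: 30 days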

The original ImagesPipeline looks like this:

class ImagesPipeline(FilesPipeline):
    ...
    @classmethod
    def from_settings(cls, settings):
        ...
        s3store = cls.STORE_SCHEMES['s3']
        ...
...

It inherits from FilesPipeline, which defines STORE_SCHEMES:

class FilesPipeline(MediaPipeline):
    ...
    STORE_SCHEMES = {
        '': FSFilesStore,
        'file': FSFilesStore,
        's3': S3FilesStore,
    }
    ...

And S3FilesStore looks like this:

class S3FilesStore(object):
    ...
    POLICY = 'public-read'
    HEADERS = {
        'Cache-Control': 'max-age=172800',
    }

I have tested simply editing the original value in the Scrapy source class and changing it from 172800 to 2592000. That works: when I test it, the cache on all images changes to 30 days. But obviously that is not a good solution, so I want to override it with my own custom class instead.

So, to be able to override HEADERS = {} in S3FilesStore, I figured I had to create a custom class, which I call CustomS3FilesStore, that overrides the variable, and then create a custom ImagesPipeline that sets CustomS3FilesStore as the s3store.

I do this with the following code:

# Override the default headers and policies with a 30 days cache
class CustomS3FilesStore(S3FilesStore):
    POLICY = 'public-read'
    HEADERS = {
        'Cache-Control': 'max-age=2592000',
    }

# Set S3 scheme to our own override class CustomS3FilesStore
class CustomImagesPipeline(ImagesPipeline):

    @classmethod
    def from_settings(cls, settings):
        cls.MIN_WIDTH = settings.getint('IMAGES_MIN_WIDTH', 0)
        cls.MIN_HEIGHT = settings.getint('IMAGES_MIN_HEIGHT', 0)
        cls.EXPIRES = settings.getint('IMAGES_EXPIRES', 90)
        cls.THUMBS = settings.get('IMAGES_THUMBS', {})

        # Override the default value to our CustomS3FilesStore Class
        s3store = CustomS3FilesStore
        s3store.AWS_ACCESS_KEY_ID = settings['AWS_ACCESS_KEY_ID']
        s3store.AWS_SECRET_ACCESS_KEY = settings['AWS_SECRET_ACCESS_KEY']

        cls.IMAGES_URLS_FIELD = settings.get('IMAGES_URLS_FIELD', cls.DEFAULT_IMAGES_URLS_FIELD)
        cls.IMAGES_RESULT_FIELD = settings.get('IMAGES_RESULT_FIELD', cls.DEFAULT_IMAGES_RESULT_FIELD)
        store_uri = settings['IMAGES_STORE']
        return cls(store_uri)

I then use my CustomImagesPipeline in ITEM_PIPELINES in my settings.py file, like this:

ITEM_PIPELINES = {
    'condobot.pipelines.CustomImagesPipeline': 100,
    ...
}

Result: when I run the spider I get 0 errors and all the images are downloaded. But the Cache-Control header on the images is still only 2 days, or 172800 seconds. I have not succeeded in overriding the setting.

Any idea what I am doing wrong? How do I actually change the Cache-Control of Scrapy images?

1 answer:

Answer 0 (score: -1)

The problem is that you are not really overriding the default S3FilesStore.

The *FilesStore classes are already registered in the STORE_SCHEMES attribute; in from_settings that attribute is only used to pick up the AWS credentials.

Try setting it in the constructor instead, like this:

class CustomImagesPipeline(ImagesPipeline):

    def __init__(self, *args, **kwargs):
        # Register the custom store before calling the parent constructor,
        # since FilesPipeline.__init__ builds self.store from STORE_SCHEMES.
        self.STORE_SCHEMES['s3'] = CustomS3FilesStore
        super(CustomImagesPipeline, self).__init__(*args, **kwargs)

    ...
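As a related, minimal sketch (assuming Scrapy 1.0+ import paths, scrapy.pipelines.*; on the 0.24 series they are scrapy.contrib.pipeline.* instead, and assuming the store lookup goes through self.STORE_SCHEMES as in the excerpts above): since the inherited from_settings reads cls.STORE_SCHEMES['s3'], overriding the mapping as a class attribute lets the AWS credentials land on the custom store without copying from_settings at all.

# Sketch: override the 's3' scheme at class level instead of in __init__.
from scrapy.pipelines.files import S3FilesStore
from scrapy.pipelines.images import ImagesPipeline


class CustomS3FilesStore(S3FilesStore):
    POLICY = 'public-read'
    HEADERS = {
        'Cache-Control': 'max-age=2592000',  # 30 days
    }


class CustomImagesPipeline(ImagesPipeline):
    # Copy the mapping so the stock pipelines keep their default store,
    # then point the 's3' scheme at the custom store. The inherited
    # from_settings reads cls.STORE_SCHEMES['s3'], so the AWS credentials
    # are set on CustomS3FilesStore automatically.
    STORE_SCHEMES = dict(ImagesPipeline.STORE_SCHEMES)
    STORE_SCHEMES['s3'] = CustomS3FilesStore

The settings.py entry stays the same; ITEM_PIPELINES just needs to point at the custom pipeline class.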