How to avoid re-downloading media to S3 in Scrapy?

Asked: 2017-06-29 11:34:19

Tags: python amazon-s3 scrapy

I have asked a similar question before (How does Scrapy avoid re-downloading media that was downloaded recently?), but since I did not receive a definitive answer, I will ask it again.

I have downloaded a large number of files to an AWS S3 bucket using Scrapy's Files Pipeline. According to the documentation (https://doc.scrapy.org/en/latest/topics/media-pipeline.html#downloading-and-processing-files-and-images), this pipeline avoids "re-downloading media that was downloaded recently", but it does not say how long ago "recently" is, or how to set this parameter.

Looking at the implementation of the FilesPipeline class in https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py, it appears that this value comes from the FILES_EXPIRES setting, which defaults to 90 days:

class FilesPipeline(MediaPipeline):
    """Abstract pipeline that implement the file downloading
    This pipeline tries to minimize network transfers and file processing,
    doing stat of the files and determining if file is new, uptodate or
    expired.
    `new` files are those that pipeline never processed and needs to be
        downloaded from supplier site the first time.
    `uptodate` files are the ones that the pipeline processed and are still
        valid files.
    `expired` files are those that pipeline already processed but the last
        modification was made long time ago, so a reprocessing is recommended to
        refresh it in case of change.
    """

    MEDIA_NAME = "file"
    EXPIRES = 90
    STORE_SCHEMES = {
        '': FSFilesStore,
        'file': FSFilesStore,
        's3': S3FilesStore,
    }
    DEFAULT_FILES_URLS_FIELD = 'file_urls'
    DEFAULT_FILES_RESULT_FIELD = 'files'

    def __init__(self, store_uri, download_func=None, settings=None):
        if not store_uri:
            raise NotConfigured

        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)

        cls_name = "FilesPipeline"
        self.store = self._get_store(store_uri)
        resolve = functools.partial(self._key_for_pipe,
                                    base_class_name=cls_name,
                                    settings=settings)
        self.expires = settings.getint(
            resolve('FILES_EXPIRES'), self.EXPIRES
        )
        if not hasattr(self, "FILES_URLS_FIELD"):
            self.FILES_URLS_FIELD = self.DEFAULT_FILES_URLS_FIELD
        if not hasattr(self, "FILES_RESULT_FIELD"):
            self.FILES_RESULT_FIELD = self.DEFAULT_FILES_RESULT_FIELD
        self.files_urls_field = settings.get(
            resolve('FILES_URLS_FIELD'), self.FILES_URLS_FIELD
        )
        self.files_result_field = settings.get(
            resolve('FILES_RESULT_FIELD'), self.FILES_RESULT_FIELD
        )

        super(FilesPipeline, self).__init__(download_func=download_func, settings=settings)

    @classmethod
    def from_settings(cls, settings):
        s3store = cls.STORE_SCHEMES['s3']
        s3store.AWS_ACCESS_KEY_ID = settings['AWS_ACCESS_KEY_ID']
        s3store.AWS_SECRET_ACCESS_KEY = settings['AWS_SECRET_ACCESS_KEY']
        s3store.POLICY = settings['FILES_STORE_S3_ACL']

        store_uri = settings['FILES_STORE']
        return cls(store_uri, settings=settings)

    def _get_store(self, uri):
        if os.path.isabs(uri):  # to support win32 paths like: C:\\some\dir
            scheme = 'file'
        else:
            scheme = urlparse(uri).scheme
        store_cls = self.STORE_SCHEMES[scheme]
        return store_cls(uri)

    def media_to_download(self, request, info):
        def _onsuccess(result):
            if not result:
                return  # returning None force download

            last_modified = result.get('last_modified', None)
            if not last_modified:
                return  # returning None force download

            age_seconds = time.time() - last_modified
            age_days = age_seconds / 60 / 60 / 24
            if age_days > self.expires:
                return  # returning None force download

Am I understanding this correctly? Also, I do not see a boolean check analogous to the age_days one in the S3FilesStore class; is the age check also implemented for files on S3? (I was also unable to find any tests that exercise the age-check feature for S3.)
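The age check quoted above can be isolated into a small standalone sketch. The helper below mirrors the arithmetic in _onsuccess (names like is_expired are illustrative, not part of Scrapy's API); it shows that "recently" simply means "last modified fewer than FILES_EXPIRES days ago":

```python
import time

# Illustrative helper mirroring the age check in FilesPipeline.media_to_download.
# expires_days corresponds to the FILES_EXPIRES setting (default 90).
def is_expired(last_modified, expires_days=90, now=None):
    """Return True if a stat result's last_modified timestamp is older than
    expires_days, meaning the pipeline would re-download the file."""
    if now is None:
        now = time.time()
    age_seconds = now - last_modified
    age_days = age_seconds / 60 / 60 / 24
    return age_days > expires_days

# Fix "now" so the example is deterministic.
now = 1_000_000_000.0
assert not is_expired(now - 89 * 24 * 3600, expires_days=90, now=now)  # fresh: skipped
assert is_expired(now - 91 * 24 * 3600, expires_days=90, now=now)      # expired: re-downloaded
```

Note that in the real pipeline, returning None from _onsuccess (rather than a truthy result) is what forces the download.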

2 Answers:

Answer 0 (score: 2)

FILES_EXPIRES is indeed the setting that tells the FilesPipeline how "old" a file can be before it is downloaded (again).

The key part of the code is in media_to_download: the _onsuccess callback checks the result of the pipeline's self.store.stat_file call, and for your question it specifically looks for the "last_modified" information. If the last-modified time is older than the expiry days, the download is triggered.

You can check how the S3store gets the "last modified" information. It depends on whether botocore is available or not.
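Putting this answer into practice, the expiry window is configured in the project's settings.py. A minimal sketch (the bucket name is hypothetical; FILES_EXPIRES is expressed in days):

```python
# settings.py -- illustrative configuration
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 's3://my-bucket/files/'  # hypothetical S3 bucket
FILES_EXPIRES = 30  # files older than 30 days are considered expired and re-downloaded
```

With FILES_EXPIRES = 30, a file whose last_modified stat is 29 days old is skipped, while one 31 days old is fetched again.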

Answer 1 (score: 1)

The one-line answer to this is: class FilesPipeline(MediaPipeline): is the only class responsible for managing, validating, and downloading files to the local path. class S3FilesStore(object): just picks up the files from the local path and uploads them to S3.

class FSFilesStore is the class that manages all the local paths, and FilesPipeline uses them to store the files locally.
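The division of labour between these store classes is decided by the scheme of FILES_STORE, as seen in _get_store in the question's code. A simplified sketch of that dispatch (the store names here are stand-in strings, not the real classes):

```python
import os
from urllib.parse import urlparse

# Stand-ins for the real FSFilesStore / S3FilesStore classes.
STORE_SCHEMES = {'': 'FSFilesStore', 'file': 'FSFilesStore', 's3': 'S3FilesStore'}

def store_for(uri):
    """Pick a store by URI scheme, treating absolute paths as local files,
    mirroring FilesPipeline._get_store."""
    scheme = 'file' if os.path.isabs(uri) else urlparse(uri).scheme
    return STORE_SCHEMES[scheme]

assert store_for('s3://my-bucket/files/') == 'S3FilesStore'
assert store_for('/var/data/files') == 'FSFilesStore'
assert store_for('relative/path') == 'FSFilesStore'  # empty scheme maps to local storage
```

So an s3:// value for FILES_STORE is what routes stat and upload calls to S3FilesStore instead of the filesystem store.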

Links:

https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py#L264
https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py#L397
https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py#L299