I previously asked a similar question (How does Scrapy avoid re-downloading media that was downloaded recently?), but since I did not receive a definitive answer, I am asking it again.
I have used Scrapy's Files Pipeline to download a large number of files to an AWS S3 bucket. According to the documentation (https://doc.scrapy.org/en/latest/topics/media-pipeline.html#downloading-and-processing-files-and-images), this pipeline avoids "re-downloading media that was downloaded recently", but it does not say how long ago "recently" is, or how to set this parameter.
Looking at the implementation of the FilesPipeline class at https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py, it appears that this value is taken from the FILES_EXPIRES setting, which defaults to 90 days:
class FilesPipeline(MediaPipeline):
    """Abstract pipeline that implement the file downloading

    This pipeline tries to minimize network transfers and file processing,
    doing stat of the files and determining if file is new, uptodate or
    expired.

    `new` files are those that pipeline never processed and needs to be
        downloaded from supplier site the first time.

    `uptodate` files are the ones that the pipeline processed and are still
        valid files.

    `expired` files are those that pipeline already processed but the last
        modification was made long time ago, so a reprocessing is recommended to
        refresh it in case of change.

    """

    MEDIA_NAME = "file"
    EXPIRES = 90
    STORE_SCHEMES = {
        '': FSFilesStore,
        'file': FSFilesStore,
        's3': S3FilesStore,
    }
    DEFAULT_FILES_URLS_FIELD = 'file_urls'
    DEFAULT_FILES_RESULT_FIELD = 'files'

    def __init__(self, store_uri, download_func=None, settings=None):
        if not store_uri:
            raise NotConfigured

        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)

        cls_name = "FilesPipeline"
        self.store = self._get_store(store_uri)
        resolve = functools.partial(self._key_for_pipe,
                                    base_class_name=cls_name,
                                    settings=settings)
        self.expires = settings.getint(
            resolve('FILES_EXPIRES'), self.EXPIRES
        )
        if not hasattr(self, "FILES_URLS_FIELD"):
            self.FILES_URLS_FIELD = self.DEFAULT_FILES_URLS_FIELD
        if not hasattr(self, "FILES_RESULT_FIELD"):
            self.FILES_RESULT_FIELD = self.DEFAULT_FILES_RESULT_FIELD
        self.files_urls_field = settings.get(
            resolve('FILES_URLS_FIELD'), self.FILES_URLS_FIELD
        )
        self.files_result_field = settings.get(
            resolve('FILES_RESULT_FIELD'), self.FILES_RESULT_FIELD
        )

        super(FilesPipeline, self).__init__(download_func=download_func, settings=settings)

    @classmethod
    def from_settings(cls, settings):
        s3store = cls.STORE_SCHEMES['s3']
        s3store.AWS_ACCESS_KEY_ID = settings['AWS_ACCESS_KEY_ID']
        s3store.AWS_SECRET_ACCESS_KEY = settings['AWS_SECRET_ACCESS_KEY']
        s3store.POLICY = settings['FILES_STORE_S3_ACL']

        store_uri = settings['FILES_STORE']
        return cls(store_uri, settings=settings)

    def _get_store(self, uri):
        if os.path.isabs(uri):  # to support win32 paths like: C:\\some\dir
            scheme = 'file'
        else:
            scheme = urlparse(uri).scheme
        store_cls = self.STORE_SCHEMES[scheme]
        return store_cls(uri)

    def media_to_download(self, request, info):
        def _onsuccess(result):
            if not result:
                return  # returning None force download

            last_modified = result.get('last_modified', None)
            if not last_modified:
                return  # returning None force download

            age_seconds = time.time() - last_modified
            age_days = age_seconds / 60 / 60 / 24
            if age_days > self.expires:
                return  # returning None force download
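
To check my reading of that expiry logic, here is a minimal standalone sketch of the same computation (the helper name is_expired is mine, not Scrapy's; 90 mirrors the EXPIRES default above):

import time

def is_expired(last_modified_epoch, expires_days=90):
    # Same arithmetic as _onsuccess above: age in days vs. the threshold.
    age_days = (time.time() - last_modified_epoch) / 60 / 60 / 24
    return age_days > expires_days

# A file last modified 91 days ago exceeds the default and would be re-downloaded:
print(is_expired(time.time() - 91 * 24 * 60 * 60))  # True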
Is my understanding correct? Also, I do not see a similar boolean check on age_days in the S3FilesStore class; is the age check also implemented for files on S3? (I also could not find any tests exercising this age-check feature for S3.)
Answer 0 (score: 2)
FILES_EXPIRES is indeed the setting that tells the FilesPipeline how "old" a file can be before it gets downloaded (again).
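For example, to shorten the window you would set it in your project's settings.py (a minimal sketch; the bucket name and the 30-day value are illustrative):

# settings.py
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 's3://my-bucket/files/'  # hypothetical bucket
FILES_EXPIRES = 30  # treat files older than 30 days as expired and re-download them

And because the setting name is resolved through _key_for_pipe, a FilesPipeline subclass can also be configured with a class-prefixed variant such as MYPIPELINE_FILES_EXPIRES.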
The key part of the code is in media_to_download: the _onsuccess callback checks the result of the pipeline's self.store.stat_file call, and for your question it specifically looks for the "last_modified" information. If the last-modified time is older than the expiry threshold ("expires days"), the download is triggered.
You can check how the S3store gets the "last modified" information. It depends on whether botocore is available or not.
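
In other words, the age check does apply to files on S3: S3FilesStore implements its own stat_file, which looks up the object's metadata and converts its LastModified timestamp into the epoch value that media_to_download compares against. A condensed sketch of the botocore branch (paraphrased, with a made-up function name, not a verbatim copy of the Scrapy source):

import time

def stat_s3_file(s3_client, bucket, key_name):
    # Roughly what S3FilesStore.stat_file does when botocore is available:
    # a HEAD request for the object's metadata, returning checksum + last_modified.
    head = s3_client.head_object(Bucket=bucket, Key=key_name)
    checksum = head['ETag'].strip('"')
    last_modified = time.mktime(head['LastModified'].timetuple())
    return {'checksum': checksum, 'last_modified': last_modified}

The pipeline then plugs that last_modified epoch into the age_days computation shown in the question.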
Answer 1 (score: 1)
The one-line answer to this is: class FilesPipeline(MediaPipeline) is the only class responsible for managing, validating, and downloading files to local paths. class S3FilesStore(object) just picks up the files from the local path and uploads them to S3.
class FSFilesStore is the one that manages all the local paths, and FilesPipeline uses it to store the files at those local paths.
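For local storage, the "is it up to date?" question is answered by FSFilesStore.stat_file, which reads the file's modification time straight from disk. A condensed sketch of that idea (paraphrased, with a made-up function name, not a verbatim copy of the Scrapy source):

import hashlib
import os

def stat_local_file(absolute_path):
    # Roughly what FSFilesStore.stat_file does: report mtime plus a checksum,
    # or an empty dict (treated as "new", i.e. download) when the file is missing.
    try:
        last_modified = os.path.getmtime(absolute_path)
    except OSError:
        return {}
    with open(absolute_path, 'rb') as f:
        checksum = hashlib.md5(f.read()).hexdigest()
    return {'checksum': checksum, 'last_modified': last_modified}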
Links:
https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py#L264
https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py#L397
https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py#L299