How to download scrapy images in a dynamic folder

Asked: 2015-01-18 07:20:08

Tags: python scrapy

I'm trying to override the default path full/hash.jpg with <dynamic>/hash.jpg. I've tried it with the code below, from How to download scrapy images in a dyanmic folder:

def item_completed(self, results, item, info):
    for result in [x for ok, x in results if ok]:
        path = result['path']
        # here we create the session-path where the files should be in the end
        # you'll have to change this path creation depending on your needs
        slug = slugify(item['category'])
        target_path = os.path.join(slug, os.path.basename(path))

        # try to move the file and raise exception if not possible
        if not os.rename(path, target_path):
            raise DropItem("Could not move image to target folder")

    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item

But I get this traceback:

Traceback (most recent call last):
  File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 839, in _cbDeferred
    self.callback(self.resultList)
  File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 382, in callback
    self._startRunCallbacks(result)
  File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/user/Projects/sepid/scraper/scraper/pipelines.py", line 44, in item_completed
    if not os.rename(path, target_path):
exceptions.OSError: [Errno 2] No such file or directory

I don't know what's wrong here. Is there another way to change the path? Thanks.

4 Answers:

Answer 0 (score: 7):

I created a pipeline that inherits from ImagesPipeline and overrides the file_path method, and used it instead of the standard ImagesPipeline:

import hashlib

from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.python import to_bytes

class StoreImgPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        # YEAR is defined elsewhere; path becomes realty-sc/<YEAR>/<xx>/<yy>/<sha1>.jpg
        return 'realty-sc/%s/%s/%s/%s.jpg' % (YEAR, image_guid[:2], image_guid[2:4], image_guid)
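
To make Scrapy use the subclass instead of the stock pipeline, register it in settings.py. A minimal sketch, assuming the project module is named scraper as in the traceback above, and a placeholder storage path:

    ITEM_PIPELINES = {'scraper.pipelines.StoreImgPipeline': 1}
    IMAGES_STORE = '/path/to/images'  # base directory; file_path() return values are relative to it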

Answer 1 (score: 2):

The exception is raised because the destination folder doesn't exist. A quick solution is:

def item_completed(self, results, item, info):
    for result in [x for ok, x in results if ok]:
        path = result['path']
        slug = slugify(item['designer'])

        settings = get_project_settings()
        storage = settings.get('IMAGES_STORE')

        target_path = os.path.join(storage, slug, os.path.basename(path))
        path = os.path.join(storage, path)

        # if the destination folder doesn't exist, create it
        if not os.path.exists(os.path.join(storage, slug)):
            os.makedirs(os.path.join(storage, slug))

        if not os.rename(path, target_path):
            raise DropItem("Could not move image to target folder")

    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item
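
For reference, a minimal skeleton showing the imports this method relies on and the class it lives on (the class name and the slugify import are assumptions, not part of the original answer):

    import os

    from scrapy.exceptions import DropItem
    from scrapy.pipelines.images import ImagesPipeline
    from scrapy.utils.project import get_project_settings
    from slugify import slugify  # assumed: the python-slugify package

    class MoveImagesPipeline(ImagesPipeline):  # hypothetical name
        # item_completed() as defined above goes here
        ...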

Answer 2 (score: 0):

To set the path of the images a scrapy spider downloads dynamically before downloading, rather than moving them afterwards, I created a custom pipeline that overrides the get_media_requests and file_path methods.

import hashlib

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.python import to_bytes

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # pass item fields to file_path() through each request's meta
        return [Request(url, meta={'f1': item.get('field1'),
                                   'f2': item.get('field2'),
                                   'f3': item.get('field3'),
                                   'f4': item.get('field4')})
                for url in item.get(self.images_urls_field, [])]

    def file_path(self, request, response=None, info=None):
        ## start of deprecation warning block (can be removed in the future)
        def _warn():
            from scrapy.exceptions import ScrapyDeprecationWarning
            import warnings
            warnings.warn('ImagesPipeline.image_key(url) and file_key(url) methods are deprecated, '
                          'please use file_path(request, response=None, info=None) instead',
                          category=ScrapyDeprecationWarning, stacklevel=1)

        # check if called from image_key or file_key with url as first argument
        if not isinstance(request, Request):
            _warn()
            url = request
        else:
            url = request.url

        # detect if file_key() or image_key() methods have been overridden
        if not hasattr(self.file_key, '_base'):
            _warn()
            return self.file_key(url)
        elif not hasattr(self.image_key, '_base'):
            _warn()
            return self.image_key(url)
        ## end of deprecation warning block

        # build <f1>/<f2>/<f3>/<f4>/<sha1-of-url>.jpg from the meta set above
        image_guid = hashlib.sha1(to_bytes(url)).hexdigest()
        return '%s/%s/%s/%s/%s.jpg' % (request.meta['f1'], request.meta['f2'],
                                       request.meta['f3'], request.meta['f4'], image_guid)

This approach assumes you define a scrapy.Item in your spider; replace, e.g., "field1" with your specific field name. Setting Request.meta in get_media_requests lets each item's field values determine its download directory, as shown in the return statement of file_path. Scrapy creates the directories automatically if they don't exist.
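
A minimal sketch of the kind of item this pipeline expects (the field names are placeholders; image_urls and images are the ImagesPipeline defaults):

    import scrapy

    class MyItem(scrapy.Item):
        field1 = scrapy.Field()
        field2 = scrapy.Field()
        field3 = scrapy.Field()
        field4 = scrapy.Field()
        image_urls = scrapy.Field()  # read via self.images_urls_field in get_media_requests
        images = scrapy.Field()      # receives the download results (IMAGES_RESULT_FIELD)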

The custom pipeline class definition lives in my project's pipelines.py. The methods here come directly from the default scrapy images pipeline, images.py, which on my Mac is stored in ~/anaconda3/pkgs/scrapy-1.5.0-py36_0/lib/python3.6/site-packages/scrapy/pipelines/. Imports and other methods can be copied from that file as needed.

Answer 3 (score: -1):

The solution provided by @neelix is the best one, but when I tried to use it I got some strange results: some files were moved, but not all of them (most likely because os.rename returns None, so the "if not" check always raises DropItem right after the first image is moved, and the remaining results are never processed). So I replaced:

if not os.rename(path, target_path):
    raise DropItem("Could not move image to target folder")

with a shutil.move call. I imported the shutil library, and then my code is:

def item_completed(self, results, item, info):
    for result in [x for ok, x in results if ok]:
        path = result['path']
        slug = slugify(item['designer'])

        settings = get_project_settings()
        storage = settings.get('IMAGES_STORE')

        target_path = os.path.join(storage, slug, os.path.basename(path))
        path = os.path.join(storage, path)

        # if the destination folder doesn't exist, create it
        if not os.path.exists(os.path.join(storage, slug)):
            os.makedirs(os.path.join(storage, slug))

        # shutil.move also works across filesystems and returns no status to check
        shutil.move(path, target_path)

    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item

I hope it works for you guys too :)