I'm trying to override the default path full/hash.jpg with <dynamic>/hash.jpg. I've tried the following code, taken from "How to download scrapy images in a dyanmic folder":
def item_completed(self, results, item, info):
    for result in [x for ok, x in results if ok]:
        path = result['path']
        # here we create the session-path where the files should be in the end
        # you'll have to change this path creation depending on your needs
        slug = slugify(item['category'])
        target_path = os.path.join(slug, os.path.basename(path))
        # try to move the file and raise exception if not possible
        if not os.rename(path, target_path):
            raise DropItem("Could not move image to target folder")
    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item
But I get:
Traceback (most recent call last):
  File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 839, in _cbDeferred
    self.callback(self.resultList)
  File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 382, in callback
    self._startRunCallbacks(result)
  File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/user/Projects/sepid/scraper/scraper/pipelines.py", line 44, in item_completed
    if not os.rename(path, target_path):
exceptions.OSError: [Errno 2] No such file or directory
I don't know what's wrong. Is there another way to change this path? Thanks.
Answer 0 (score: 7)
I created a pipeline that inherits from ImagesPipeline and overrides the file_path method, and used it instead of the standard ImagesPipeline:
import hashlib

from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.python import to_bytes

class StoreImgPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        # SHA1 of the URL, as in the stock pipeline; YEAR is a constant defined elsewhere
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        return 'realty-sc/%s/%s/%s/%s.jpg' % (YEAR, image_guid[:2], image_guid[2:4], image_guid)
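To make scrapy pick up the custom class, register it in the project settings instead of the stock pipeline. A minimal sketch, assuming your project package is called scraper:

    # settings.py -- 'scraper' is a placeholder for your project package name
    ITEM_PIPELINES = {
        'scraper.pipelines.StoreImgPipeline': 1,
    }
    # root directory that the paths returned by file_path() are joined to
    IMAGES_STORE = '/path/to/images'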
Answer 1 (score: 2)
The problem is raised because the destination folder doesn't exist. A quick solution is:
def item_completed(self, results, item, info):
    for result in [x for ok, x in results if ok]:
        path = result['path']
        slug = slugify(item['designer'])
        settings = get_project_settings()
        storage = settings.get('IMAGES_STORE')
        target_path = os.path.join(storage, slug, os.path.basename(path))
        path = os.path.join(storage, path)
        # If the slug directory doesn't exist, create it
        if not os.path.exists(os.path.join(storage, slug)):
            os.makedirs(os.path.join(storage, slug))
        if not os.rename(path, target_path):
            raise DropItem("Could not move image to target folder")
    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item
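For reference, a sketch of the imports this snippet relies on (the slugify import assumes the python-slugify package; adjust to whatever your project uses):

    import os

    from scrapy.exceptions import DropItem
    from scrapy.pipelines.images import ImagesPipeline  # the class whose method is overridden here
    from scrapy.utils.project import get_project_settings
    from slugify import slugify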
Answer 2 (score: 0)
To dynamically set the path of images downloaded by a scrapy spider before they are downloaded, rather than moving them afterwards, I created a custom pipeline overriding the get_media_requests and file_path methods:
class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        return [Request(url, meta={'f1': item.get('field1'),
                                   'f2': item.get('field2'),
                                   'f3': item.get('field3'),
                                   'f4': item.get('field4')})
                for url in item.get(self.images_urls_field, [])]

    def file_path(self, request, response=None, info=None):
        ## start of deprecation warning block (can be removed in the future)
        def _warn():
            from scrapy.exceptions import ScrapyDeprecationWarning
            import warnings
            warnings.warn('ImagesPipeline.image_key(url) and file_key(url) methods are deprecated, '
                          'please use file_path(request, response=None, info=None) instead',
                          category=ScrapyDeprecationWarning, stacklevel=1)

        # check if called from image_key or file_key with url as first argument
        if not isinstance(request, Request):
            _warn()
            url = request
        else:
            url = request.url

        # detect if file_key() or image_key() methods have been overridden
        if not hasattr(self.file_key, '_base'):
            _warn()
            return self.file_key(url)
        elif not hasattr(self.image_key, '_base'):
            _warn()
            return self.image_key(url)
        ## end of deprecation warning block

        image_guid = hashlib.sha1(to_bytes(url)).hexdigest()
        return '%s/%s/%s/%s/%s.jpg' % (request.meta['f1'], request.meta['f2'],
                                       request.meta['f3'], request.meta['f4'], image_guid)
This approach assumes you define a scrapy.Item in your spider; replace, e.g., "field1" with your specific field names. Setting Request.meta in get_media_requests allows the item's field values to determine the download directory for each item, as seen in the return statement of file_path. Scrapy creates the directories automatically if they don't exist. The custom pipeline class definition is saved in my project's pipelines.py. The methods here are adapted directly from the default scrapy pipeline images.py, which on my Mac is stored in ~/anaconda3/pkgs/scrapy-1.5.0-py36_0/lib/python3.6/site-packages/scrapy/pipelines/. The imports and other methods can be copied from that file as needed.
Answer 3 (score: -1)
The check

    if not os.rename(path, target_path):
        raise DropItem("Could not move image to target folder")

always raises DropItem, because os.rename returns None even when the move succeeds. I imported the shutil library instead, and my code is:
import shutil

def item_completed(self, results, item, info):
    for result in [x for ok, x in results if ok]:
        path = result['path']
        slug = slugify(item['designer'])
        settings = get_project_settings()
        storage = settings.get('IMAGES_STORE')
        target_path = os.path.join(storage, slug, os.path.basename(path))
        path = os.path.join(storage, path)
        # If the slug directory doesn't exist, create it
        if not os.path.exists(os.path.join(storage, slug)):
            os.makedirs(os.path.join(storage, slug))
        shutil.move(path, target_path)
    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item
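As a side benefit, shutil.move also works when source and target sit on different filesystems (it falls back to copy-and-delete), which plain os.rename does not.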
I hope it works for you too :)