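I have read all the questions here, but I still don't understand.
Apart from naming the images, the code below does what I need, so I tried to change the name in the items.py file; please check the comments in it.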
settings.py
SPIDER_MODULES = ['xxx.spiders']
NEWSPIDER_MODULE = 'xxx.spiders'
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/home/magicnt/xxx/images'
items.py
import scrapy


class XxxItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    image_urls = scrapy.Field()
    # images = scrapy.Field()  # <--- with this line it works, using the default image names
    images = title  # <--- I tried to rename here, but it does not work
spider.py
from xxx.items import XxxItem
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem


class CoverSpider(scrapy.Spider):
    name = "pyimagesearch-cover-spider"
    start_urls = ['https://xxx.com.br/product']

    def parse(self, response):
        for bimb in response.css('#mod_imoveis_result'):
            imageURL = bimb.xpath('./div[@id="g-img-imo"]/div[@class="img_p_results"]/img/@src').extract_first()
            title = bimb.css('#titulo_imovel::text').extract_first()
            yield {
                'image_urls': [response.urljoin(imageURL)],
                'title': title
            }

        next_page = response.xpath('//a[contains(@class, "num_pages") and contains(@class, "pg_number_next")]/@href').extract_first()
        # extract_first() returns None on the last page, so only follow a real link
        if next_page:
            yield response.follow(next_page, self.parse)
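My goal is to rename the downloaded images using the title from the item. Any hint toward this goal is welcome.
I am completely new to Python and OO; I usually scrape with procedural PHP, but I have realized how good Scrapy is at scraping, so please be patient and help.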
Answer 0 (score: 0)
My code is based on Scrapy Image Pipeline: How to rename images?. I tested it a week ago and it works on my own spider.
import scrapy
from scrapy.pipelines.images import ImagesPipeline


# This pipeline is designed for an item with multiple images
class ImagesWithNamesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Each value in the "image_names" field must already end in ".jpg".
        # You can change "image_names" to your own file-name field,
        # but it has to be a list with one name per image URL.
        for (image_url, image_name) in zip(item[self.IMAGES_URLS_FIELD], item["image_names"]):
            yield scrapy.Request(url=image_url, meta={"image_name": image_name})

    def file_path(self, request, response=None, info=None):
        # Use the name carried in the request meta as the relative path
        image_name = request.meta["image_name"]
        return image_name
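Since the pipeline above expects a list of ready-made file names, the title has to be turned into one safe name per image before the item is yielded. Here is a minimal sketch of one way to do that. The helpers title_to_image_name and build_image_names are hypothetical and not part of Scrapy, and the slug rules (ASCII only, hyphens, an index suffix) are just one possible choice.

```python
import re
import unicodedata


def title_to_image_name(title, index=0, suffix=".jpg"):
    """Hypothetical helper: turn an item title into a safe relative file name."""
    # Strip accents so the name is plain ASCII.
    ascii_title = (
        unicodedata.normalize("NFKD", title or "")
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", ascii_title).strip("-").lower()
    if not slug:
        slug = "untitled"
    # The index keeps several images of one item from overwriting each other.
    return "%s-%d%s" % (slug, index, suffix)


def build_image_names(title, image_urls):
    """Hypothetical helper: one file name per URL, in the same order."""
    return [title_to_image_name(title, i) for i, _ in enumerate(image_urls)]
```

In the spider, the yielded dict would then carry an extra key, for example 'image_names': build_image_names(title, urls), next to 'image_urls'.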
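ImagesPipeline works as follows. The pipeline executes image_downloaded -> get_images -> file_path in turn ("->" means "calls").
image_downloaded: saves the images returned by get_images by calling persist_file
get_images: converts the images to JPEG
file_path: returns the relative path of the image
I went through the source code of ImagesPipeline and found that there is no special field for renaming images. Scrapy names them as follows: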
def file_path(self, request, response=None, info=None):
    # abridged: in the full source, url is taken from request.url
    image_guid = hashlib.sha1(to_bytes(url)).hexdigest()  # change to request.url after deprecation
    return 'full/%s.jpg' % (image_guid)
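To see what this default naming produces, the same computation can be reproduced outside Scrapy. This is only an illustrative sketch: default_image_path is a made-up name, and it assumes the URL is encoded as UTF-8 before hashing.

```python
import hashlib


def default_image_path(url):
    """Mimic the default naming: SHA-1 hex digest of the URL under 'full/'."""
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/%s.jpg" % image_guid


if __name__ == "__main__":
    # The same URL always maps to the same 40-character hex name.
    print(default_image_path("https://xxx.com.br/product/img1.jpg"))
```

The name depends only on the URL, which is why the item's title never shows up in the stored file name unless file_path is overridden.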
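Therefore, we should override the method file_path. According to the source code of FilesPipeline, which ImagesPipeline inherits from, we only need to return a relative path, and persist_file will do the rest.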
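An overridden pipeline only takes effect once it is registered in settings.py in place of the stock one. A sketch of that change, assuming the subclass is saved in a module named xxx/pipelines.py (the module path is an assumption; adjust it to wherever the class actually lives):

```python
# settings.py (sketch)
# "xxx.pipelines" is an assumed module path for the custom pipeline class.
ITEM_PIPELINES = {
    # 'scrapy.pipelines.images.ImagesPipeline': 1,  # stock pipeline, replaced
    "xxx.pipelines.ImagesWithNamesPipeline": 1,
}
# Relative paths returned by file_path are stored under this directory.
IMAGES_STORE = "/home/magicnt/xxx/images"
```

If the stock ImagesPipeline stays enabled alongside the subclass, each image would be downloaded and stored by both, so only one of them should be listed.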