I'm using Scrapy and trying to download images with an item and a custom images pipeline. The spider collects the image sources and stores them in an item field, then follows the link to the next page and does the same. It works as expected and downloads all the images, but they are named with the images pipeline's standard "checksum.png" behaviour. Looking at other similar questions, I tried making a custom images pipeline that overrides some methods, pasting in the code, but the best I got was an item yielded with no images.
This is where I've been stuck for a week now. pipeline.py is cluttered with all the code I've tried so far... Thanks for any help.
The code in older Stack Overflow questions either no longer works because it's outdated, or I couldn't implement it correctly in my project.
Spider code:
import scrapy
from .. import items
class ImmortaleSpider(scrapy.Spider):
    name = 'immortale'
    allowed_domains = ['www.mangaeden.com']
    start_urls = ['https://www.mangaeden.com/en/it-manga/limmortale/0/1/']

    def parse(self, response):
        item = items.MangascraperItem()
        urls_list = []
        name_list = []
        for url in response.xpath('//img[@id="mainImg"]/@src').extract():
            urls_list.append("https:" + url)
        item['image_urls'] = urls_list
        name_list.append(response.url.split("/")[-3] + "-" + response.url.split("/")[-2])
        item['image_names'] = name_list
        yield item
        next_page = response.xpath('//a[@class="ui-state-default next"]/@href').extract()
        if next_page:
            next_href = next_page[0]
            next_page_url = response.urljoin(next_href)
            request = scrapy.Request(url=next_page_url)
            yield request
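For reference, the naming logic in parse() can be factored into a small helper (chapter_page_name is a hypothetical name, not part of the spider). Because the page URL ends with a slash, split("/") leaves an empty last element, so indices -3 and -2 pick out the chapter and page parts:

```python
def chapter_page_name(page_url):
    # Mirrors the expression used in parse():
    # response.url.split("/")[-3] + "-" + response.url.split("/")[-2]
    parts = page_url.split("/")
    return parts[-3] + "-" + parts[-2]

print(chapter_page_name('https://www.mangaeden.com/en/it-manga/limmortale/0/1/'))  # 0-1
```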
settings.py:
ITEM_PIPELINES = {
    # 'scrapy.pipelines.images.ImagesPipeline': 2,
    'mangascraper.pipelines.MyImagesPipeline': 1,
}
IMAGES_STORE = 'images/immortale/'
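For context: whatever relative path the pipeline's file_path() returns is joined under IMAGES_STORE (the default implementation returns 'full/&lt;checksum&gt;.jpg', which is why images land in a full/ subfolder). A quick sketch of the resulting location, using '0-1.jpg' as a made-up example name:

```python
import os

IMAGES_STORE = 'images/immortale/'   # value from settings.py
relative = 'full/0-1.jpg'            # example of what a custom file_path() might return
print(os.path.join(IMAGES_STORE, relative))  # images/immortale/full/0-1.jpg
```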
items.py:
import scrapy
class MangascraperItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_names = scrapy.Field()
pipeline.py:
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
class MyImagesPipeline(ImagesPipeline):
    # def get_media_requests(self, item, info):
    #     for img_url in item['image_urls']:
    #         # meta = {'filename': item['image_name']}
    #         meta = {'item': item}
    #         yield scrapy.Request(url=img_url, meta=meta)
    # def file_path(self, request, response=None, info=None):
    #     return scrapy.Request.meta.get('filename', '')
    #
    # def get_media_requests(self, item, info):
    #     return [scrapy.Request(url, meta={'filename': item.get('image_name')}) for url in item.get(self.images_urls_field, [])]
    # def file_path(self, request, response=None, info=None):
    #     print "\n" + scrapy.Request.meta['filename'] + "\n"
    #     return scrapy.Request.meta['filename']
    #
    # def get_media_requests(self, item, info):
    #     img_url = item['image_urls'][0]
    #     meta = {'filename': item['image_names']}
    #     yield scrapy.Request(url=img_url, meta=meta)
    #
    # def item_completed(self, results, item, info):
    #     image_paths = [x['path'] for ok, x in results if ok]
    #     if not image_paths:
    #         raise DropItem("Item contains no images")
    #     item['images'] = image_paths
    #     return item

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(url=image_url, meta={'item': item})

    # def file_path(self, request, response=None, info=None):
    #     item = scrapy.Request.meta['item']
    #     image_guid = item['image_names']
    #     print image_guid
    #     return 'full/%s.jpg' % (image_guid)
    def file_path(self, request, response=None, info=None):
        image_guid = response.url.split("/")[-3] + "-" + response.url.split("/")[-2]
        return 'full/%s.jpg' % (image_guid)

    # def get_media_requests(self, item, info):
    #     yield scrapy.Request(item['image_urls'])

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item
I would like the downloaded images to be saved with the file name stored in the item's "image_names" field, which is populated in "chapter-page" format (1-25, 1-26, 1-27, 2-1, 2-2, and so on) to preserve page order.
Any reference on the workflow, or a step-by-step explanation of the stages inside the pipeline, would be really useful to me.
Update 1: found a workaround using os.rename
pipeline.py:
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings
import os
class MyImagesPipeline(ImagesPipeline):
    IMAGES_STORE = get_project_settings().get("IMAGES_STORE")

    def item_completed(self, result, item, info):
        image_path = [x["path"] for ok, x in result if ok]
        os.rename(self.IMAGES_STORE + "/" + image_path[0], self.IMAGES_STORE + "/full/" + item["image_names"][0] + ".jpg")
        return item

    # TODO: try to override file_path instead of os.rename