Scrapy Images Pipeline overrides custom image names from an item field

Asked: 2019-05-26 12:03:18

Tags: python-2.7 scrapy scrapy-pipeline

I'm using Scrapy and trying to download images with an item and a custom image pipeline. The spider collects the image source, stores it in an item field, then follows the link to the next page and does the same. It works as expected and downloads all the images, but they are named with the image pipeline's default behaviour ("checksum.png"). Looking through other, similar questions, I tried writing a custom image pipeline that overrides certain methods, pasting in code, but the best result I got was an item stored with no images.

I've been stuck on this for a week now. pipeline.py is a mess of all the code I've tried so far... Thanks for any help.

The code in older Stack Overflow questions is either outdated and no longer works, or I couldn't implement it correctly in my project.

Spider code:

import scrapy
from .. import items

class ImmortaleSpider(scrapy.Spider):
    name = 'immortale'
    allowed_domains = ['www.mangaeden.com']
    start_urls = ['https://www.mangaeden.com/en/it-manga/limmortale/0/1/']

    def parse(self, response):
        item = items.MangascraperItem()
        urls_list = []
        name_list = []
        for url in response.xpath('//img[@id="mainImg"]/@src').extract():
            urls_list.append("https:" + url)
        item['image_urls'] = urls_list
        name_list.append(response.url.split("/")[-3] + "-" + response.url.split("/")[-2])
        item['image_names'] = name_list
        yield item

        next_page = response.xpath('//a[@class="ui-state-default next"]/@href').extract()
        if next_page:
            next_href = next_page[0]
            next_page_url = response.urljoin(next_href)
            request = scrapy.Request(url=next_page_url)
            yield request

settings.py:

ITEM_PIPELINES = {
    # 'scrapy.pipelines.images.ImagesPipeline': 2,
    'mangascraper.pipelines.MyImagesPipeline': 1,
}
IMAGES_STORE = 'images/immortale/'

items.py:

import scrapy

class MangascraperItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_names = scrapy.Field()

pipeline.py:

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):
#     def get_media_requests(self, item, info):
#         for img_url in item['image_urls']:
# #            meta = {'filename': item['image_name']}
#             meta = {'item':item}
#             yield scrapy.Request(url=img_url, meta=meta)

    # def file_path(self, request, response=None, info=None):
    #     return scrapy.Request.meta.get('filename','')
    #
    # def get_media_requests(self, item, info):
    #     return [scrapy.Request(url, meta={'filename':item.get('image_name')}) for url in item.get(self.images_urls_field, [])]

    # def file_path(self, request, response=None, info=None):
    #     print "\n" + scrapy.Request.meta['filename'] + "\n"
    #     return scrapy.Request.meta['filename']
    #
    # def get_media_requests(self, item, info):
    #     img_url = item['image_urls'][0]
    #     meta = {'filename': item['image_names']}
    #     yield scrapy.Request(url=img_url, meta=meta)
    #
    # def item_completed(self, results, item, info):
    #     image_paths = [x['path'] for ok, x in results if ok]
    #     if not image_paths:
    #         raise DropItem("Item contains no images")
    #     item['images'] = image_paths
    #     return item

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(url=image_url, meta={'item': item})
    #
    # def file_path(self, request, response=None, info=None):
    #     item = scrapy.Request.meta['item']
    #     image_guid = item['image_names']
    #     print image_guid
    #     return 'full/%s.jpg' % (image_guid)
    #
    def file_path(self, request, response=None, info=None):
        # request is the image request built in get_media_requests,
        # so the item (and its precomputed name) travels in its meta
        item = request.meta['item']
        image_guid = item['image_names'][0]
        return 'full/%s.jpg' % (image_guid)

    # def get_media_requests(self, item, info):
    #     yield scrapy.Request(item['image_urls'])

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item

I expect the downloaded images to be saved with the filename stored in the item's "image_names" field, which is filled in "chapter-page" format (1-25, 1-26, 1-27, 2-1, 2-2, and so on) to preserve page order.
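To check the naming logic in isolation, the "chapter-page" string can be derived from the page URL with a small pure helper (the helper name `page_name` is hypothetical, not part of the project; the split indices are the same ones the spider uses):

```python
def page_name(url):
    """Derive a 'chapter-page' name from a chapter page URL.

    For 'https://.../limmortale/0/1/' the third- and second-to-last
    path segments are the chapter and page numbers ('0' and '1').
    """
    parts = url.split("/")
    return parts[-3] + "-" + parts[-2]

# Same expression the spider uses to fill item['image_names']:
print(page_name('https://www.mangaeden.com/en/it-manga/limmortale/0/1/'))  # 0-1
```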

Any reference on the workflow, or a step-by-step explanation of what happens at each stage inside the pipeline, would be useful to me.

Update 1: found a workaround using os.rename

pipeline.py:

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings
import os

class MyImagesPipeline(ImagesPipeline):

    IMAGES_STORE = get_project_settings().get("IMAGES_STORE")

    def item_completed(self, result, item, info):
        image_path = [x["path"] for ok, x in result if ok]
        os.rename(self.IMAGES_STORE + "/" + image_path[0], self.IMAGES_STORE + "/full/" + item["image_names"][0] + ".jpg")
        return item
# TODO: try to override file_path instead of os.rename

0 answers:

No answers yet.