Changing the naming convention of the image pipeline

Date: 2016-05-05 04:16:13

Tags: python scrapy

Update

This is embarrassing, but it turns out the problem with my original pipeline was that I had forgotten to activate it in my settings. In any case, eLRuLL was right.
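
For reference, "activating" a pipeline means listing it under ITEM_PIPELINES in the project's settings.py. A minimal sketch, assuming the pipeline lives in ngamedallions/pipelines.py (both the module path and the priority number are assumptions, adjust them to your project):

# settings.py -- module path and priority are assumptions, adjust to your project
ITEM_PIPELINES = {
    'ngamedallions.pipelines.NgamedallionsPipeline': 300,
}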

I'm at the point where I have a spider that reliably retrieves the information I'm interested in and outputs it in the format I want. What I hope is the last stumbling block is applying a more sensible naming convention to the files saved by my image pipeline. The SHA1 hash works fine, but I find it really unpleasant to work with.

I haven't been able to make sense of the documentation to figure out how to change the naming scheme, and I've had no luck blindly applying this solution. As part of my process I already extract a unique identifier for each page; I'd like to use it to name the image, since there is only one image per page.

The image pipeline also doesn't seem to respect the fields_to_export section of my pipeline. I'd like to suppress the image URLs to give myself a cleaner, more readable output. If anyone knows how to do that, I'd be very grateful.

The unique identifier is extracted in my parse as CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()'). You'll find my spider and my pipeline below.

Spider:

URL = "http://www.nga.gov/content/ngaweb/Collection/art-object-page.%d"
starting_number = 1315
number_of_pages = 1311
class NGASpider(CrawlSpider):
    name = 'ngamedallions'
    allowed_domains = ['nga.gov']
    start_urls = [URL % i + '.html' for i in range (starting_number, number_of_pages, -1)]
    rules = (
            Rule(LinkExtractor(allow=('art-object-page.*','objects/*')),callback='parse_CatalogRecord',
follow=True
),)



    def parse_CatalogRecord(self, response):
        CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
        CatalogRecord.default_output_processor = TakeFirst()
        CatalogRecord.image_urls_out = scrapy.loader.processors.Identity()
        keywords = "reverse|obverse and (medal|medallion)"
        notkey = "Image Not Available"
        n = re.compile('.*(%s).*' % notkey, re.IGNORECASE|re.MULTILINE|re.UNICODE)
        r = re.compile('.*(%s).*' % keywords, re.IGNORECASE|re.MULTILINE|re.UNICODE)
        if not n.search(response.body_as_unicode()):
            if r.search(response.body_as_unicode()):
                CatalogRecord.add_xpath('title', './/dl[@class="artwork-details"]/dt[@class="title"]/text()')
                CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()')
                CatalogRecord.add_xpath('inscription', './/div[@id="inscription"]/p/text()', Join(), re='[A-Z]+')
                CatalogRecord.add_xpath('image_urls', './/img[@class="mainImg"]/@src')
                CatalogRecord.add_xpath('date', './/dt[@class="title"]', re='(\d+-\d+)')

                return CatalogRecord.load_item()
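
The ItemLoader above assumes an item class that defines the fields it populates. A minimal items.py sketch, with field names taken from the parse_CatalogRecord calls (everything else is an assumption; the images field is included because Scrapy's image pipeline stores its download results there by default):

import scrapy


class NgamedallionsItem(scrapy.Item):
    title = scrapy.Field()
    accession = scrapy.Field()
    inscription = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()  # populated by the images pipeline after download
    date = scrapy.Field()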

Pipeline:

from scrapy import signals
from scrapy.exporters import CsvItemExporter


class NgamedallionsPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = ['accession', 'title', 'date', 'inscription']
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

1 answer:

Answer 0 (score: 2)

Regarding renaming the images written to disk, here is one way to do it:

1. override get_media_requests() to attach the information (via meta) to the image Requests generated by the pipeline
2. override file_path() and use the information from meta

Example custom ImagesPipeline:

    import scrapy
    from scrapy.pipelines.images import ImagesPipeline
    
    
    class NgaImagesPipeline(ImagesPipeline):
    
        def get_media_requests(self, item, info):
            # use 'accession' as name for the image when it's downloaded
            return [scrapy.Request(x, meta={'image_name': item["accession"]})
                    for x in item.get('image_urls', [])]
    
        # write in current folder using the name we chose before
        def file_path(self, request, response=None, info=None):
            return '%s.jpg' % request.meta['image_name']
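
Note that the path returned by file_path() is interpreted relative to the IMAGES_STORE setting, so that still needs to be configured; a minimal sketch (the directory name is only an example):

    # settings.py -- IMAGES_STORE is required by the images pipeline;
    # the directory name here is only an example
    IMAGES_STORE = 'images'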
    

Regarding the exported fields, @eLRuLL's suggestion works for me:

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy import signals
    from scrapy.exporters import CsvItemExporter
    
    
    class NgaCsvPipeline(object):
        def __init__(self):
            self.files = {}
    
        @classmethod
        def from_crawler(cls, crawler):
            pipeline = cls()
            crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
            crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
            return pipeline
    
        def spider_opened(self, spider):
            ofile = open('%s_items.csv' % spider.name, 'w+b')
            self.files[spider] = ofile
            self.exporter = CsvItemExporter(ofile,
                fields_to_export = ['accession', 'title', 'date', 'inscription'])
            self.exporter.start_exporting()
    
        def spider_closed(self, spider):
            self.exporter.finish_exporting()
            ofile = self.files.pop(spider)
            ofile.close()
    
        def process_item(self, item, spider):
            self.exporter.export_item(item)
            return item
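
Both of these classes still need to be enabled in settings.py to take effect; a minimal sketch (the module path and priority values are assumptions, adjust them to your project):

    # settings.py -- module path and priorities are assumptions
    ITEM_PIPELINES = {
        'ngamedallions.pipelines.NgaImagesPipeline': 1,
        'ngamedallions.pipelines.NgaCsvPipeline': 300,
    }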