Getting the path of scraped images in Scrapy

Time: 2015-06-05 09:55:43

Tags: python web-scraping scrapy scrapy-spider

I am writing an image scraper with Scrapy, using the default ImagesPipeline.

In general, everything works fine so far. But I cannot get the path of the scraped images.

items.py

import scrapy

class MyItem(scrapy.Item):
    name        = scrapy.Field()
    type        = scrapy.Field()
    image_urls  = scrapy.Field()
    images      = scrapy.Field()

pipelines.py

import scrapy
from scrapy.exceptions import DropItem

class MyPipeline(object):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

myspider.py

import scrapy

from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from mycrawler.items import MyItem

class VscrawlerSpider(CrawlSpider):
    """docstring for VscrawlerSpider"""
    name = "myspider"
    allowed_domains = ["vesselfinder.com"]
    start_urls = [
        "https://www.vesselfinder.com/vessels?page=1"
    ]
    rules = [
        Rule(LinkExtractor(allow=r'vesselfinder.com/vessels\?page=[1-4]'),
             callback='parse_item', follow=True)
    ]

    def parse_item(self, response):
        ships = response.xpath('//div[@class="items"]/article')

        for ship in ships:
            item = MyItem()

            item['name'] = ship.xpath('div[2]/header/h1/a/text()').extract()[1].strip()
            item['image_urls'] = [ship.xpath('div[1]/a/picture/img/@src').extract()[0]]
            item['type'] = ship.xpath('div[2]/div[2]/div[2]/text()').extract()[0]

            str = item['image_paths'][0] + item['type'] + item['name']

            yield item

I get this error:

exceptions.KeyError: 'image_paths'

I tried using item['images'][0].path instead, but that still raises an error. Where does this error come from?

1 Answer:

Answer 0 (score: 0)

You have not defined an image_paths field on the item. Define it:

class MyItem(scrapy.Item):
    # ...
    image_paths = scrapy.Field()

Most likely, you meant to use the images field instead.
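As a rough sketch of how the paths could be read: the images pipeline reports each download as an (ok, info) tuple, and for successful downloads info is a dict with 'url', 'path' and 'checksum' keys. The images field holds the same dicts, so the lookup would be item['images'][0]['path'] rather than .path. The path is relative to the IMAGES_STORE setting. Note also that these values only exist once the pipeline has processed the item, so they cannot be read inside parse_item. The helper names and sample data below are hypothetical and only mimic that structure, so the snippet runs without Scrapy.

```python
import os


def extract_image_paths(results):
    """Return the stored paths of all successfully downloaded images.

    `results` is assumed to have the shape passed to item_completed():
    a list of (ok, info) tuples, where info is a dict for successes.
    """
    return [info['path'] for ok, info in results if ok]


def absolute_image_path(images_store, relative_path):
    """Join the IMAGES_STORE directory with a path reported by the pipeline."""
    return os.path.join(images_store, relative_path)


# Hypothetical sample data, for illustration only.
sample_results = [
    (True, {'url': 'https://example.com/a.jpg',
            'path': 'full/0a1b2c.jpg',
            'checksum': 'abc123'}),
    (False, Exception('download failed')),  # failed downloads are skipped
]

paths = extract_image_paths(sample_results)
print(paths)
print(absolute_image_path('/data/images', paths[0]))
```

Any string that combines the image path with type and name would then be built in item_completed (or a later pipeline stage), not in the spider callback.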