Scrapy reading URL as char values? (ValueError: Missing scheme in request url: h)

Asked: 2017-10-29 04:15:24

Tags: python scrapy python-3.6

I am trying to get Scrapy to download images from a table on a wikia site, but it keeps giving me "ValueError: Missing scheme in request url: h" when I run the project from the command line.

zhimagespider.py

# -*- coding: utf-8 -*-
import scrapy

from zh.pipelines import ZhImagesPipeline
from zh.items import ImageItem

from utils import get_raw_image

class ZhImageSpider(scrapy.Spider):
    name = 'zh'
    allowed_domains = ['https://zh.battlegirl.wikia.com',
                       'https://vignette.wikia.nocookie.net/']
    start_urls = ['https://zh.battlegirl.wikia.com/wiki/%E5%8D%A1%E7%89%87%E4%B8%80%E8%A6%BD']

    def parse(self, response):
        for row in response.xpath("//tr")[2:]:
            # Initialize dictionary
            item = ImageItem()

            item['image_id'] = row.xpath('td[1]/text()').extract_first()

            # Get icons
            icons = row.css('td:nth-child(2)').xpath('.//@src').extract()
            for icon in icons:
                if icon.startswith('d'): # Or 'data'
                    icons.remove(icon)

            item['image_urls'] = get_raw_image(icons[0])

            yield item

Sample traceback

2017-10-28 22:48:37 [scrapy.core.scraper] ERROR: Error processing {'image_id': '1', 'image_urls': 'https://vignette.wikia.nocookie.net/battlegirl/images/8/86/Card_10011_s.png/revision/latest?cb=20160212023217&path-prefix=zh'}
Traceback (most recent call last):
  File "C:\Miniconda3\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Miniconda3\lib\site-packages\scrapy\pipelines\media.py", line 79, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "C:\Miniconda3\lib\site-packages\scrapy\pipelines\images.py", line 152, in get_media_requests
    return [Request(x) for x in item.get(self.images_urls_field, [])]
  File "C:\Miniconda3\lib\site-packages\scrapy\pipelines\images.py", line 152, in <listcomp>
    return [Request(x) for x in item.get(self.images_urls_field, [])]
  File "C:\Miniconda3\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "C:\Miniconda3\lib\site-packages\scrapy\http\request\__init__.py", line 58, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
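The lone "h" in the error is itself a clue: iterating over a Python string yields its characters one at a time, so code that loops over a string URL sees 'h', 't', 't', 'p', … rather than whole URLs. A minimal illustration (using a made-up stand-in URL, not one from the project):

```python
url = 'https://example.com/a.png'  # hypothetical stand-in URL

# Iterating a string yields single characters, not URLs.
first = next(iter(url))
print(first)  # → 'h' — the same 'h' that appears in the ValueError
```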

Here are the other scripts in the project:

items.py

# -*- coding: utf-8 -*-

import scrapy

class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_id = scrapy.Field()

pipelines.py

# -*- coding: utf-8 -*-

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class ZhImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

utils.py is just a helper that strips the part of an image URL that makes wikia serve a resized icon:

def get_raw_image(url):
    splitted = url.split('?')
    if len(splitted) == 2:
        return "?".join(["/".join(splitted[0].split("/")[0:-2])] + [splitted[1]])
    elif len(splitted) == 1:
        return url
    else:
        raise ValueError('Not a resized Vignette image url: %s' % url)

It seems the script is reading the URL as individual character values, but I am not sure why?

1 Answer:

Answer 0 (score: 0)

According to the documentation, item['image_urls'] is expected to be a list, but in your case you are storing a string in it. That is why, when the pipeline loops over it, it iterates over individual characters, starting with the letter h. That single character is what image_url contains when you yield the new Request, hence the error.
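In other words, the fix is a one-line change in parse: wrap the URL in a list, i.e. item['image_urls'] = [get_raw_image(icons[0])]. A small sketch of the difference, using a stand-in loop in place of the real ImagesPipeline (get_media_requests_urls below is a hypothetical helper, not Scrapy's API):

```python
def get_media_requests_urls(item):
    # Stand-in for what ZhImagesPipeline.get_media_requests iterates over.
    return [url for url in item['image_urls']]

raw = 'https://vignette.wikia.nocookie.net/battlegirl/images/8/86/Card_10011_s.png'

bad_item = {'image_urls': raw}     # a bare string: iterates characters
good_item = {'image_urls': [raw]}  # a one-element list: iterates whole URLs

print(get_media_requests_urls(bad_item)[0])   # → 'h'  (the mysterious "h" in the error)
print(get_media_requests_urls(good_item)[0])  # → the full URL
```

With the list in place, each element the pipeline sees is a complete URL with a scheme, and the ValueError goes away.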