I am trying to get Scrapy to download images from a table on a wikia site, but it keeps giving me "ValueError: Missing scheme in request url: h" when I run the project from the command line.
zhimagespider.py
# -*- coding: utf-8 -*-
import scrapy
from zh.pipelines import ZhImagesPipeline
from zh.items import ImageItem
from utils import get_raw_image


class ZhImageSpider(scrapy.Spider):
    name = 'zh'
    allowed_domains = ['https://zh.battlegirl.wikia.com',
                       'https://vignette.wikia.nocookie.net/']
    start_urls = ['https://zh.battlegirl.wikia.com/wiki/%E5%8D%A1%E7%89%87%E4%B8%80%E8%A6%BD']

    def parse(self, response):
        for row in response.xpath("//tr")[2:]:
            # Initialize dictionary
            item = ImageItem()
            item['image_id'] = row.xpath('td[1]/text()').extract_first()
            # Get icons
            icons = row.css('td:nth-child(2)').xpath('.//@src').extract()
            for icon in icons:
                if icon.startswith('d'):  # Or 'data'
                    icons.remove(icon)
            item['image_urls'] = get_raw_image(icons[0])
            yield item
Traceback sample
2017-10-28 22:48:37 [scrapy.core.scraper] ERROR: Error processing {'image_id': '1', 'image_urls': 'https://vignette.wikia.nocookie.net/battlegirl/images/8/86/Card_10011_s.png/revision/latest?cb=20160212023217&path-prefix=zh'}
Traceback (most recent call last):
  File "C:\Miniconda3\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Miniconda3\lib\site-packages\scrapy\pipelines\media.py", line 79, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "C:\Miniconda3\lib\site-packages\scrapy\pipelines\images.py", line 152, in get_media_requests
    return [Request(x) for x in item.get(self.images_urls_field, [])]
  File "C:\Miniconda3\lib\site-packages\scrapy\pipelines\images.py", line 152, in <listcomp>
    return [Request(x) for x in item.get(self.images_urls_field, [])]
  File "C:\Miniconda3\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "C:\Miniconda3\lib\site-packages\scrapy\http\request\__init__.py", line 58, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
Here are the other scripts in the project:
items.py
# -*- coding: utf-8 -*-
import scrapy


class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_id = scrapy.Field()
pipelines.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem


class ZhImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
utils.py is just a helper that strips the part of the image URL that resizes icons on wikia:
def get_raw_image(url):
    splitted = url.split('?')
    if len(splitted) == 2:
        return "?".join(["/".join(splitted[0].split("/")[0:-2])] +
                        [splitted[1]])
    elif len(splitted) == 1:
        return url
    else:
        raise ValueError('Not a resized Vignette image url: %s' % url)
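As a sanity check, here is what get_raw_image does to the URL from the traceback (the definition from utils.py is repeated so the snippet runs standalone; the input URL is the one logged in the error above):

```python
def get_raw_image(url):
    # Same logic as utils.py: drop the trailing "/revision/latest" path
    # segments that Vignette uses for resized images.
    splitted = url.split('?')
    if len(splitted) == 2:
        return "?".join(["/".join(splitted[0].split("/")[0:-2])] +
                        [splitted[1]])
    elif len(splitted) == 1:
        return url
    else:
        raise ValueError('Not a resized Vignette image url: %s' % url)


resized = ("https://vignette.wikia.nocookie.net/battlegirl/images/8/86/"
           "Card_10011_s.png/revision/latest?cb=20160212023217&path-prefix=zh")
print(get_raw_image(resized))
# https://vignette.wikia.nocookie.net/battlegirl/images/8/86/Card_10011_s.png?cb=20160212023217&path-prefix=zh
```

So the helper returns a single, valid URL string; the error comes from how that string is stored on the item, not from the URL itself.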
It seems the script is reading the URL character by character, but I am not sure why?
Answer 0 (score: 0)

item['image_urls'] needs to be a list, but in your case you are storing it as a string. That is why, when the pipeline loops over it, it iterates over the individual characters, starting with the letter h. That is what image_url contains when you yield the new Request, hence the error.
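A minimal sketch of why this happens and of the one-line fix (the URL here is the one from the traceback; no Scrapy needed to demonstrate the iteration behaviour):

```python
url = ("https://vignette.wikia.nocookie.net/battlegirl/images/8/86/"
       "Card_10011_s.png")

# Broken: assigning a bare string, then iterating it in the pipeline,
# yields one character at a time -- the first is 'h', which is exactly
# the "Missing scheme in request url: h" from the traceback.
broken = url
print(next(iter(broken)))  # h

# Fixed: wrap the single URL in a list before assigning it to the item,
# i.e. in the spider:  item['image_urls'] = [get_raw_image(icons[0])]
fixed = [url]
print(next(iter(fixed)) == url)  # True
```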