ValueError: Missing scheme in request url: h

Date: 2017-02-14 07:15:10

Tags: python-2.7 scrapy scrapinghub

I am a beginner with Scrapy and Python. I am trying to deploy my spider code on Scrapinghub and I ran into the following error. Below is the code.

import scrapy
from bs4 import BeautifulSoup,SoupStrainer
import urllib2
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
import re
import pkgutil
from pkg_resources import resource_string
from tues1402.items import Tues1402Item

data = pkgutil.get_data("tues1402","resources/urllist.txt")
class SpiderTuesday (scrapy.Spider):     
    name = 'tuesday'
    self.start_urls = [url.strip() for url in data]
    def parse(self, response):
       story = Tues1402Item()
       story['url'] = response.url
       story['title'] = response.xpath("//title/text()").extract()
       return story

This is my spider.py code.

import scrapy
class Tues1402Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field() 

This is the items.py code, and

from setuptools import setup, find_packages
setup(
    name         = 'tues1402',
    version      = '1.0',
    packages     = find_packages(),
    entry_points = {'scrapy': ['settings = tues1402.settings']},
    package_data = {'tues1402':['resources/urllist.txt']},
    zip_safe = False,
)

this is the setup.py code.

Below is the error.

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 126, in _next_request
    request = next(slot.start_requests)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 70, in start_requests
    yield self.make_requests_from_url(url)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 57, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h

Thanks in advance.

1 Answer:

Answer 0 (score: 0):

Your error means that the url h is not a valid url. You should print out your self.start_urls and see what urls you have in there; most likely you have the string h as your first url.
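To see why, remember that pkgutil.get_data returns the whole file as a single string, and iterating over a string in Python yields one character at a time, so the first "url" Scrapy receives is just h. A quick illustration (the sample contents below are made up, not the actual urllist.txt):

data = "http://example.com\nhttp://example.org"  # stand-in for what get_data returns
urls = [url.strip() for url in data]              # iterates character by character
print(urls[:4])                                   # ['h', 't', 't', 'p']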

It looks like your spider iterates over the text itself rather than a list of urls here:

data = pkgutil.get_data("tues1402","resources/urllist.txt")
class SpiderTuesday (scrapy.Spider):     
    name = 'tuesday'
    self.start_urls = [url.strip() for url in data]

Assuming you store the urls with some separator in the urllist.txt file, you should split the data accordingly.
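A minimal sketch of that fix, assuming urllist.txt holds one url per line; the splitlines() call and the empty-line filter are illustrative additions, not quoted from the original answer:

data = pkgutil.get_data("tues1402", "resources/urllist.txt")

class SpiderTuesday(scrapy.Spider):
    name = 'tuesday'
    # split the raw file contents into lines so each entry is a full url,
    # instead of iterating over the string one character at a time;
    # also note: no self. prefix on a class-level attribute
    start_urls = [url.strip() for url in data.splitlines() if url.strip()]

If the file separates urls with commas rather than newlines, swap data.splitlines() for data.split(',').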