&lt;GET %22http://www.astate.edu/%22&gt;: Unsupported URL scheme '': no handler available for that scheme in Scrapy

Asked: 2012-11-08 09:41:31

Tags: python web-crawler scrapy

I am hitting this error in the Scrapy framework. Here is my dmoz.py in the spiders directory:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from dirbot.items import Website


class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    f = open("links.csv")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []

        for site in sites:
            item = Website()
            item['name'] = site.select('a/text()').extract()
            item['url'] = site.select('a/@href').extract()
            item['description'] = site.select('text()').extract()
            items.append(item)

        return items

Running this code produces this error:

<GET %22http://www.astate.edu/%22>: Unsupported URL scheme '': no handler available for that scheme in Scrapy

Here are the contents of my links.csv:

http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/

There are 80 URLs in links.csv. How do I fix this error?

1 Answer:

Answer 0 (score: 4):

%22 is `"` url-encoded. Your CSV file probably contains lines like this:

"http://example.com/"
  1. Use the csv module to read the file, or
  2. strip the `"`s.

  Edit, as requested:

    '"http://example.com/"'.strip('"')
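Applied to loading the spider's start URLs, the strip approach might look like the following sketch (the `lines` list here stands in for `f.readlines()` on the asker's links.csv):

```python
# Simulated output of f.readlines() on a links.csv whose rows are quoted.
lines = ['"http://www.atsu.edu/"\n', '"http://www.astate.edu/"\n']

# strip() removes whitespace/newlines, strip('"') the surrounding quotes;
# a line without quotes passes through unchanged.
start_urls = [line.strip().strip('"') for line in lines]
print(start_urls)  # -> ['http://www.atsu.edu/', 'http://www.astate.edu/']
```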
    

    编辑2:

    import csv
    from StringIO import StringIO
    
    c = '"foo"\n"bar"\n"baz"\n'      # Since csv.reader needs a file-like-object,
    reader = csv.reader(StringIO(c)) # wrap c into a StringIO.
    for line in reader:
        print line[0]
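For readers on Python 3 (the snippet above is Python 2), the same sketch with `StringIO` moved into the `io` module and `print` as a function:

```python
import csv
import io

c = '"foo"\n"bar"\n"baz"\n'          # csv.reader needs a file-like object,
reader = csv.reader(io.StringIO(c))  # so wrap the string in io.StringIO
for row in reader:
    print(row[0])  # quotes are already stripped: foo, bar, baz
```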
    

    Final edit:

    import csv
    
    with open("links.csv") as f:
        r = csv.reader(f)
        start_urls = [l[0] for l in r]
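As to why the error message shows an empty scheme: a leading `"` is not a valid scheme character, so URL parsing finds no scheme at all, and the quote shows up percent-encoded as %22 in the request. A quick check with the standard library (Python 3 shown):

```python
from urllib.parse import urlparse

# The stray quote prevents 'http' from being recognized as the scheme.
print(urlparse('"http://www.astate.edu/"').scheme)  # -> '' (empty)
print(urlparse('http://www.astate.edu/').scheme)    # -> 'http'
```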