I'm getting this error with the Scrapy framework. This is my dmoz.py in the spiders directory:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from dirbot.items import Website

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]

    f = open("links.csv")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = Website()
            item['name'] = site.select('a/text()').extract()
            item['url'] = site.select('a/@href').extract()
            item['description'] = site.select('text()').extract()
            items.append(item)
        return items
When I run this code, I get this error:
<GET %22http://www.astate.edu/%22>: Unsupported URL scheme '': no handler available for that scheme in Scrapy
This is the content of my links.csv:
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
http://www.atsu.edu/
There are 80 URLs in links.csv. How can I fix this error?
Answer 0 (score: 4):
%22 is " urlencoded. Your CSV file probably contains lines like:

"http://example.com/"

Read the file with the csv module, or strip the "s.

Edit: as requested:

'"http://example.com/"'.strip('"')
编辑2:
import csv
from StringIO import StringIO
c = '"foo"\n"bar"\n"baz"\n' # Since csv.reader needs a file-like-object,
reader = csv.reader(StringIO(c)) # wrap c into a StringIO.
for line in reader:
    print line[0]
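
Because csv.reader handles the quoting, line[0] already comes back as foo, bar and baz, without the surrounding double quotes.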
Final edit:
import csv
with open("links.csv") as f:
    r = csv.reader(f)
    start_urls = [l[0] for l in r]
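
For completeness, a minimal sketch of how that last snippet could slot into the spider's class body (this keeps the old BaseSpider API the question uses and assumes links.csv is in the directory Scrapy is run from; parse() is unchanged):

import csv
from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]

    # csv.reader removes the surrounding quotes, so the URLs no longer
    # reach Scrapy with a leading %22 and an empty scheme.
    with open("links.csv") as f:
        start_urls = [row[0] for row in csv.reader(f) if row]

    # parse() stays exactly as in the question.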