I'm building a scraper that imports a .csv file containing one row of ~2,400 websites (each in its own column) and uses them as the start_urls. I keep getting this error saying that I'm passing in a list instead of a string. I think this is because my list is basically one long list representing the row. How can I get past this and put each website from my .csv into the list as its own separate string?
raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
exceptions.TypeError: Request url must be str or unicode, got list:
import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse
from tutorial.items import DanishItem
from scrapy.http import Request
import csv
with open('websites.csv', 'rbU') as csv_file:
    data = csv.reader(csv_file)
    scrapurls = []
    for row in data:
        scrapurls.append(row)
class DanishSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = []
    start_urls = scrapurls

    def parse(self, response):
        for sel in response.xpath('//link[@rel="icon" or @rel="shortcut icon"]'):
            item = DanishItem()
            item['website'] = response
            item['favicon'] = sel.xpath('./@href').extract()
            yield item
Thanks!
Joey
Answer 0 (score: 2)
Simply generating a list for start_urls does not work, as is clearly stated in the Scrapy documentation.

From the documentation:

    You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.

    The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in start_urls, with the parse method as the callback function for the Requests.

I would rather do it like this:

    def get_urls_from_csv():
        with open('websites.csv', 'rbU') as csv_file:
            data = csv.reader(csv_file)
            scrapurls = []
            for row in data:
                scrapurls.extend(row)  # each cell in the row is one URL string
            return scrapurls

    class DanishSpider(scrapy.Spider):

        ...

        def start_requests(self):
            return [scrapy.http.Request(url=start_url) for start_url in get_urls_from_csv()]
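start_requests can also be written as a generator, so the full list of Requests is never built in memory. A minimal sketch under the question's setup (websites.csv holding one row with one URL per cell; 'rb' because the Python 2 csv module expects binary mode):

    import csv
    import scrapy

    class DanishSpider(scrapy.Spider):
        name = "dmoz"

        def start_requests(self):
            with open('websites.csv', 'rb') as csv_file:
                for row in csv.reader(csv_file):
                    for url in row:  # each cell in the row is one URL string
                        yield scrapy.http.Request(url=url)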
Answer 1 (score: 1)
Try opening the .csv file inside the class (not outside, as you did before) and appending to start_urls. This solution worked for me. Hope it helps :-)
class DanishSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = []
    start_urls = []

    f = open('websites.csv', 'r')
    for i in f:
        u = i.split('\n')
        start_urls.append(u[0])
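Note that this reads one URL per line. Since the question's file keeps all ~2,400 URLs in a single comma-separated row, each line would also need to be split on commas. A sketch under that assumption:

    class DanishSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = []
        start_urls = []

        with open('websites.csv', 'r') as f:
            for line in f:
                for url in line.strip().split(','):  # one row, one URL per cell
                    if url:
                        start_urls.append(url)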
Answer 2 (score: 0)
In

    for row in data:
        scrapurls.append(row)

row is a list [column1, column2, ...], so I think you need to extract the columns and append each one to your start_urls.
    for row in data:
        # if every cell in the row is a URL string
        for column in row:
            scrapurls.append(column)
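Put together with the file handling from the question, that becomes something like the following (a minimal sketch; 'rb' is assumed because the question's traceback indicates Python 2, where the csv module wants binary mode):

    import csv

    with open('websites.csv', 'rb') as csv_file:
        scrapurls = []
        for row in csv.reader(csv_file):
            for column in row:  # each cell holds one URL
                scrapurls.append(column.strip())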
Answer 3 (score: 0)
Also try it this way:
    filee = open("filename.csv", "r+")

    # Remove the '\n' newline from each URL
    r = [i for i in filee]
    start_urls = [r[j].replace('\n', '') for j in range(len(r))]
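An equivalent, slightly more idiomatic version strips each line directly (str.strip() also handles a file without a trailing newline); a minimal sketch:

    with open("filename.csv") as filee:
        start_urls = [line.strip() for line in filee if line.strip()]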
Answer 4 (score: 0)
I found the following useful when I needed this:
import csv
import scrapy

class DanishSpider(scrapy.Spider):
    name = "rei"

    # assumes output.csv has a header row with a 'Link' column
    with open("output.csv", "r") as f:
        reader = csv.DictReader(f)
        start_urls = [item['Link'] for item in reader]

    def parse(self, response):
        yield {"link": response.url}
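For reference, the corresponding output.csv would look something like this (the 'Link' header is whatever column name csv.DictReader should key on; the URLs here are placeholders):

    Link
    http://example.com/
    http://example.org/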