Scrapy - importing an Excel .csv as start_urls

Date: 2014-12-17 01:34:20

Tags: python excel csv web-scraping scrapy

So I'm building a scraper that imports a .csv Excel file with a single row of ~2,400 websites (each website in its own column) and uses those as the start_urls. I keep getting this error saying that I'm passing in a list instead of a string. I think this may be because my list basically consists of one really long list that represents the row. How can I get past this and put each website from my .csv into the list as its own separate string?

raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
    exceptions.TypeError: Request url must be str or unicode, got list:
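
To see what produces this error, here is a minimal illustration (an editorial sketch, assuming a one-row websites.csv with one URL per column; the URLs are hypothetical): csv.reader yields each row as a list of column values, so appending rows builds a list of lists rather than a list of strings.

    import csv

    # hypothetical file contents: http://a.com,http://b.com,http://c.com
    with open('websites.csv', 'r') as f:
        rows = list(csv.reader(f))

    print(rows)     # [['http://a.com', 'http://b.com', 'http://c.com']]
    print(rows[0])  # the row itself is a list, not a string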


import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse
from tutorial.items import DanishItem
from scrapy.http import Request
import csv

with open('websites.csv', 'rbU') as csv_file:
  data = csv.reader(csv_file)
  scrapurls = []
  for row in data:
    scrapurls.append(row)

class DanishSpider(scrapy.Spider):
  name = "dmoz"
  allowed_domains = []
  start_urls = scrapurls

  def parse(self, response):
    for sel in response.xpath('//link[@rel="icon" or @rel="shortcut icon"]'):
      item = DanishItem()
      item['website'] = response
      item['favicon'] = sel.xpath('./@href').extract()
      yield item

Thanks!

Joey

5 Answers:

Answer 0 (score: 2)

Just generating a list for start_urls does not work, as is clearly written in the Scrapy documentation.

From the documentation:

  You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.

  The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in start_urls, with the parse method as the callback function for the Requests.

I would rather do it this way:

    def get_urls_from_csv():
        with open('websites.csv', 'rbU') as csv_file:
            data = csv.reader(csv_file)
            scrapurls = []
            for row in data:
                scrapurls.extend(row)  # extend with the URL cells, not append the row list
            return scrapurls


    class DanishSpider(scrapy.Spider):
        ...

        def start_requests(self):
            return [scrapy.http.Request(url=start_url) for start_url in get_urls_from_csv()]
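
As a follow-up sketch (an editorial addition, not part of the original answer): start_requests() can also be written as a generator, so the ~2,400 URLs are streamed straight from the file instead of being collected into an intermediate list first:

    import csv
    import scrapy

    class DanishSpider(scrapy.Spider):
        name = "dmoz"

        def start_requests(self):
            with open('websites.csv', 'r') as csv_file:
                for row in csv.reader(csv_file):
                    for url in row:  # one URL per column
                        yield scrapy.Request(url=url, callback=self.parse)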

Answer 1 (score: 1)

Try opening the .csv file inside the class (rather than outside, as before) and appending to start_urls. This solution worked for me. Hope it helps :-)

    class DanishSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = []
        start_urls = []

        f = open('websites.csv', 'r')
        for i in f:
            u = i.split('\n')
            start_urls.append(u[0])
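
A slightly hardened variant of the same idea (an editorial sketch, assuming one URL per line in the file): a with block closes the file once start_urls is built, and strip() also removes stray '\r' characters or trailing spaces:

    class DanishSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = []
        start_urls = []

        with open('websites.csv', 'r') as f:
            for line in f:
                url = line.strip()
                if url:  # skip blank lines
                    start_urls.append(url)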

Answer 2 (score: 0)

  for row in data:
    scrapurls.append(row)

row is a list [column1, column2, ...], so I think you need to extract each column and append it to start_urls.

  for row in data:
      # if every column holds a URL string
      for column in row:
          scrapurls.append(column)
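
Equivalently (an editorial sketch), this row-flattening can be done in one step with itertools.chain.from_iterable:

    import csv
    from itertools import chain

    with open('websites.csv', 'r') as csv_file:
        # chain the rows together so every cell lands in one flat list of URL strings
        scrapurls = list(chain.from_iterable(csv.reader(csv_file)))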

Answer 3 (score: 0)

Try this way as well,

filee = open("filename.csv", "r+")

# Remove the '\n' newline from each url
start_urls = [i.replace('\n', '') for i in filee]
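
One usage note (an editorial addition): the file handle above is never closed; a with block builds the same list and releases the handle automatically:

    with open("filename.csv", "r") as filee:
        start_urls = [line.replace('\n', '') for line in filee]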

Answer 4 (score: 0)

I found the following useful when in need:

import csv
import scrapy

class DanishSpider(scrapy.Spider):
    name = "rei"
    with open("output.csv","r") as f:
        reader = csv.DictReader(f)
        start_urls = [item['Link'] for item in reader]

    def parse(self, response):
        yield {"link":response.url}
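
For this to work, output.csv needs a header row containing a Link column, e.g. (hypothetical contents):

    Link
    http://example.com
    http://example.org

csv.DictReader maps each data row to a dict keyed by the header names, so item['Link'] pulls the URL column.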